Taming COVID-19 Statistics to Reflect Happiness Score Metrics

Permissions

Place an X in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

Overview

In responding to the global pandemic caused by COVID-19, most countries have implemented different strategies to protect their people's health. A government's response to the pandemic may partially reflect its general attitude toward its people. Vaccination is a common and leading protection strategy adopted by most countries. In this project, we use vaccination and death data by country to explore the relationship between the set of {nation-wide onset date of vaccination, average new vaccination/death rate across different time spans} and various metrics of the 2021 happiness score. In particular, we try to determine to what extent our independent variables, namely all COVID-19 related data, are correlated with these metrics (i.e. social support, healthy life expectancy, perception of corruption, and generosity). Our analysis concludes that while some metrics, such as perception of corruption and generosity, do not correlate significantly with the COVID-19 data, social support and healthy life expectancy do. Further, within the COVID-19 data themselves, a higher average new vaccination rate does not necessarily come with a decreasing trend in the average new death rate globally. An early onset of vaccination shows a negative correlation with the death rate in developed countries and, ironically, a positive correlation with the death rate in developing countries.

Names

Research Question

How are the onset of vaccination, the vaccination rates, and the death rate in a country related to the happiness score of that country?

Background & Prior Work

In response to the rapid spread of COVID-19, many countries have implemented social distancing, mandatory quarantine, and reopening guidance measures to reduce the rate of transmission. As a result, many people’s lifestyles have changed greatly under the restrictions enforced by their governments. They were forced to work remotely from home, were limited to certain activities, and lived in fear of the virus. The skyrocketing infection and death rates take a toll on people not only physically but also mentally. In addition, the economic recession resulting from the COVID-19 pandemic has negatively impacted people’s mental health. People experience feelings of stress, frustration, anger, anxiety, or loneliness. According to “The Implications of COVID-19 for Mental Health and Substance Use” (reference 1), 4 in 10 adults in the U.S. have reported symptoms of anxiety or depressive disorder.

As such, people around the world have been waiting for vaccines to defeat the COVID-19 pandemic, hoping they will bring life back to normal. Hence, we are curious to investigate the relationship between three factors (the onset of vaccination, the rate of increase in vaccination, and the death rate) and the different metrics of the happiness score (social support, healthy life expectancy, generosity, and perceptions of corruption) of each country. We want to discover how the importance a country attaches to vaccines, and the speed at which they are rolled out, affect the COVID-19 death rate and, in turn, the happiness level of the country's people.

There have been studies on the relationship between the COVID-19 pandemic and the different metrics of a country's happiness score. The World Happiness Report conducted research on “Happiness, trust, and deaths under COVID-19”, which compares the overall life evaluations and measures of positive and negative emotions of different countries in 2020 with the pre-pandemic data for 2017-2019 (reference 2). The research reported life evaluation scores for each country based on its average income, life expectancy, and four social factors. The researchers compared the happiness rankings (average life evaluation) of different countries in the 2020 survey with those in 2017-2019 and found that COVID-19 led to only a modest change in the overall rankings (reference 2). The overall average happiness score changed insignificantly, from 5.81 in 2017-19 to 5.85 in 2020 (reference 3).

Even though the study found no significant difference in the overall happiness score before and after the onset of COVID-19, it did not investigate how the happiness score varies with the onset of vaccination, the rate of increase in vaccination, and the COVID-19 death rate. The goal of our project is to use data with a detailed time frame to investigate how the happiness score changes along with these three crucial factors. By analyzing and visualizing the data across time spans, we hope to determine whether there are clear relationships between these three variables and the happiness score, and to investigate how they are related.

References (include links):

Hypothesis

Hypothesis:

An earlier onset and a faster increasing rate of vaccination, which possibly correlates with a lower death rate over time, result in a potentially higher happiness score.

Defense:

Dataset(s)

We plan to join the death dataset with the vaccination progress and population datasets. We then calculate the death and vaccination rates over certain time intervals to summarize this time-series data for each country. Finally, we join the summarized data with the World Happiness Report to perform EDA and analyze possible correlations.

Setup

First, let's configure the IPython notebook display and import (and, if needed, install) all dependencies required to run this notebook. If some packages are missing, pip will install them automatically. After the pip install is done, restart the kernel, rerun the cells in the setup, and you are ready to go! The setup takes longer the first time you run it because of package installation and dataset download. The datasets are cached locally, so you won't need to download them again after the first time.

The script below downloads all the datasets needed to run this notebook. Make sure you have your kaggle.json ready (see README.md for more details).

Data Cleaning

For the first dataset, we extract the information on vaccinations and deaths in each country. total_vaccinations, people_vaccinated, people_fully_vaccinated, new_vaccinations, and new_vaccinations_smoothed describe the vaccinated population in each country, while total_deaths and new_deaths describe the cumulative death count and the number of new deaths on each date. For the second dataset, Country name is kept for merging the datasets. We select Social support, Healthy life expectancy, Freedom to make life choices, Generosity, and Perceptions of corruption because these are reasonable metrics of happiness within these countries. For the third dataset, Country (or dependency) is kept for merging. We keep Population (2020), Density (P/Km²), and Med. Age as a description of the overall population of a particular country. Because our target variable is happiness, we started with the happiness dataset and merged it with the population dataset. We then merged the result with the vaccination dataset.
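The merge order described above can be sketched as follows. This is a minimal illustration on made-up miniature tables, not the notebook's actual loading code: the country labels and numbers are placeholders, the column names are the ones mentioned in the text, and both the inner-join choice and the per-million rate are assumptions.

```python
import pandas as pd

# Toy stand-ins for the three datasets (all values are made up for illustration).
happiness = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Social support": [0.9, 0.7, 0.5],
})
population = pd.DataFrame({
    "Country (or dependency)": ["A", "B"],
    "Population (2020)": [1_000_000, 2_000_000],
})
covid = pd.DataFrame({
    "location": ["A", "A", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-03-05"]),
    "new_deaths": [10.0, 12.0, 3.0],
})

# Start from happiness (the target), attach population, then the COVID-19 time series.
# Assumption: inner joins, so a country missing from any table (here "C") is dropped.
merged = happiness.merge(
    population, left_on="Country name", right_on="Country (or dependency)", how="inner"
).drop(columns="Country (or dependency)")
merged = merged.merge(covid, left_on="Country name", right_on="location", how="inner")

# Assumption: express daily deaths relative to population (per million people).
merged["new_deaths_per_million"] = (
    merged["new_deaths"] / merged["Population (2020)"] * 1_000_000
)
print(merged[["Country name", "date", "new_deaths_per_million"]])
```

Inner joins silently drop countries that are absent from any one table, which is one way the excluded countries discussed in the EDA can arise.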

Since our research question asks how the onset of vaccination, the vaccination rates, and the death rate in a country relate to the happiness score of that country, we need to determine the onset date for each country before doing further analysis.

To do so, we use the first date on which each country has a non-NaN value of new_vaccinations_smoothed, since this column has the fewest missing values. We select only the date, location, and new_vaccinations_smoothed columns, save them to a temporary DataFrame, and group it by location, because we are interested in the onset date for each country.

After inspecting the dataset, we found that the csv file uses NaN for dates on which a country has not yet vaccinated anyone. Thus, the first date with a non-NaN value marks the start of vaccination. We applied a lambda function to drop all the NaN values and find the minimum date (the earliest date, which is the onset date).

We have saved those dates into a dictionary where the key-value pairs are the countries and their corresponding onset date.
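The onset-date extraction described above could look roughly like the sketch below. It uses a made-up five-row table, and it drops NaN rows before grouping rather than inside a lambda, which should give the same result; treat it as an illustration, not the notebook's exact cell.

```python
import numpy as np
import pandas as pd

# Toy data: country "B" has no vaccination records at all (all values are made up).
df = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-02-10", "2021-02-11"]
    ),
    "new_vaccinations_smoothed": [np.nan, 5.0, 9.0, np.nan, np.nan],
})

# Keep only the columns needed, drop rows without data, take the earliest date per country.
tmp = df[["date", "location", "new_vaccinations_smoothed"]]
onset_date = (
    tmp.dropna(subset=["new_vaccinations_smoothed"])
    .groupby("location")["date"]
    .min()
    .to_dict()
)
print(onset_date)
```

Countries with no non-NaN value simply do not appear in the dictionary, so downstream code has to handle missing keys.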

To calculate the mean vaccination rate across {7, 15, 30, 60, 90, 180, 360} days, we need the onset date of vaccination in each country, from which we calculate the average number of vaccinations administered.

Therefore, we initially set the onset date in each country to the first date with available new_vaccinations (daily vaccinations) data.

However, we noticed that there are exceptions. For instance, Denmark recorded only 2 vaccinations in roughly the first 15 days and then over 1000 daily vaccinations in the following days. In such cases, the first recorded date is not the date on which vaccinations actually became available to the public. Therefore, a more robust definition of the onset date is needed to handle these exceptions.

As a result, we calculated the day-over-day ratio of new_vaccinations. If the number of new vaccinations on any day is more than 100 times that of the previous day, we flag the country as an exception that possibly has an anomaly in new_vaccinations.

After taking a closer look at the flagged countries in the original data table, we found that Jamaica shows a huge number of vaccinations on its onset date because it has only a few records. We therefore treat it as an outlier and remove Jamaica from our onset_date dictionary.

For the other countries that showed a significant gap between two consecutive dates of new vaccinations (an increase of over 100 times), we reset the onset date so that it represents when vaccinations were actually rolled out to the public.
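A minimal sketch of the 100-times rule is shown below. The helper name reset_onset is hypothetical, and the choice to move the onset to the first day that exceeds the threshold is an assumption, since the text does not say which day is used when several jumps occur.

```python
import pandas as pd

def reset_onset(daily: pd.Series, threshold: float = 100.0) -> pd.Timestamp:
    """Return an onset date for one country's daily new_vaccinations series.

    `daily` must be indexed by date, sorted ascending, with NaNs removed.
    If some day is more than `threshold` times the previous day, the onset is
    moved to the first such day (assumption); otherwise the first date is kept.
    A previous-day value of 0 yields an infinite ratio and counts as a jump.
    """
    ratio = daily / daily.shift(1)
    jumps = ratio[ratio > threshold]
    return jumps.index[0] if len(jumps) else daily.index[0]

# Toy series shaped like the Denmark example: a trickle, then a real rollout.
trickle = pd.Series(
    [1.0, 1.0, 2.0, 1500.0, 1600.0],
    index=pd.date_range("2021-01-01", periods=5),
)
steady = pd.Series([100.0, 120.0, 130.0], index=pd.date_range("2021-02-01", periods=3))

print(reset_onset(trickle))  # moved to the day of the jump
print(reset_onset(steady))   # unchanged: first date of the series
```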

To further explore the vaccination increase in each country, we group our vaccination and death data by country and, for each country, take the average daily vaccination rate and death rate over the first {7, 15, 30, 60, 90, 180, 360} days starting from the vaccination onset date.
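The windowed averages can be sketched for a single country as follows. The helper window_means is a hypothetical name, and averaging over whatever data exists when a series is shorter than the window is an assumption.

```python
import pandas as pd

WINDOWS = [7, 15, 30, 60, 90, 180, 360]

def window_means(daily: pd.Series, onset: pd.Timestamp, windows=WINDOWS) -> dict:
    """Mean of a daily series over the first w days starting at the onset date.

    `daily` must have a sorted DatetimeIndex. Label-based slicing is inclusive
    on both ends, so the window ends w - 1 days after the onset.
    """
    out = {}
    for w in windows:
        end = onset + pd.Timedelta(days=w - 1)
        out[w] = daily.loc[onset:end].mean()
    return out

# Toy series: 30 days with values 1..30 (made up for illustration).
daily = pd.Series(
    [float(v) for v in range(1, 31)],
    index=pd.date_range("2021-01-01", periods=30),
)
means = window_means(daily, pd.Timestamp("2021-01-01"))
print(means)
```

In the full pipeline this would be applied per country after grouping, once for the vaccination series and once for the death series.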

Data Analysis & Results

EDA

*Note 1: Many visualizations are animated to demonstrate how patterns change across time intervals, and some are interactive. For the best visualization and interaction, we strongly recommend running the notebook in an IPython environment. Alternatively, you may find *.gif files inside the visualization directory, which are saved versions of the animated visualizations in this notebook.

*Note 2: Jupyter Notebook doesn't work well with matplotlib in notebook mode. In particular, the notebook may compress some output visualization cells into a scrollable format, which negatively impacts the viewing experience. We've tried to change this behavior programmatically, but it still happens sometimes. When it occurs, select the cell. Then in the menu, click the following sequence: Cell -> Current Output -> Toggle Scrolling

Most countries that are available in the geopandas package are taken into account in our model. The countries excluded from our model are indicated by the light blue color on the map; most of them are developing countries in Africa.

Possible reasons are that these countries lack the resources to collect data, that their data are not publicly available, or that their population base is too small. Because these countries are mostly developing ones, they might have higher death rates and lower vaccination rates. Throughout the analysis, we need to keep in mind the potential biases and assumptions introduced here. When proceeding with the analysis and the discussion, we will also make sure that our analysis does not discriminate against those countries.

*The above figure is static when the code cell is properly run.

Plotting the overall distributions of the happiness metrics (social support, healthy life expectancy, perception of corruption, and generosity) gives us a better idea of what they look like. In our graph, we observe a strongly left-skewed distribution for both Social Support and Perceptions of Corruption, a left-skewed distribution for Healthy Life Expectancy, and a roughly normal distribution for Generosity (centered around 0). People in most countries are satisfied with their social support and have a healthy life expectancy above 65, but they tend to believe that their countries have a high degree of corruption, and they show neutral generosity.

*The above figure is static when the code cell is properly run.

In the figure below, the top-left graph displays the overall distribution of average new vaccination rates across different time spans (i.e. 7, 15, 30, 60, 90, 180, 360 days). The distributions are strongly right-skewed, with counts falling off roughly exponentially as the rate increases. Since most countries started from a small or zero vaccinated population, vaccination rates cluster at low values in the first seven days. The top-right graph visualizes the average new death rate across the same time spans. We also observe a low death rate at the beginning, followed by an increasing trend over the subsequent time spans. To make the distributions more symmetric, we take the logarithm of both the average new vaccination rate and the average new death rate. After the transformation, both are roughly normally distributed.

This lets us perceive changes in the new vaccination rate and death rate more clearly. Observing the distributions across the time spans, we see that both the average new vaccination rate and the average new death rate gradually shift from left to right. The log vaccination rate loses its symmetry as its mass shifts toward higher vaccination rates, since the majority of countries have started to vaccinate on a large scale. The log death rate also shifts toward higher values, clearly indicating an increase in the death rate. A potential reason for the death rate increasing despite the rising vaccination rate is the wide spread and fast transmission of the disease.

*The above figure is animated when the code cell is properly run.

In the figure below, we examine how the onset date of vaccination in each country relates to its happiness score metrics. Since we consider 4 happiness metrics, there are 4 scatter plots, each with a least-squares regression line. Looking at each graph, we observe a slight positive slope for perception of corruption and for generosity against the number of days since the earliest onset, and a moderate negative slope for social support and for healthy life expectancy against the number of days since the earliest onset. This roughly suggests that the earlier a country started vaccinating, the higher its social support and healthy life expectancy and the lower its perceived corruption; generosity is the exception, rising slightly with a later onset.

Moreover, each regression plot is paired with a scatter plot in which colors represent each country's geographic region, which gives a better sense of the order of onset dates for each cluster of countries. In these plots, we can see that countries in Europe, North America, and East Asia have the earliest onset dates of COVID-19 vaccination. This makes sense because the United States (North America), the United Kingdom (Europe), and China (East Asia) were among the earliest countries to develop vaccines. We can also see that the developed countries in Europe, which were also among the earliest to start vaccinating, have low perceived-corruption scores. Our prior knowledge therefore matches the data and supports the plausibility of our observations.

*The above figure is static when the code cell is properly run.

We next investigate the correlation between the happiness metrics (perceptions of corruption, social support, healthy life expectancy, and generosity) and the average new vaccination rate over time. There is no clear linear pattern between any of the four happiness metrics and the average new vaccination rate in the 7-day time span, but a linear pattern gradually emerges in the longer time spans. Similar to the linear pattern in the graph above, we observe a positive slope between the perception of corruption and the number of days since the earliest onset and between generosity and the number of days since the earliest onset, and a negative slope between social support and the number of days since the earliest onset and between healthy life expectancy and the number of days since the earliest onset. Both the positive and negative correlations get stronger as the time span for the average new vaccination rate measurement increases. This makes intuitive sense: as the time span increases, our new vaccination rate data better represent a country’s long-term vaccination condition. (Countries with a small vaccinated population could see a huge jump in new vaccinations in the first few days; this sharp increase is smoothed out over a longer time span.) Thus, the relationship becomes clearer.

*The above figure is animated when the code cell is properly run.

We next investigate the correlation between the happiness metrics (perceptions of corruption, social support, healthy life expectancy, and generosity) and the average new death rate over time. There is no clear linear pattern between any of the four happiness metrics and the average new death rate across the time spans (7, 15, 30, 60, 90, 180, 360 days). Although we cannot draw a clear regression line for any of the distributions, it is still possible to observe slight linear trends. We observe a slight positive slope between the average new death rate and each of perception of corruption, social support, and healthy life expectancy, and a slight negative slope between generosity and the average new death rate. In other words, a higher death rate is associated with high healthy life expectancy and social support. A likely explanation is that, in the early days of COVID-19, the disease was most widespread in North America and Western Europe, regions with relatively high healthy life expectancy and social support scores. The massive death toll there therefore ties the overall death rate to high healthy life expectancy and social support scores.

*The above figure is animated when the code cell is properly run.

The graph below displays the relationship between the average new vaccination rate and the average new death rate at different time intervals. The new vaccination rate clearly increases across the time intervals as countries gradually vaccinate on a larger scale. Although the vaccination rate shows an increasing trend, the death rate shows no visible change.

*The above figure contains interaction when the code cell is properly run.

Finally, when we break down the death rate by regional indicator and plot it against the number of days since the onset of vaccination, we find a thought-provoking result: regions with mostly developed countries show a decreasing trend (that is, the longer vaccination has been under way, the lower the average new death rate), whereas regions with mostly developing countries show an increasing trend in the average new death rate. The reasons for this trend are unknown given the data we have; possible explanations are discussed later in the project. These diverging results explain why the overall (i.e. global) death rate shows no visible change as we move from 7 days since onset all the way to 360 days since onset: data from different regions balance each other out.

*The above figure contains interaction when the code cell is properly run.

Statistical Modeling

Now that we have visually analyzed the relationships between our variables, we want to establish some conclusions statistically. Since we observed similar scatter plot trends between all happiness metrics across the different time spans, we decided to perform a Principal Component Analysis (PCA) on the average new vaccination rate and the average new death rate to reduce their dimensionality (and thus attempt to exclude the effect of multicollinearity on our model). According to the cumulative explained variance plot below, around 95 percent of the variance can be explained by one component of the average new death rate and two components of the average new vaccination rate. Thus, we can reduce the 7 dimensions of the average new death rate to 1 and the 7 dimensions of the average new vaccination rate to 2.
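To make the 95 percent criterion concrete, the sketch below computes the cumulative explained variance with a plain NumPy SVD on synthetic data. It is a stand-in for whichever PCA implementation the notebook uses; the data, the helper names, and the 0.95 target as a function argument are assumptions for illustration.

```python
import numpy as np

def cumulative_explained_variance(X: np.ndarray) -> np.ndarray:
    """Cumulative share of variance captured by the leading principal components."""
    Xc = X - X.mean(axis=0)                      # PCA operates on centered data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2                                 # proportional to component variances
    return np.cumsum(var) / var.sum()

def n_components_for(X: np.ndarray, target: float = 0.95) -> int:
    """Smallest number of components whose cumulative variance reaches `target`."""
    cev = cumulative_explained_variance(X)
    return int(np.searchsorted(cev, target) + 1)

# Synthetic example: 7 highly collinear columns, mimicking rates over 7 time spans.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
weights = np.linspace(1.0, 2.0, 7)
X = base * weights + 0.05 * rng.normal(size=(100, 7))

cev = cumulative_explained_variance(X)
n_needed = n_components_for(X)
print(cev.round(4), n_needed)
```

When the seven window averages move together, as in this toy matrix, nearly all the variance sits in the first component, which is the situation the text describes for the death rate.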

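By applying the dimension reduction to our DataFrame, we are now able to use only four features: VRD1 and VRD2 (the first two principal components of the average new vaccination rate), DRD1 (the first principal component of the average new death rate), and PROMPTITUDE (the number of days since the earliest vaccination onset date).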

We will then analyze their correlation to the happiness metrics and proceed to the modeling steps. For modeling, we used $\alpha = 0.01$ for significance.

We build an OLS regression model to examine the relationship between the independent variables (VRD1, VRD2, DRD1, and PROMPTITUDE) and the dependent variable Social Support. The R-squared score is moderately high, with $R^2 = 0.545$. This means that 54.5% of the variation in Social Support can be explained by these independent variables.

In the summary, the p-values for VRD1, VRD2, and DRD1 are all less than $\alpha = 0.01$, so these variables are statistically significant. However, PROMPTITUDE is not statistically significant at this level, with a p-value of 0.258. This is possibly because the other variables already explain the variation well.
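For readers who want to see what the R-squared figure measures, here is a small NumPy sketch of an ordinary least squares fit and $R^2 = 1 - SS_{res}/SS_{tot}$ on synthetic data. It is not the code that produced the summaries in this section, it does not compute p-values, and the four synthetic columns only stand in for VRD1, VRD2, DRD1, and PROMPTITUDE.

```python
import numpy as np

def ols_r_squared(X: np.ndarray, y: np.ndarray):
    """Fit y = b0 + X @ b by least squares; return (coefficients, R-squared)."""
    A = np.column_stack([np.ones(len(X)), X])        # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)                    # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())      # total variation
    return beta, 1.0 - ss_res / ss_tot

# Synthetic stand-in for four predictors and one happiness metric.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y_exact = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 2]
y_noisy = y_exact + rng.normal(scale=1.5, size=120)

beta_exact, r2_exact = ols_r_squared(X, y_exact)
beta_noisy, r2_noisy = ols_r_squared(X, y_noisy)
print(round(r2_exact, 6), round(r2_noisy, 3))
```

The noise-free target gives an R-squared of essentially 1, while added noise lowers it, which is the sense in which a value such as 0.545 leaves a sizable share of the variation unexplained.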

We build an OLS regression model to examine the relationship between the independent variables (VRD1, VRD2, DRD1, and PROMPTITUDE) and the dependent variable Healthy_life_expectancy. The R-squared score is the highest among all the OLS models, with $R^2 = 0.677$. This means that 67.7% of the variation in Healthy_life_expectancy can be explained by these independent variables.

In the summary, the p-values for VRD1, VRD2, and DRD1 are all less than $\alpha = 0.01$, so these variables are statistically significant. However, PROMPTITUDE is not statistically significant at this level, with a p-value of 0.088.

We build an OLS regression model to examine the relationship between the independent variables (VRD1, VRD2, DRD1, and PROMPTITUDE) and the dependent variable Perceptions_of_corruption. The R-squared score is the lowest, with $R^2 = 0.114$. This means that only 11.4% of the variation in Perceptions_of_corruption can be explained by these independent variables.

In the summary, the p-values for VRD1, VRD2, and PROMPTITUDE are all greater than $\alpha = 0.01$, so these variables are NOT statistically significant. However, DRD1 is statistically significant, with a p-value of 0.003.

We build an OLS regression model to examine the relationship between the independent variables (VRD1, VRD2, DRD1, and PROMPTITUDE) and the dependent variable Generosity. The R-squared score is similar to the one above, with $R^2 = 0.124$, which corresponds to a weak fit: only 12.4% of the variation in Generosity can be explained by these independent variables.

In the summary, only DRD1 is statistically significant; the p-values for the rest of the independent variables are all greater than $\alpha = 0.01$. Thus, we cannot conclude that VRD1, VRD2, or PROMPTITUDE is correlated with Generosity.

To test the part of our hypothesis that an earlier onset and a faster increasing rate of vaccination correlate with a lower death rate over time, we build an OLS regression model with the independent variables VRD1 (vaccination rate) and PROMPTITUDE (days since the first vaccination onset date) and the dependent variable DRD1 (death rate). The R-squared score is moderate, with $R^2 = 0.322$, which means that only 32.2% of the variation in DRD1 can be explained by these independent variables.

In the summary, the p-value for PROMPTITUDE is less than $\alpha = 0.01$, so the correlation between PROMPTITUDE and DRD1 is statistically significant.

However, the p-value for VRD1 is greater than $\alpha = 0.01$, so we cannot conclude that there is a correlation between VRD1 and DRD1.

Given that the OLS model for healthy life expectancy has the highest R-squared of all, we let this metric serve as the “happiness score” proposed at the beginning of the project and build our regression models on it. Because our four independent variables (VRD1, VRD2, DRD1, and PROMPTITUDE) are not on the same scale and contain outliers (as illustrated in the EDA), we standardize these features using RobustScaler, which scales the data according to the interquartile range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

We selected SVR, LinearRegression, and RandomForestRegressor as our models, each of which represents a distinct paradigm for regression. SVR is a support-vector based regressor that can map the original features into a higher-dimensional space and finds a linear function that fits the data within a margin of error. Linear regression is a simple linear model that finds its parameters from the training data with an analytical solution (i.e. X@w=Y). Random forest regressor, on the other hand, is a tree-based ensemble that builds a number of decision trees, each fitted on a subsample of the training data to minimize its prediction error.

Our experiments show that RandomForestRegressor gives the best R-squared score, 0.746, whereas LinearRegression and SVR did not perform as well. However, since our training and validation data are both small, these results might be unstable and could differ if a different split strategy were adopted or more data were used. (Unfortunately, there are only a limited number of countries in the world, so this is essentially a few-shot learning problem, and we believe that it is unethical to train such a model with arbitrarily generated or augmented data.)
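The quartile-based scaling mentioned above can be illustrated with a few lines of NumPy. This sketch mirrors what RobustScaler does with its default settings (subtract the median, divide by the interquartile range); the helper name robust_scale and the toy column are assumptions, and the notebook itself uses the scikit-learn class.

```python
import numpy as np

def robust_scale(X: np.ndarray) -> np.ndarray:
    """Center each column on its median and scale by its interquartile range."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    iqr = np.where(iqr == 0, 1.0, iqr)   # avoid division by zero for constant columns
    return (X - median) / iqr

# One toy feature with a single large outlier (values are made up).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaled = robust_scale(X)
print(scaled.ravel())
```

Because the median and quartiles barely move when one value is extreme, the four ordinary points keep a sensible spread while the outlier stays visibly far away, instead of compressing everything else as mean and standard deviation scaling would.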

Latent Variable Modeling

Because we observed in the EDA that the regional indicator is a key factor in determining the approximate range of days since earliest onset (i.e. promptitude) and healthy life expectancy (see figure 4), we believe that some underlying (i.e. latent) variables exist such that the data points belonging to each latent variable follow some parameterized distribution. Here, we assume that these distributions are all Gaussian (there are better ways to approximate a parameterized distribution, but given the scope of the class, we think assuming Gaussians is sufficient to make the case, even for extra credit). Put simply, we wonder whether we can recover the regional indicators without explicitly using them.

Therefore, we use a Gaussian Mixture Model to fit and predict the cluster labels of our data. To validate that these cluster labels actually resemble regional indicators, we use the normalized mutual information score to compare two quantities: the score between regional indicator labels and clustering labels, and the score between regional indicator labels and randomly assigned labels. The results show a significantly higher normalized mutual information score for clustering labels than for random labels at any number of components; the highest score is achieved at 6 components. At this stage, we are confident that a Gaussian mixture model approximately clusters data points according to their regional indicators, but we need to dig deeper into the results to see why 6 components work better than 10 (the regional indicators have a total of 10 distinct categories). To analyze this, we use t-SNE to visualize the difference between clustering labels and actual regional indicator labels.
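The validation idea, comparing cluster labels and random labels against the true grouping, can be sketched with scikit-learn on synthetic blobs. The three well-separated groups below stand in for regional indicators; the data, the number of components, and the seeds are assumptions and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.mixture import GaussianMixture

# Synthetic data: three "regions", each a Gaussian blob in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
region = np.repeat(np.arange(3), 40)
X = centers[region] + rng.normal(scale=1.0, size=(120, 2))

# Fit the mixture without ever showing it the region labels.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(X)
cluster = gmm.predict(X)

# Baseline: labels assigned uniformly at random.
random_labels = rng.integers(0, 3, size=len(X))

nmi_cluster = normalized_mutual_info_score(region, cluster)
nmi_random = normalized_mutual_info_score(region, random_labels)
print(round(nmi_cluster, 3), round(nmi_random, 3))
```

Normalized mutual information is invariant to how the cluster ids are numbered, which is why it suits this comparison: cluster 0 does not need to correspond to any particular region label.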

This t-SNE result gives us sufficient insight into why 6 components work better than 10 in terms of normalized mutual information score. As we can see, almost all data points assigned to cluster 1 belong to either Western Europe or Central and Eastern Europe (with two exceptions, one from Southeast Asia and the other from North America and ANZ). Similarly, almost all data points assigned to cluster 0 belong to Middle East and North Africa or Sub-Saharan Africa. The other clustering labels don't show significant correlations with specific regional indicators, but clusters 0 and 1 are themselves sufficient to show that these Gaussians do recover information about regional indicators.

Now that we know that a Gaussian mixture model on our independent variables (i.e. VRD1, VRD2, DRD1, PROMPTITUDE) recovers underlying information about regional indicators, we wonder whether adding these clustering labels improves the performance of our models. We therefore one-hot encode the clustering labels from the Gaussian mixture results and add these encodings to the input data. The results and the figure below show that adding the latent variables improves all three models used previously. For SVR and Linear Regression, each clustering label serves as a unique intercept shared by all data points with that label; for Random Forest Regressor, each clustering label serves as a split criterion that helps each tree in the model further reduce its prediction error and make better decisions.
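Appending one-hot encoded cluster labels to the feature matrix can be sketched as below. The helper name append_one_hot and the toy arrays are hypothetical; the notebook may well use a library encoder instead.

```python
import numpy as np

def append_one_hot(X: np.ndarray, labels: np.ndarray, n_clusters: int) -> np.ndarray:
    """Append one indicator column per cluster to the feature matrix X."""
    one_hot = np.eye(n_clusters)[labels]   # row i is the indicator vector of labels[i]
    return np.hstack([X, one_hot])

# Toy input: 4 samples, 2 features, cluster ids from a 3-component mixture.
X = np.array([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]])
labels = np.array([0, 2, 1, 2])
X_aug = append_one_hot(X, labels, n_clusters=3)
print(X_aug)
```

In a linear model, the coefficient on each indicator column shifts the prediction by a constant for every sample in that cluster, which is the per-cluster intercept effect described above.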

Based on the EDA and the above statistical analysis results, we derive the following key takeaways:

  1. Globally, a faster-increasing vaccination rate doesn't necessarily correlate with a lower death rate over time. As figure 7 (the boxplot) shows, the new death rate doesn't show an overall decreasing trend as we increase the time span.
  2. Locally, an earlier onset date of vaccination is more likely to be associated with a decreasing trend in the average new death rate over time in developed regions, while, ironically, an increasing trend is observed in developing regions. The trend in developed regions is intuitive. We believe the trend is reversed in developing countries for two reasons: the virus reached these areas at large scale later, and the statistics, especially infection and death counts, were not accurate enough when the virus first broke out. Future work is needed to verify these explanations.
  3. Among the various "happiness" metrics, social support and healthy life expectancy show high correlation with our variables (i.e. the first two principal components of average new vaccination rate, the first principal component of average new death rate, and promptitude), while generosity and perception of corruption show low correlation with our variables.
  4. Among the happiness metrics that show significant correlations, the features with significant p-values in the model are the average new vaccination rate and the average new death rate. Promptitude (i.e. days since the earliest onset) has an insignificant p-value in the model under our context.
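Takeaway 4 relies on per-feature p-values. As a hedged illustration of how such p-values can be obtained, the sketch below computes two-sided t-test p-values for ordinary least squares coefficients. It assumes an OLS model and uses synthetic data; the helper name `ols_p_values`, the demo arrays, and the effect sizes are hypothetical and are not taken from our analysis.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Return OLS coefficients and two-sided t-test p-values (intercept first)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - design.shape[1]  # residual degrees of freedom
    sigma2 = residuals @ residuals / dof  # unbiased noise variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    return beta, p_values


rng = np.random.default_rng(0)

# Synthetic stand-in for four standardized features; the fourth column has
# no true effect on the target, loosely mimicking a feature such as promptitude.
X_demo = rng.normal(size=(150, 4))
y_demo = (
    1.0 * X_demo[:, 0]
    - 0.8 * X_demo[:, 1]
    + 0.6 * X_demo[:, 2]
    + rng.normal(size=150)
)

beta, p_values = ols_p_values(X_demo, y_demo)
for name, p in zip(["intercept", "f1", "f2", "f3", "f4"], p_values):
    print(f"{name}: p = {p:.4f}")
```

A small p-value for a coefficient suggests that the corresponding feature contributes to the fit beyond what noise would explain; a large one means the data give no evidence of a contribution, which is not the same as showing the feature has no effect.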

Ethics & Privacy

Data collection

Data storage

Analysis

Modeling

Reproducibility and replicability

Conclusion & Discussion

Here we would like to reinforce our question:

How are the onset of vaccination, the vaccination rates, and the death rate in a country related to the happiness score of that country?

And the hypothesis:

An earlier onset and an increasing rate of vaccination, which possibly correlates with a lower death rate over time, result in a potentially higher happiness score.

Based on all of the above analysis, we draw the following overall conclusion: while increasing average new vaccination rates don't generally correlate with decreasing average new death rates, the onset of vaccination does have a negative relationship with the average new death rates in developed countries and a positive relationship with the average new death rates in developing countries. When all three factors are taken into account, they have high correlation with two metrics of happiness score, namely social support and healthy life expectancy, and low correlation with the other two metrics, namely generosity and perception of corruption. We cannot draw conclusions from the coefficients of the variables in our analysis because the independent variables have gone through PCA and standardization, which makes the coefficients hard to interpret. However, the figures in the EDA section still suggest that the average new vaccination rates and the average new death rate positively correlate with the two metrics of happiness score that correlate more strongly with our independent variables, and that the number of days since the earliest country's onset of vaccination has a negative correlation with these two metrics.

We would also like to discuss the limitations of this project. In particular, we did not take into account the effects of different COVID-19 variants, different brands of vaccines, different degrees of general precaution that people took against the virus, and different statistical precision among countries in counting daily vaccinations and deaths. These four differences can strongly bias our conclusion and yield some unexpected results. For instance, if a developing country initially had a poor surveillance system for counting deaths caused by COVID-19 but improved it significantly later, we would likely see an increasing trend in the average new death rate as we increase the time span, even though we also see an increasing trend in average new vaccination rates. Likewise, if a country started vaccination before the virus was widely transmitted, and the virus later spread exponentially, we would likely see the same pattern (i.e. a positive correlation between average new vaccination rate and average new death rate across different time spans). Finally, we assumed that the vaccination and death statistics that countries reported for COVID-19 are faithful. In reality, there is still a possibility that governments report fake data to the WHO for some reason, which could also bias our conclusion. Again, our conclusion is based on the analysis of the data we collected from Kaggle. If more factors were taken into account and the data were more faithful, an opposite conclusion would not be unexpected.

Team Contributions

We did not assign teammates to specific sections. All of us contributed to all sections and worked together on every checkpoint, and each of us put considerable effort into building this project.

Appendix: Team Expectations


Appendix: Project Timeline

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 10/20 | 5 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 10/21 | 5 PM | Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal |
| 10/23 | 10 PM | Edit, finalize, and submit proposal; Search for datasets | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part |
| 11/15 | 6 PM | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan; Submit Checkpoint #1: Data* |
| 11/19 | 5 PM | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Submit Checkpoint #2: EDA* |
| 11/30 | 5 PM | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 12/1 | 5 PM | Finalize project | Turn in Final Project & Group Project Surveys |