This analysis examines COVID-19 data to characterize the growth of the virus across states and counties in the United States. I use datasets of confirmed COVID-19 cases and deaths, together with county demographics and state testing information, to predict the number of new COVID-19 cases per county on a given day and the total number of deaths per state. The analysis finds that historical trends are important for anticipating future case counts, and that the number of deaths, predicted from the number of people tested and the number of confirmed cases, is a better indicator of the growth of COVID-19 than the number of confirmed cases, because of testing biases in the confirmed counts. Future research will examine percent change over time and incorporate more detailed hospitalization data to form a clearer picture of the COVID-19 crisis.
This analysis explores COVID-19 data to determine the impact of the virus in the United States, identifying informative variables through exploratory data analysis (EDA) and using sklearn to build more complex linear models. I examine cumulative and additional (newly reported) cases by state, along with the percent change in cases, to predict the number of COVID-19 cases on 4/18/20 for each county, and I analyze biases in the data before predicting the total number of deaths per state. The main questions asked are:
The state dataset has 140 records covering various states and provinces around the world and 18 features, including the number of confirmed cases, deaths, people hospitalized, and people tested as of 4/18/20.
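As a rough illustration of how state-level columns like these could feed a linear model for total deaths, the sketch below fits sklearn's `LinearRegression` on synthetic data. The column names (`Confirmed`, `People_Tested`, `Deaths`), the coefficients, and the generated values are all assumptions made for the example. They are not taken from the actual dataset and this is not the model fitted in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the state dataset (values are made up).
rng = np.random.default_rng(0)
n_states = 50
states = pd.DataFrame(
    {
        "Confirmed": rng.integers(100, 50_000, size=n_states),
        "People_Tested": rng.integers(1_000, 500_000, size=n_states),
    }
)
# Assumed, purely illustrative relationship between the features and deaths.
states["Deaths"] = 0.04 * states["Confirmed"] + 0.001 * states["People_Tested"]

# Fit deaths as a linear function of confirmed cases and people tested.
X = states[["Confirmed", "People_Tested"]]
y = states["Deaths"]
model = LinearRegression().fit(X, y)

# R^2 on the training data; on real data a held-out split would be needed.
r2 = model.score(X, y)
```

Because the synthetic target is an exact linear combination of the two features, the fit recovers the assumed coefficients. Real data would be noisy and would call for a train/test split before reporting any score.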
The covid_confirmed and covid_deaths datasets each contain 3,255 records, one for each county in the United States, and give the cumulative number of confirmed COVID-19 cases and deaths, respectively, by county from 1/22/20 until 4/18/20. The confirmed cases dataset has 99 features, while the deaths dataset has 100.
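Because these datasets store cumulative counts with one column per date, daily new cases and percent change have to be derived from adjacent columns. The following is a minimal sketch of that step, assuming a wide layout with date-named columns. The toy frame, its county labels, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame mimicking the wide layout: one row per county,
# one column per date, each cell a cumulative count.
cumulative = pd.DataFrame(
    {
        "4/15/20": [10, 0],
        "4/16/20": [12, 0],
        "4/17/20": [15, 2],
        "4/18/20": [21, 3],
    },
    index=["County A", "County B"],
)

# Daily new cases: difference between consecutive date columns.
# The first date has no predecessor, so it becomes NaN.
new_cases = cumulative.diff(axis=1)

# Previous day's cumulative count, with zeros masked to NaN so that
# percent change is undefined (NaN) rather than infinite.
previous = cumulative.shift(axis=1)
previous = previous.where(previous > 0)

# Day-over-day percent change in cumulative cases.
pct_change = new_cases / previous * 100
```

Masking zero counts is one possible design choice. Counties that report their first cases would otherwise show an infinite percent change, which can distort any model that uses this feature.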
The abridged counties dataset has 3,244 records for counties in the United States and 87 features, including county demographics (total population, ages, poverty, mortality rates, health issues) as well as lockdown dates and social distancing.
The last dataset is the daily dataset, which has 3,769 records and 27 features. It provides daily information for each state on positive and negative test results, hospitalizations, the number of people in the ICU or on a ventilator, and the amount of testing.