HealthAnalyticsToolkit

COVID-19 Analysis

First, the correct libraries need to be imported. This tutorial utilizes the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Import the Data

The file used in this tutorial can be found here. The information is regularly updated so if you get slighly different numerical results, do not be concerned. Data is stored in a csv file. To read the data, we can use pandas.read_csv. This reads a comma-separated values (csv) file into a DataFrame.

data = pd.read_csv("state_policy_updates_20210221_0722.csv") #Read csv
data
state_id county fips_code policy_level date policy_type start_stop comments source total_phases
0 HI NaN NaN state 2020-05-18 Manufacturing start Policy_Details: Open with adjusted "Safe Pract... sip_submission_form: https://governor.hawaii.g... NaN
1 TX Kinney 48271.0 county 2020-07-03 Mask Requirement start Policy_Details: County is approved to be exemp... sip_submission_form: https://tdem.texas.gov/ga... NaN
2 ID Custer 16037.0 county 2020-10-27 Phase 3 start Policy_Details: No greater than 50 people at i... sip_submission_form: https://coronavirus.idaho... 4.0
3 UT Wayne 49055.0 county 2020-11-24 Phase 1 start Policy_Details: Restrictions for highest level... sip_submission_form: https://coronavirus.utah.... 3.0
4 IL Ford 17053.0 county 2021-01-15 Food and Drink start Policy_Details: Bars and restaurants: Open wit... sip_submission_form: https://coronavirus.illin... 5.0
... ... ... ... ... ... ... ... ... ... ...
4175 TX Morris 48343.0 county 2020-07-03 Mask Requirement start Policy_Details: County is approved to be exemp... sip_submission_form: https://tdem.texas.gov/ga... NaN
4176 ME Piscataquis 23021.0 county 2020-05-18 Outdoor and Recreation start Policy_Details: Rural Reopening : wilderness ... sip_submission_form: https://www.maine.gov/cov... 3.0
4177 NY NaN NaN state 0020-06-08 Resumed Elective Medical Procedures start Other measures and details for this policy inc... BU COVID-19 State Policy Database NaN
4178 WV NaN NaN state 2020-03-19 Food and Drink start Other measures and details for this policy inc... BU COVID-19 State Policy Database NaN
4179 IL Lee 17103.0 county 2021-01-15 Non-Essential Businesses start Policy_Details: - All employees return to work... sip_submission_form: https://coronavirus.illin... 5.0

4180 rows × 10 columns

Clean the Data

After reading the data in, we want to clean the data. In this dataset, some of the dates state 0020 instead of 2020 for the year. Additionally, one data point has 1899 as the date. Being that COVID is more recent, that year is assumed to be 2020 as well. The following code locates and fixes the errors in the dates in the data set.

dates_to_change = []
for index, row in data.iterrows():
  if "0020" == row.date[0:4]:
    data.loc[index, "date"] = "2020" + row.date[4:]
  elif "1899" == row.date[0:4]:
    data.loc[index, "date"] = "2020" + row.date[4:]

Th above for loop iterates through each row in the DataFrame stored in data. With each row, the if/elsif statements check to see if the first four digits of the date are either “0020” or “1899” with the equal-to operator (==). If either of the error values are present, the values are replaced with “2020”.

Analysis

Initial Investigation of Policies

To begin the investigation, let us visualize the data in a scatter plot. The below code does just that in the following steps:

  1. Sort the values by the “state_id” column.
  2. Using a mask, find and store all the rows that are labeled as “start” in the “start_stop” column in a variable called “start_scat”
  3. Create a scatter plot with the data in “start_scat”. Put the values from the “date” column on the x-axis and “policy_type” on the y-axis. The color of the dot on the scatter plot will be assigned by the “state_id”, meaning that each of the dots associated with a given state will be the same color. .
sorted = data.sort_values(by=["state_id"]) #sort the data by state
start_scat = sorted[(sorted.start_stop=="start")]
px.scatter(start_scat, x="date",y= "policy_type", color = "state_id", title = "Policies by State and Date")

Scatter Plot

The initial scatterplot provides a crude visualization of how start of policies in timeseries, color-coded by state. There is not much one can tell from this graph aside from the fact that the implementation of any given policy varied widely in time.

Investigating the Number of Times Each Policy Starts

The code below finds the three most started policies, by counting the number of times that policy was started. This means if that policy was started mulitple times in a single state, it would be counted each time.

start_data = data[(data.start_stop=="start")] #Look only at the start of policies
policies = start_data.policy_type.unique() #Find all the unique policies
count = [] #Create an empty count vector

for val in start_data.policy_type.unique(): #Create a for loop to go through each unique policy 
  ope = start_data[start_data.policy_type == val] #using a mask, find all the policies that match the unique policy
  siz = ope.shape #find out the number of rows that have that policy
  count.append(siz[0])
  
d = {"policy":policies, "num_starts":count}
results = pd.DataFrame(d) #create results dataframe
results.sort_values(by=["num_starts"], ascending = False) #sort the data from largest to smallest number of starts
policy num_starts
10 Shelter in Place 462
4 Food and Drink 263
16 Outdoor and Recreation 242
1 Mask Requirement 239
13 Non-Essential Businesses 226
... ... ...
44 Medical 1
38 Colleges & Universities 1
42 Graduation Ceremony guidelines 1
41 State of Emergency/Funds 1
64 Agriculture 1

65 rows × 2 columns

By printing the dataframe, one can see the policies with the most starts include: Shelter in Place with 462 starts, Food and Drink with 263 starts, and Outdoor and Recreation with 242 starts.

Investigating the Widespread Use of Policies

The code below looks the number of states/territories that implemented each policy. This is interesting to see how widespread the implementation of certain policies are and to determine their reach across the nation.

count = [] #reset count to 0
for val in start_data.policy_type.unique(): #loop through each unique policy
  poli = start_data[start_data.policy_type == val] # poli hold the dataframe rows for the unique policy 
  ct = 0 #set a counter variable to 0
  for val2 in poli.state_id.unique(): #loop through each unique state within poli
    ct = ct + 1 #increment count for each unique state
  count.append(ct) #Add the number of states to the counter list

results["num_states"] = count #Add to the results dataframe
results.sort_values(by=["num_states"], ascending = False) #sort the data by state
policy num_starts num_states
13 Non-Essential Businesses 226 54
4 Food and Drink 263 54
12 Entertainment 196 53
5 Childcare (K-12) 193 53
16 Outdoor and Recreation 242 52
... ... ... ...
38 Colleges & Universities 1 1
36 Graduation 1 1
11 Wholesale Trade 5 1
27 Public Health Advisory System 27 1
64 Agriculture 1 1

65 rows × 3 columns

The policies with the most widespread implementation include: Non-Essential Businesses with 54 states/territories starting that policy, Food and Drink with 54 states/territories starting that policy, Entertainment with 53 states/territories starting that policy, and Childcare (K-12) with 53 states/terrirtories starting that policy.

Visualization of Policy Implementation Across the US

sorted_by_state = results.sort_values(by=["num_states"], ascending = False) #sort the data by state
plt.figure(figsize=(25, 5))
plt.bar(sorted_by_state.policy,sorted_by_state.num_states)
plt.xticks(rotation = 'vertical')
plt.ylabel('Number of States Starting Policy')
plt.xlabel('Policy')
plt.title("Policy Implementation Across US: Number of States Starting Policy vs Policy")

Bar Plot

What about Mask Requirements?

In as early as April, the CDC recommended masks to prevent the spread of the virus. However, interestingly, masks requirements do not make the top of the list for the number or starts or for being a widespread policy across the nation. This seems to be a major gap in slowing/stopping the pandemic!

To further understand, we want to look at which policies involve masks and further investigate them.

num = 0; 
for index, row in results.iterrows():
  if ("Mask" in row.policy):
    num = num+1  

The above for loop iterates thorugh each row in the DataFrame. If the work “mask” appeared within the policy name of a given row, the counting variable “num” was increased by 1.

After completion of this loop, num equals 3. There are three different mask related policies: “Mask Requirements”, “Mandate Face Mask Use By All Individuals In Public Spaces”, and “Mandate Face Mask Use By All Individuals In Public Facing Businesses”, all of which we will look at below.

mask_req = results[results.policy == "Mask Requirement"]
mask_public = results[results.policy =="Mandate Face Mask Use By All Individuals In Public Spaces"]
mask_bus = results[results.policy =="Mandate Face Mask Use By All Individuals In Public Facing Businesses"] 
mask_req
policy num_starts num_states
1 Mask Requirement 239 41
mask_public
policy num_starts num_states
14 Mandate Face Mask Use By All Individuals In Pu... 38 38
mask_bus
policy num_starts num_states
24 Mandate Face Mask Use By All Individuals In Pu... 46 46

Mask Requirement had the highest number of starts with 239 starts and 41 different states. Mandate Face Mask Use By all Individuals in Public Facing Businesses had the next highest number, with 46 starts and 46 states. Finally, Mandate Face Mask Use By All Individuals in Public Spaces had 38 starts and 38 states.

With these numbers, the next question is: How many states implemented any of the Masks policies? The following code investigates that question.

data_copy = data.copy() #Create a copy of the data

#Loop through and change any policy type involving masks to be called Mask in the copy of the data
for index, row in data_copy.iterrows():
 if ("Mask" in row.policy_type):
       data_copy.loc[index, "policy_type"] = "Mask"

data_copy = data_copy[(data_copy.start_stop=="start")] #Look only at the data that is a "start"

count = [] #reset count to 0
for val in data_copy.policy_type.unique(): #loop through each unique policy
  poli = data_copy[data_copy.policy_type == val] # poli hold the dataframe rows for the unique policy 
  ct = 0 #set a counter variable to 0
  for val2 in poli.state_id.unique(): #loop through each unique state within poli
    ct = ct + 1 #increment count for each unique state
  count.append(ct) #Add the number of states to the counter list

policies2 = data_copy.policy_type.unique() #Find all the unique policies

w = {"policy":policies2, "num_states":count}
results_mask = pd.DataFrame(w) #create dataframe

results_mask[(results_mask.policy == "Mask")] #Look at the number of states that implemented a mask related policy
policy num_states
1 Mask 51

In total, 51 states/territoties implemented a mask related policy. This is more widespread than the individual policies, however, would still not be in one of the top five policies implemented accross the country. Furthermore, there is variance in what these three different mask policies mean for the public regarding face coverings.

mask_reqs_start_scat = px.scatter(data_copy[(data_copy.policy_type == "Mask")], x="date",y= "policy_type", color = "state_id", title = "Start Mask Requirement")

mask_reqs_start_scat

The states that did implement mask related policies started them at a variety of points throughout the pandemic. For example, HI started a mask related policy on Dec 30, 2020, this is well into the pandemic. AK and MN being two of the first to have the policy started, beginning them back in March when the pandemic was starting!

Hopefully this example gave you some ideas of how you can visualize and analyze datasets using Python!!