First, the correct libraries need to be imported. This tutorial utilizes the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The file used in this tutorial can be found here. The information is regularly updated so if you get slighly different numerical results, do not be concerned. Data is stored in a csv file. To read the data, we can use pandas.read_csv. This reads a comma-separated values (csv) file into a DataFrame.
data = pd.read_csv("state_policy_updates_20210221_0722.csv") #Read csv
data
state_id | county | fips_code | policy_level | date | policy_type | start_stop | comments | source | total_phases | |
---|---|---|---|---|---|---|---|---|---|---|
0 | HI | NaN | NaN | state | 2020-05-18 | Manufacturing | start | Policy_Details: Open with adjusted "Safe Pract... | sip_submission_form: https://governor.hawaii.g... | NaN |
1 | TX | Kinney | 48271.0 | county | 2020-07-03 | Mask Requirement | start | Policy_Details: County is approved to be exemp... | sip_submission_form: https://tdem.texas.gov/ga... | NaN |
2 | ID | Custer | 16037.0 | county | 2020-10-27 | Phase 3 | start | Policy_Details: No greater than 50 people at i... | sip_submission_form: https://coronavirus.idaho... | 4.0 |
3 | UT | Wayne | 49055.0 | county | 2020-11-24 | Phase 1 | start | Policy_Details: Restrictions for highest level... | sip_submission_form: https://coronavirus.utah.... | 3.0 |
4 | IL | Ford | 17053.0 | county | 2021-01-15 | Food and Drink | start | Policy_Details: Bars and restaurants: Open wit... | sip_submission_form: https://coronavirus.illin... | 5.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4175 | TX | Morris | 48343.0 | county | 2020-07-03 | Mask Requirement | start | Policy_Details: County is approved to be exemp... | sip_submission_form: https://tdem.texas.gov/ga... | NaN |
4176 | ME | Piscataquis | 23021.0 | county | 2020-05-18 | Outdoor and Recreation | start | Policy_Details: Rural Reopening : wilderness ... | sip_submission_form: https://www.maine.gov/cov... | 3.0 |
4177 | NY | NaN | NaN | state | 0020-06-08 | Resumed Elective Medical Procedures | start | Other measures and details for this policy inc... | BU COVID-19 State Policy Database | NaN |
4178 | WV | NaN | NaN | state | 2020-03-19 | Food and Drink | start | Other measures and details for this policy inc... | BU COVID-19 State Policy Database | NaN |
4179 | IL | Lee | 17103.0 | county | 2021-01-15 | Non-Essential Businesses | start | Policy_Details: - All employees return to work... | sip_submission_form: https://coronavirus.illin... | 5.0 |
4180 rows × 10 columns
After reading the data in, we want to clean the data. In this dataset, some of the dates state 0020 instead of 2020 for the year. Additionally, one data point has 1899 as the date. Being that COVID is more recent, that year is assumed to be 2020 as well. The following code locates and fixes the errors in the dates in the data set.
dates_to_change = []
for index, row in data.iterrows():
if "0020" == row.date[0:4]:
data.loc[index, "date"] = "2020" + row.date[4:]
elif "1899" == row.date[0:4]:
data.loc[index, "date"] = "2020" + row.date[4:]
Th above for loop iterates through each row in the DataFrame stored in data. With each row, the if/elsif statements check to see if the first four digits of the date are either “0020” or “1899” with the equal-to operator (==). If either of the error values are present, the values are replaced with “2020”.
To begin the investigation, let us visualize the data in a scatter plot. The below code does just that in the following steps:
sorted = data.sort_values(by=["state_id"]) #sort the data by state
start_scat = sorted[(sorted.start_stop=="start")]
px.scatter(start_scat, x="date",y= "policy_type", color = "state_id", title = "Policies by State and Date")
The initial scatterplot provides a crude visualization of how start of policies in timeseries, color-coded by state. There is not much one can tell from this graph aside from the fact that the implementation of any given policy varied widely in time.
The code below finds the three most started policies, by counting the number of times that policy was started. This means if that policy was started mulitple times in a single state, it would be counted each time.
start_data = data[(data.start_stop=="start")] #Look only at the start of policies
policies = start_data.policy_type.unique() #Find all the unique policies
count = [] #Create an empty count vector
for val in start_data.policy_type.unique(): #Create a for loop to go through each unique policy
ope = start_data[start_data.policy_type == val] #using a mask, find all the policies that match the unique policy
siz = ope.shape #find out the number of rows that have that policy
count.append(siz[0])
d = {"policy":policies, "num_starts":count}
results = pd.DataFrame(d) #create results dataframe
results.sort_values(by=["num_starts"], ascending = False) #sort the data from largest to smallest number of starts
policy | num_starts | |
---|---|---|
10 | Shelter in Place | 462 |
4 | Food and Drink | 263 |
16 | Outdoor and Recreation | 242 |
1 | Mask Requirement | 239 |
13 | Non-Essential Businesses | 226 |
... | ... | ... |
44 | Medical | 1 |
38 | Colleges & Universities | 1 |
42 | Graduation Ceremony guidelines | 1 |
41 | State of Emergency/Funds | 1 |
64 | Agriculture | 1 |
65 rows × 2 columns
By printing the dataframe, one can see the policies with the most starts include: Shelter in Place with 462 starts, Food and Drink with 263 starts, and Outdoor and Recreation with 242 starts.
The code below looks the number of states/territories that implemented each policy. This is interesting to see how widespread the implementation of certain policies are and to determine their reach across the nation.
count = [] #reset count to 0
for val in start_data.policy_type.unique(): #loop through each unique policy
poli = start_data[start_data.policy_type == val] # poli hold the dataframe rows for the unique policy
ct = 0 #set a counter variable to 0
for val2 in poli.state_id.unique(): #loop through each unique state within poli
ct = ct + 1 #increment count for each unique state
count.append(ct) #Add the number of states to the counter list
results["num_states"] = count #Add to the results dataframe
results.sort_values(by=["num_states"], ascending = False) #sort the data by state
policy | num_starts | num_states | |
---|---|---|---|
13 | Non-Essential Businesses | 226 | 54 |
4 | Food and Drink | 263 | 54 |
12 | Entertainment | 196 | 53 |
5 | Childcare (K-12) | 193 | 53 |
16 | Outdoor and Recreation | 242 | 52 |
... | ... | ... | ... |
38 | Colleges & Universities | 1 | 1 |
36 | Graduation | 1 | 1 |
11 | Wholesale Trade | 5 | 1 |
27 | Public Health Advisory System | 27 | 1 |
64 | Agriculture | 1 | 1 |
65 rows × 3 columns
The policies with the most widespread implementation include: Non-Essential Businesses with 54 states/territories starting that policy, Food and Drink with 54 states/territories starting that policy, Entertainment with 53 states/territories starting that policy, and Childcare (K-12) with 53 states/terrirtories starting that policy.
sorted_by_state = results.sort_values(by=["num_states"], ascending = False) #sort the data by state
plt.figure(figsize=(25, 5))
plt.bar(sorted_by_state.policy,sorted_by_state.num_states)
plt.xticks(rotation = 'vertical')
plt.ylabel('Number of States Starting Policy')
plt.xlabel('Policy')
plt.title("Policy Implementation Across US: Number of States Starting Policy vs Policy")
In as early as April, the CDC recommended masks to prevent the spread of the virus. However, interestingly, masks requirements do not make the top of the list for the number or starts or for being a widespread policy across the nation. This seems to be a major gap in slowing/stopping the pandemic!
To further understand, we want to look at which policies involve masks and further investigate them.
num = 0;
for index, row in results.iterrows():
if ("Mask" in row.policy):
num = num+1
The above for loop iterates thorugh each row in the DataFrame. If the work “mask” appeared within the policy name of a given row, the counting variable “num” was increased by 1.
After completion of this loop, num equals 3. There are three different mask related policies: “Mask Requirements”, “Mandate Face Mask Use By All Individuals In Public Spaces”, and “Mandate Face Mask Use By All Individuals In Public Facing Businesses”, all of which we will look at below.
mask_req = results[results.policy == "Mask Requirement"]
mask_public = results[results.policy =="Mandate Face Mask Use By All Individuals In Public Spaces"]
mask_bus = results[results.policy =="Mandate Face Mask Use By All Individuals In Public Facing Businesses"]
mask_req
policy | num_starts | num_states | |
---|---|---|---|
1 | Mask Requirement | 239 | 41 |
mask_public
policy | num_starts | num_states | |
---|---|---|---|
14 | Mandate Face Mask Use By All Individuals In Pu... | 38 | 38 |
mask_bus
policy | num_starts | num_states | |
---|---|---|---|
24 | Mandate Face Mask Use By All Individuals In Pu... | 46 | 46 |
Mask Requirement had the highest number of starts with 239 starts and 41 different states. Mandate Face Mask Use By all Individuals in Public Facing Businesses had the next highest number, with 46 starts and 46 states. Finally, Mandate Face Mask Use By All Individuals in Public Spaces had 38 starts and 38 states.
With these numbers, the next question is: How many states implemented any of the Masks policies? The following code investigates that question.
data_copy = data.copy() #Create a copy of the data
#Loop through and change any policy type involving masks to be called Mask in the copy of the data
for index, row in data_copy.iterrows():
if ("Mask" in row.policy_type):
data_copy.loc[index, "policy_type"] = "Mask"
data_copy = data_copy[(data_copy.start_stop=="start")] #Look only at the data that is a "start"
count = [] #reset count to 0
for val in data_copy.policy_type.unique(): #loop through each unique policy
poli = data_copy[data_copy.policy_type == val] # poli hold the dataframe rows for the unique policy
ct = 0 #set a counter variable to 0
for val2 in poli.state_id.unique(): #loop through each unique state within poli
ct = ct + 1 #increment count for each unique state
count.append(ct) #Add the number of states to the counter list
policies2 = data_copy.policy_type.unique() #Find all the unique policies
w = {"policy":policies2, "num_states":count}
results_mask = pd.DataFrame(w) #create dataframe
results_mask[(results_mask.policy == "Mask")] #Look at the number of states that implemented a mask related policy
policy | num_states | |
---|---|---|
1 | Mask | 51 |
In total, 51 states/territoties implemented a mask related policy. This is more widespread than the individual policies, however, would still not be in one of the top five policies implemented accross the country. Furthermore, there is variance in what these three different mask policies mean for the public regarding face coverings.
mask_reqs_start_scat = px.scatter(data_copy[(data_copy.policy_type == "Mask")], x="date",y= "policy_type", color = "state_id", title = "Start Mask Requirement")
mask_reqs_start_scat
The states that did implement mask related policies started them at a variety of points throughout the pandemic. For example, HI started a mask related policy on Dec 30, 2020, this is well into the pandemic. AK and MN being two of the first to have the policy started, beginning them back in March when the pandemic was starting!
Hopefully this example gave you some ideas of how you can visualize and analyze datasets using Python!!