COVID-19 has shut down much of America, but which groups have been affected the most?
To answer this question, I examined the New York Times' county-level COVID-19, and paired it with county-level demographic data estimates from the 2018 American Community Survey.
The goal is to get an idea of whether or not American counties with certain demographic characteristics tend to have more positive COVID-19 tests.
There are a few important caveats here:
It's important to note that this analysis is an estimation - I am not using data that tells us the exact characteristics of who has tested positive for the virus. This disease is fairly new and growing quickly, so this information isn't readily available yet. This analysis serves as an alternate method of estimating which groups may be most affected by COVID-19.
More accurately, this analysis asks, "do counties with a higher prevelance of Group X tend to have more reported COVID-19 cases?"
My findings are presented below.
# import packages
import pandas as pd
import requests
import json
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# read COVID-19 data
covid = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',dtype=object)
covid = covid[covid.date=='2020-04-07']
covid = covid.iloc[:,-3:-1]
covid['cases'] = covid['cases'].astype('float64')
# define api key
key = '2988f01f5e86175bda8beae2b5035e1ccef2d052'
# get column names
url = 'https://api.census.gov/data/2018/acs/acs5?get=B01003_001E,B02001_002E,B02001_003E,B02017_001E,'\
+'B02001_005E,B02001_006E,B02001_007E,B02001_008E,B03002_012E,C18108_002E,B05001_006E,B99051_005E,'\
+f'B17001_002E,B08519_009E&key={key}&for=county:*&in=state:*'
JSONContent = requests.get(url).json()
populations = pd.DataFrame(JSONContent)
populations.columns = populations.iloc[0]
populations = populations.iloc[1:,:]
# rename columns
columns = ['population','white','black','AIAN','asian','pacific_islander',\
'other_race','multiracial','hispanic','under18',\
'noncitizen','foreign_born','below_poverty_level','make_over_75k',\
'state','county']
populations.columns = columns
# parse population to float
populations.population = populations.population.astype('float')
# parse demographic data to float and normalize by population
for column in populations.columns:
if column not in ['population','county','state']:
populations[column] = populations[column].astype('float')
populations[column] = populations[column] / populations['population']
# add fips codes for populations
populations['fips'] = populations['state'].astype('str') + populations['county'].astype('str')
# merge population data into county reponses
populations = populations.drop(columns=['county','state'])
# merge data for all counties
full_data = pd.merge(covid,populations,'right','fips')
full_data.cases = full_data.cases.fillna(0)
# normalize by population, changing zero values to 0.25 to make log transformation possible
# note that this will lead to some bias
full_data['cases'] = np.where(full_data['cases'] == 0, .25,full_data['cases'])
full_data['cases'] = (full_data['cases']) / full_data['population']
full_data = full_data.drop(columns=['fips'])
# rename variables
variables = ['covid_19'] + columns
variables.remove('county')
variables.remove('state')
full_data.columns = variables
I will use a log scale for COVID-19 cases, in order to properly reflect that they are growing exponentially. In short, this will allow us to better gauge the relationship between prevalence of reported COVID-19 cases and prevalence of certain demographic groups.
As we can see below, this will be helpful due to the extremely right-skewed distribution of county case prevalence.
Note also that I normalized infection totals by county populations, so we are dealing with local infection rates, rather than raw numbers of infections.
import matplotlib.pyplot as plt
plt.figure(figsize=(12,8))
plt.title('Proportional Prevalence of Reported COVID-19 Cases By County')
plt.xlabel('Reported Proportion Of Population Infected')
plt.ylabel('Number of Counties')
_ = plt.hist(full_data.covid_19,bins=100)
truncated_data = full_data[full_data.covid_19 < .002]
plt.figure(figsize=(12,8))
plt.title('Proportional Prevalence of Reported COVID-19 Cases By County (only .2% or lower shown)')
plt.xlabel('Reported Proportion Of Population Infected')
plt.ylabel('Number of Counties')
_ = plt.hist(truncated_data.covid_19,bins=35)
As we can see, the chosen log transformation (details in the code) gives us a more normal-looking distribution that should be more reasonable for our purposes.
full_data_log = full_data.copy()
full_data_log['covid_19'] = np.log(full_data_log['covid_19'])
plt.figure(figsize=(12,8))
plt.title('Proportional Prevalence of Reported COVID-19 Cases By County, Log-Scaled')
plt.xlabel('Log-Transformed Reported Proportion Of Population Infected')
plt.ylabel('Number of Counties')
_ = plt.hist(full_data_log.covid_19,bins=35)
For reference while reading the below results: correlations range on a constant scale from -1 (perfectly negatively correlated), to 0 (absolutely no correlation), to 1 (perfectly positively correlated) - which is to say if a value is very close to zero, two factors are not linearly correlated.
Generally, a correlation would need to be at or above .3 or at or below -.3 for us to consider it anything more than weak - but given that most of the groups we're referring to are minorities, we can likely lower that threshold to account for their limited ability to impact response rates as a whole. If, say, a subgroup makes up only 10% of the population, it's unreasonable to expect it to have a drastic impact on overall response rates, and a correlation of .2 may be quite telling.
It is vital to note that correlation does not imply causation - we are simply noting that areas with certain demographics tend to have higher (or lower) reported infection rates (and again, these are reported, county-assigned infections), not definitively stating that certain groups are infected by the virus more frequently than others.
Also, we're using classic linear correlations (Pearson correlations) with an exponential transformation on the number of cases, which is only one kind of correlation. For that reason, failing to find a relationship here doesn't mean necessarily mean one doesn't exist.
results = pd.DataFrame(full_data_log.corr().round(2).iloc[1:,0])
results.columns = ['Correlation with COVID-19 infection rates']
results
Normally, these numbers would indicate weak correlations - but given the aforementioned caveat that these are generally minority groups, some of these correlations could perhaps be better interpreted as quite strong.
Areas with a higher prevalence of certain groups seem particularly at risk:
Meanwhile, whiter areas seem to fare better, suggesting clear racial divides.
Interestingly, heavily-Hispanic/Latino areas appear to generally have significantly fewer reported infections. This has to do with the fact that Puerto Rico (where Hispanics make up over 90% of the population in 77 of 78 counties) has only 573 reported cases so far, which is skewing the data. Further investigation (which we won't get too deeply into here for brevity reasons) found that this correlation would actually be -.03 with Puerto Rico removed from the dataset - decidedly a lower-magnitude relationship than what we would expect a priori based on news reports.
On a related note, a very odd trend here is that poorer populations seem to fare much, much better than rich populations.
These particular results make very little sense in context, which strongly suggests that these numbers may have more to do with who is getting tested than who actually has the disease.
This brings back a key point here: all this only refers to reported cases, not who's actually sick. So some of these numbers could go up or down significantly if we were to model on the true number of cases.
And once more, these are simply correlations, not causal links - it could be that the presence of a certain subgroup is a proxy for another county characteristic, and that group isn't actually affected more or less by COVID-19 than other groups.
However, these trends certainly raise an eyebrow on multiple levels, at the very least.
I'll now run hypothesis tests, seeing if we can reject our null hypothesis that none of these factors independently significantly correlate with the (log-scaled) number of reported COVID-19 cases.
I'll use a 99.5% level of significance, using a high bar for significance.
from statsmodels.formula.api import ols
pvals = pd.DataFrame(index = [0,4])
for item in full_data_log.columns[1:-2]:
formula = str(full_data_log.columns[0]) + ' ~ ' + str(item)
hypothesis = str(item) + ' = 0'
results = ols(formula, full_data_log).fit()
pvals = pd.concat([pvals,pd.DataFrame(pd.DataFrame(results.summary().tables[1]).iloc[2,[0,4]])],axis=1)
pvals.columns = pvals.iloc[0]
pvals = pvals.drop(index=0)
pvals = pvals.transpose()
pvals.columns = ['P-value of Simple Regression Coefficient']
pvals.index.name = ''
pvals
Nearly all of these p-values are vitually zero, which means we expect there would be virtually zero chance of seeing this data if there was no relationship between these socioeconomic factors and prevalence of (log-scaled) reported COVID-19 cases.
Except for the 'multiracial', 'pacific_islander', and 'under18' categories (which do not appear to be significant), we would very clearly consider the rest of these relationships to be significant at just about any reasonable level of significance.
We can evaluate the combined effect of these variables by examining the r-squared of the linear relationship between these factors and prevalence of (log-scaled) reported COVID-19 cases.
formula = str(full_data_log.columns[0]) + ' ~ '
for item in full_data_log.columns[1:-2]:
formula = formula + str(item) + ' + '
formula = formula[:-3]
results = ols(formula, full_data_log).fit()
results.rsquared
results.rsquared ** .5
A linear model built on these combined socioeconomic factors captures about 15% of the variation in log-scaled reported COVID-19 cases, which is akin to a correlation coefficient with an absolute value of about .38.
If this data is to be believed, this would indicate a moderately strong relationship between socioeconomic factors and prevalence of reported COVID-19 cases at the county level - but as we've seen, some of these relationships may be more than just moderate in context, and there are reasons to seriously doubt the validity of this data.
Our tests have suggested a significant correlation between socioeconomic factors and prevalence of reported COVID-19 cases. Most of these relationships appear to be weak-to-moderate at first glance, but may be considered strong in context.
Once again, the key caveat here is this only deals with reported cases. It's possible that these relationships are much stronger or weaker than the data suggests, as cases among poorer populations are believed to be more commonly unreported (and this certainly seems to be the case based on our data). It's also important to note our earlier caveat that one minority group can only impact response rates so much, so a weak correlation may be better interpreted as moderate in this context.
Perhaps if there is bias in how positive cases are reported, perhaps there's less bias in how deaths are reported.
Let's run this same analysis on reported COVID-19 deaths and see if the results say the same.
# read COVID-19 data
covid = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',dtype=object)
covid = covid[covid.date=='2020-04-06']
covid = covid.iloc[:,[-3,-1]]
covid['deaths'] = covid['deaths'].astype('float64')
# merge data for all counties
full_data = pd.merge(covid,populations,'right','fips')
full_data.deaths = full_data.deaths.fillna(0)
# normalize by population, changing zero values to 0.025 to make log transformation possible
# .025 chosen to maintain approximately normal distribution as was done earlier
# note that this will, again, lead to some bias
full_data['deaths'] = np.where(full_data['deaths'] == 0, 0.025,full_data['deaths'])
full_data['deaths'] = (full_data['deaths']) / full_data['population']
full_data = full_data.drop(columns=['fips'])
# rename variables
variables = ['covid_deaths'] + columns
variables.remove('county')
variables.remove('state')
full_data.columns = variables
full_data_log = full_data.copy()
full_data_log['covid_deaths'] = np.log(full_data_log['covid_deaths'])
results = pd.DataFrame(full_data_log.corr().round(2).iloc[1:,0])
results.columns = ['Correlation with COVID-19 death rates']
results
Many of the same trends bear themselves out here, although to a lesser extent. Perhaps it's too early to get a real feel for the death tolls, and/or deaths are too sparse (at least at this point) to generate any strong trends off of death totals.
Based on our results, it seems likely that testing is being administered inequitably (or perhaps cases aren't being reported consistently for another reason), which makes this important analysis quite difficult (and thus it should be taken with a grain of salt).
That being said, while further calculations and much more robust data would be needed to make more definitive statements, the data we have so far suggests that reported COVID-19 cases are clearly popping up among areas with certain socioeconomic characteristics more than others (that is, at least among those cases that have been reported), and some of these relationships appear to be moderately strong at the very least, if not strong. These trends seem to imply disparities in both infection rates and reliability of case reporting across counties.