Repository of files related to the NYT COVID-19 dataset

2020-04-04T23:36-05:00

I’ve created a GitHub repository based on the New York Times’ COVID-19 database.

The New York Times has been compiling COVID-19 data from various state health departments. They have published the data as a GitHub repository. There are two files of interest in their repository: us-states.csv and us-counties.csv.

In these CSV files, each row gives the cumulative number of (confirmed) cases and deaths in a particular geographical area as of a particular date. For example, the first ten lines of us-states.csv (the header plus nine rows of data) are:

date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,1,0
2020-01-24,Illinois,17,1,0
2020-01-24,Washington,53,1,0
2020-01-25,California,06,1,0
2020-01-25,Illinois,17,1,0
2020-01-25,Washington,53,1,0
2020-01-26,Arizona,04,1,0
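
One detail worth noticing in the sample above: the fips codes for California and Arizona have leading zeros (06 and 04), which are lost if the column is read as a number. Here is a minimal sketch of loading the file with Pandas while keeping those codes as text. It is only an illustration, not the script from the repo; the helper name load_states is made up, and the sample rows are inlined so the snippet runs on its own.

```python
import io

import pandas as pd

# The first ten lines of us-states.csv, copied from the sample above.
SAMPLE = """date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,1,0
2020-01-24,Illinois,17,1,0
2020-01-24,Washington,53,1,0
2020-01-25,California,06,1,0
2020-01-25,Illinois,17,1,0
2020-01-25,Washington,53,1,0
2020-01-26,Arizona,04,1,0
"""


def load_states(source):
    """Read a us-states.csv-style file (hypothetical helper).

    fips is read as a string so that codes like "06" keep their leading
    zero, and date is parsed into real timestamps.
    """
    return pd.read_csv(source, dtype={"fips": str}, parse_dates=["date"])


states = load_states(io.StringIO(SAMPLE))
print(states.dtypes)
print(states)
```

The same function should also accept a path to a local copy of us-states.csv in place of the StringIO object.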

These are easy enough to analyze using a tool like Pandas. However, a file in this format, with one row per state per date, can be hard to work with in a spreadsheet program.

So, I wrote a Python script that uses Pandas to output a set of CSV files that people can analyze using spreadsheet software. The script also creates a number of plots, including one for each state.
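
To give a flavor of the kind of reshaping involved, here is a minimal sketch that turns the long format into a table with one row per date and one column per state, which is the shape a spreadsheet handles well. This is an assumption about what a spreadsheet-friendly layout might look like, not a copy of what my script does; the function name to_wide is made up for this example.

```python
import io

import pandas as pd

# The first ten lines of us-states.csv, inlined so the example is self-contained.
SAMPLE = """date,state,fips,cases,deaths
2020-01-21,Washington,53,1,0
2020-01-22,Washington,53,1,0
2020-01-23,Washington,53,1,0
2020-01-24,Illinois,17,1,0
2020-01-24,Washington,53,1,0
2020-01-25,California,06,1,0
2020-01-25,Illinois,17,1,0
2020-01-25,Washington,53,1,0
2020-01-26,Arizona,04,1,0
"""

long_df = pd.read_csv(io.StringIO(SAMPLE), dtype={"fips": str}, parse_dates=["date"])


def to_wide(df, value="cases"):
    """Pivot to one row per date and one column per state (hypothetical helper)."""
    wide = df.pivot(index="date", columns="state", values=value)
    # The counts are cumulative, so carry the last known value forward
    # (the truncated sample is missing some states on the last date),
    # and use 0 for dates before a state's first reported row.
    return wide.ffill().fillna(0).astype(int)


wide_cases = to_wide(long_df)

# to_csv() with no path returns the CSV text; pass a file name to write a file.
print(wide_cases.to_csv())
```

Calling to_wide(long_df, value="deaths") gives the corresponding table for deaths.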

The script and all the output are available in my new repo. I have licensed the script under an MIT-style license. Feel free to fork and improve! (I haven’t licensed the outputs of the script in the same way, since they are subject to the license restrictions imposed by the Times.)

A big disclaimer: I’m not an epidemiologist (my expertise being in noncommutative algebra, not communicable disease). So, please don’t make any policy decisions based on my plots, or on those of any other amateur who doesn’t know what they are doing.

Stay safe and healthy!

U.S. COVID-19 cases and deaths, based on data from the New York Times