Reporting Inextricable Statistics is a Problem

If its use of these items is typical of the NHS at large, the range of daily demand would be between 7.5 million to 12 million, more than the 5.5 million actually supplied. [Source]

It’s been quite clear from the beginning of the epidemic that statistical modelling is not the forte of the UK Government. From expected infection counts to levels of social distancing, from unexpected care home fatalities to PPE requirements, the Government has really struggled to put forward numbers that can be used as a basis of comparison, and when it does, it’s undoubtedly too late. Moreover, the inextricable statistics it presents can be used against the Government to stoke fear.

As a statistician, I think it’s imperative that the public understand how to interpret this data properly.

The UK Government is not alone in inextricable reporting, though. Governments across the world have been guilty of this, and they should make the extra effort to report relative statistics. Because these statistics are used as the basis for solving a problem, the statistic reported must be relative to the problem. Don’t tell us how many items of PPE you’ve sourced: tell us what percentage of demand you have met.
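The PPE figures quoted at the top of this article can be turned into exactly this kind of relative statistic. The following is a minimal sketch, assuming the 5.5 million supplied and the 7.5 to 12 million estimated demand cover the same period; the function name is my own, chosen for illustration.

```python
def coverage_pct(supplied, demand):
    """Share of demand that was met, as a percentage."""
    return 100.0 * supplied / demand


supplied = 5.5e6      # items actually supplied (from the quote above)
demand_low = 7.5e6    # lower estimate of daily demand (from the quote above)
demand_high = 12e6    # upper estimate of daily demand (from the quote above)

best_case = coverage_pct(supplied, demand_low)
worst_case = coverage_pct(supplied, demand_high)

print(f"Demand met: between {worst_case:.0f}% and {best_case:.0f}%")
```

Under those assumptions, somewhere between 46% and 73% of demand was met, which says far more than "5.5 million items" on its own.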

Statistics used for comparisons must be relative

One thing that stuck with me from having been lucky enough to study under Professor Sir David MacKay was that statistics needs to be approachable to make a difference. Back-of-the-envelope statistics can really carry weight; however, you have to recognise which question your statistic is answering. Below, I demonstrate how simple adjustments to commonly reported figures help to answer some pretty big questions.

The Case of Infection Counts

The data I use is from the most reliable source: Johns Hopkins University. More on this at the end.

Although it seems a simple task (it’s not), counting the number of infection cases is important for healthcare organisations monitoring the spread of a pandemic. However, once the counts get quite large, is the raw count still meaningful?

Let’s look at this common chart: here we have the infections per country. They all look to be increasing and slightly flattening around the 225,000 mark (France and Italy a bit lower at 160kish, with Sweden down at 25kish).

Looking at this chart, we could say that Sweden is doing the best of all the countries, for it has the fewest cases. However, Sweden’s population (roughly 10 million) is around five to eight times smaller than those of the other countries, so we would expect (assuming a similar rate of spread) the number of cases to be several times lower by virtue of population size alone.

Therefore, to compare how well one country is doing in relation to another, we should compare a relative statistic: one that has been adjusted for population, so that it tracks the relative risk of being infected. This is known as period prevalence:

Period prevalence is the number of individuals identified as cases during a specified period of time, divided by the total number of people in that population.

The following chart shows exactly this and the perspective entirely changes:

By monitoring period prevalence, we can now see that Spain’s infection count per million is much higher than that of the other countries, and its situation is therefore much worse. Moreover, from this perspective, Sweden does not seem to be doing the best of all the countries; that actually looks to be Germany.

Reporting absolute case counts under-represents the significance of the problem in Sweden.

By transforming our absolute count statistic to a normalised measure, we can more effectively monitor relative risk and make better judgements about how one country performs in relation to another country.
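The normalisation itself is a one-line calculation. The sketch below uses two made-up countries with round numbers (they are not real data) to show how the ordering can flip once population is taken into account; the function name is hypothetical.

```python
def cases_per_million(cases, population):
    """Cases divided by population, scaled to 'per million people'."""
    return cases / population * 1_000_000


# Hypothetical round numbers, for illustration only (not real data)
countries = {
    "Country A": {"cases": 200_000, "population": 60_000_000},
    "Country B": {"cases": 25_000, "population": 10_000_000},
}

for name, c in countries.items():
    rate = cases_per_million(c["cases"], c["population"])
    print(f"{name}: {c['cases']:,} cases, {rate:,.0f} per million")
```

Country B has an eighth of Country A’s cases, yet its rate per million is three quarters of Country A’s: far closer than the absolute counts suggest.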

Following on from this, we know that the epidemic is spreading widely and governments are reacting, but how well are they reacting and how useful have their actions been? To monitor this, we can look at how period prevalence changes between two reasonably spaced points in time.

You would hope that as a country goes into lockdown, fewer people get infected. Looking at the change in the infection rate shows how well governments are dealing with the spread, and how this has changed through time.

To monitor change, we need to pick a time period that is robust. We know of issues of weekend seasonality and the front-loading of US case counts, so in the following chart I take a 10-day difference of the infections-per-million count to smooth over these features and more effectively monitor the rate of growth:

Note: the result from this chart is robust to different time gaps, and the reader is encouraged to experiment. Spoiler alert: the result is largely the same.
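To see why a multi-day difference helps, here is a minimal sketch on made-up numbers: a steady 10 new cases per million a day, with weekend cases only reported on the following Monday. The reporting pattern and the helper lagged_change are my own assumptions for illustration; in pandas the same operation is .diff(10) on the cumulative series.

```python
from itertools import accumulate


def lagged_change(values, lag):
    """Change between each value and the one `lag` steps earlier.

    Positions without a full `lag` of history are returned as None.
    """
    return [None if i < lag else values[i] - values[i - lag]
            for i in range(len(values))]


# Hypothetical reporting pattern over four weeks (not real data):
# Monday reports 30 (its own 10 plus the weekend backlog),
# Tuesday to Friday report 10 each, Saturday and Sunday report 0.
weekly_pattern = [30, 10, 10, 10, 10, 0, 0]
daily = weekly_pattern * 4

# Cumulative cases per million
cumulative = list(accumulate(daily))

one_day = lagged_change(cumulative, 1)
ten_day = lagged_change(cumulative, 10)


def value_range(changes):
    """Smallest and largest change, ignoring the None padding."""
    vals = [c for c in changes if c is not None]
    return min(vals), max(vals)


print("1-day change range: ", value_range(one_day))
print("10-day change range:", value_range(ten_day))
```

On this toy series the day-to-day change swings between 0 and 30 around a true value of 10, while the 10-day change stays between 80 and 120 around a true value of 100. A gap that is a multiple of 7 would cancel this particular weekly pattern entirely.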

So, in the past 10 days, the UK has reported 750 more cases per million population, compared to Spain, which has reported an increase of 250 per million. This tells us that in the UK the virus is still spreading faster than in Spain. It actually seems that the UK is currently in the worst position of the major European countries and Sweden looks to be in the second worst position, owing largely to not going into lockdown. This is something you cannot tell from looking at Figure 1.
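One way to run the robustness check mentioned in the note above is to rank the countries by their recent change for several gaps and see whether the order moves. The sketch below does this on three made-up countries whose daily cases decay at different speeds; the countries, the decay rates and the helper names are all hypothetical.

```python
from itertools import accumulate


def cumulative_cases(start, decay, days):
    """Cumulative series built from daily cases of start * decay**day."""
    return list(accumulate(start * decay ** d for d in range(days)))


def change_over(cumulative, gap):
    """Increase in a cumulative series over the last `gap` days."""
    return cumulative[-1] - cumulative[-1 - gap]


# Hypothetical cumulative cases per million (illustration only, not real data)
series = {
    "Country A": cumulative_cases(start=100, decay=0.99, days=30),
    "Country B": cumulative_cases(start=100, decay=0.93, days=30),
    "Country C": cumulative_cases(start=100, decay=0.96, days=30),
}

# Rank countries from largest to smallest recent increase, for each gap
rankings = {}
for gap in (7, 10, 14):
    rankings[gap] = sorted(
        series,
        key=lambda name: change_over(series[name], gap),
        reverse=True,
    )
    print(f"{gap}-day gap: {rankings[gap]}")
```

If the ranking is the same for every gap, as it is on this toy data, the conclusion does not hinge on the choice of 10 days.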

In this article, I have shown that by dividing by population and calculating differences over time, we can form a picture that provides as much insight as other, more academically thorough statistics. Other metrics (like R0 and others) can seem inextricable because of their complex derivations and concepts. However, the public need to be given simple mathematics to quickly gauge and understand the severity of the problem.

Everyone, and I mean everyone, can learn something from looking at these simple numbers and looking at statistics more relatively.

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice.

Data

Academics at Johns Hopkins University bring together data from various reliable sources (including all major health organisations) to track the growth of coronavirus. The dataset I look at (time_series_covid19_confirmed_global) is updated at regular intervals during the day and can be read into a data frame with the read_csv function from pandas (in Python). I remove the present day because the data is updated intraday.
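Removing the present day is simple once the date labels are parsed. The sketch below assumes the date columns are labelled in month/day/two-digit-year form (for example 5/8/20); the counts are made up and the helper names are my own.

```python
from datetime import date, datetime


def parse_date_label(label):
    """Parse a label such as '5/8/20' (assumed month/day/year) into a date."""
    return datetime.strptime(label, "%m/%d/%y").date()


def drop_present_day(counts_by_label, today):
    """Keep only observations dated strictly before `today`."""
    parsed = {parse_date_label(k): v for k, v in counts_by_label.items()}
    return {day: n for day, n in parsed.items() if day < today}


# Hypothetical cumulative counts (illustration only, not real data)
counts = {
    "5/7/20": 1000,
    "5/8/20": 1100,
    "5/9/20": 1130,   # partial figure, still being updated during the day
}

complete = drop_present_day(counts, today=date(2020, 5, 9))
print(sorted(complete))
```

In a live script, today would come from date.today(); it is passed in explicitly here so the example gives the same result whenever it is run.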

Code

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from countryinfo import CountryInfo
# Countries I want to compare
eu_list = ['United Kingdom','France','Germany','Spain','Italy','Sweden']
fp = 'https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv&filename=time_series_covid19_confirmed_global.csv'
# Read the data into a data frame
df = pd.read_csv(fp)
# Drop the non-date columns so that only case counts are summed
df = df.drop(columns=['Province/State','Lat','Long'])
# Sum across countries/regions, put dates on the rows, then take daily differences
df = df.groupby('Country/Region').sum().T.diff()
# Drop the latest row: the present day is still being updated intraday
df = df.iloc[:-1]
# Get the population of each country
pop = {}
for c in eu_list:
    pop[c] = CountryInfo(c).population()

pop = pd.DataFrame({'population': pd.Series(pop)})
# Adjust all statistics by population (cases per million people)
pop['multiplier'] = 1000000. / pop['population']
df2 = df.copy()
for k in eu_list:
    df2[k] = df2[k] * pop.loc[k,'multiplier']
# Plot infections per country (absolute counts)
df[eu_list].cumsum().plot(figsize=(15,7),title = 'Infections per country [Updated up to 20200508]').grid(); plt.show()
# Plot cases per million population
df2[eu_list].cumsum().plot(figsize=(15,7),title = 'Infections per million, per country [Updated up to 20200508]').grid(); plt.show()
# Plot the 10-day change in cases per million population
df2[eu_list].cumsum().diff(10).plot(figsize=(15,7),title = '10-day change in infections Per Million People [Updated up to 20200508]').grid(); plt.show()