To grasp the true impact of COVID-19 on our societies, we need to know the effect of the pandemic on mortality. In other words, we need to know how many deaths can be attributed to the virus, directly and indirectly.
Visualizing mortality to gauge the impact of the pandemic in different countries is already popular. You may have seen at least some of these graphs and websites: FT, Economist, Our World in Data, CBS, EFTA, CDC, EUROSTAT, and EUROMOMO. But estimating the impact of COVID-19 on mortality is also controversial, with people either misunderstanding or distrusting the way in which the impact is measured and assessed.
In this document, I show step-by-step how we can estimate the impact of COVID-19 on mortality. I build a large number of statistical models that we can use to predict expected mortality in 2020. The complexity of the models ranges from the simplest, based only on weekly averages from past years, to what is currently the state of the art.
But this is not all. I also review the predictive performance of all of these models, so that we know which ones work best. I run the models on publicly available data from the Netherlands, use only the open-source software R, and share the code here, so anyone can check, replicate, and extend the exercise.
I hope this document will provide some transparency about how expected mortality is and can be estimated from data and how it can be used to identify the effect of the pandemic. Feel free to try different estimation strategies, adapt the code to data from different countries, or use different measures to compare the performance of the models.
Before we begin, let's get on the same page with regard to our main concepts. To estimate the effect of COVID-19 on mortality, we rely on the concept of excess mortality, which is defined as the observed number of deaths over a period of time minus the number of deaths we expected to observe over the same period. For example, if we observe 3,400 deaths in a week in which we would have expected 3,000 in the absence of a pandemic, the excess mortality for that week is 400. Clearly, this implies that we need to estimate the expected mortality that we would have observed in the absence of the pandemic.
How do we estimate expected mortality? Simply put, we extrapolate (predict) from the observed past trends in mortality, possibly adding contemporaneous data (for example, temperature) to aid the predictions. In short, we fit statistical models to data from the past, and we extrapolate from these models for 2020. We can fit many different statistical models to the past data, each making different decisions about what features of the data to use, and how. Some of these models will make better predictions than others, so it is important to compare their predictive performance before we settle on the one or several models we will use to calculate excess mortality.
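To make this concrete, here is a minimal sketch of the general strategy, assuming a weekly dataset d with columns year, week, and deaths (exactly what we construct below). The simple linear model is only a placeholder for the more serious models considered later.
past <- subset(d, year < 2020) # fit only on pre-pandemic data
m <- lm(deaths ~ factor(week) + year, data = past) # seasonality via week dummies plus a yearly trend
expected <- predict(m, newdata = subset(d, year == 2020)) # expected weekly deaths for 2020
excess <- sum(subset(d, year == 2020)$deaths - expected) # excess mortality for the year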
First, we need to obtain data on mortality. The Dutch statistical institute CBS provides regularly updated data on deaths from all causes in the country. The data is accessible directly from within R using the cbsodataR package. Here is the code to get the relevant mortality dataset, which starts at the beginning of 1995; the latest available week (at the time of writing) is Week 52 of 2020. We also load the packages we are going to use for working with the data.
library(tidyverse)
library(cbsodataR)
library(xts)
library(utils)
library(httr)
# Get the mortality dataset
all.deaths <- cbs_get_data('70895ned') %>% # specify the dataset id
  cbs_add_label_columns() %>%
  mutate(year = as.numeric(as.character(substr(Perioden, 1, 4))), # extract the year
         week = as.numeric(as.character(substr(Perioden, 7, 8))), # extract the week
         year.month = format(strptime(paste(year, week, 1), format = "%Y %W %u"), "%Y.%m"), # make a year.month index
         filter_n = as.character(substr(Perioden, 5, 6)), # used below to filter out yearly totals
         deaths = Overledenen_1, # rename column
         n_days = trimws(gsub("\\(", "", substr(Perioden_label, 14, 15))), # number of days in the week
         n_days = as.numeric(as.character(ifelse(nchar(n_days) == 0, 7, n_days))), # usually 7, but not always for weeks 0, 1, 52 and 53
         week = case_when(week == 0 ~ 1, week == 53 ~ 52, TRUE ~ week) # recode weeks 0 and 53 to 1 and 52
  ) %>%
  filter(filter_n != 'JJ') # remove yearly totals
# Make a function to filter and aggregate the data
data.prep <- function(sex, age_group){
  all.deaths %>%
    filter(Geslacht == sex, LeeftijdOp31December == age_group) %>%
    dplyr::select(year, week, year.month, deaths, n_days) %>%
    mutate(index = paste0(year, '.', week)) %>%
    group_by(index) %>% # needed to collapse the duplicated weeks 1 and 52
    summarize(year = mean(year), week = mean(week), deaths = sum(deaths),
              year.month = first(year.month),
              n_days = sum(n_days) - 7) %>% # n_days now records the deviation from a standard 7-day week
    dplyr::select(-index) %>%
    arrange(year, week)
}
# Apply the function to get all deaths
deaths <- data.prep(sex = 1100, age_group = 10000) # all deaths
# We can also use the function to get deaths for specific sex and age groups
deaths.m <- data.prep(sex = 3000, age_group = 10000) # men only
deaths.w <- data.prep(sex = 4000, age_group = 10000) # women only
deaths.80plus <- data.prep(sex = 1100, age_group = 21700) # all, 80+ deaths
deaths.65to80 <- data.prep(sex = 1100, age_group = 53950) # all, 65 to 80 deaths
deaths.0to65 <- data.prep(sex = 1100, age_group = 41700) # all, 0 to 65 deaths
deaths.m.80plus <- data.prep(sex = 3000, age_group = 21700) # men, 80+ deaths
deaths.m.65to80 <- data.prep(sex = 3000, age_group = 53950) # men, 65 to 80 deaths
deaths.m.0to65 <- data.prep(sex = 3000, age_group = 41700) # men, 0 to 65 deaths
deaths.w.80plus <- data.prep(sex = 4000, age_group = 21700) # women, 80+ deaths
deaths.w.65to80 <- data.prep(sex = 4000, age_group = 53950) # women, 65 to 80 deaths
deaths.w.0to65 <- data.prep(sex = 4000, age_group = 41700) # women, 0 to 65 deaths
As you can see, getting the mortality data per week for the Netherlands is quite straightforward. But if you don’t want to bother with getting the data yourself, you can access the complete dataset used in the analysis here.
There is only one complication that creates a lot of nuisance: some years start with Week 0 rather than Week 1, a few years end with Week 53 rather than Week 52, and some years have incomplete Weeks 1 and 52. We could ignore this (as almost everyone else does), but we can also tackle the complication head-on. To do so, we recode weeks 0 and 53 to 1 and 52 respectively, then aggregate the data so that each of these weeks features only once in the resulting dataset, and, most importantly, we create a variable that tracks the number of days in the week. This variable will be the same for all weeks between 2 and 51 but will differ for weeks 1 and 52. In this way we keep the abnormal starting and ending weeks of the year in the data, while being able to account for their unequal length when we analyze the data.
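To see what this recoding does, consider a toy example (made-up numbers, not real CBS data) in which a year starts with a 2-day Week 0 followed by a 5-day Week 1:
# Toy example: a 2-day week 0 and a 5-day week 1 collapse into one 7-day week 1
toy <- tibble(year = 2015, week = c(0, 1), deaths = c(800, 2100), n_days = c(2, 5)) %>%
  mutate(week = case_when(week == 0 ~ 1, week == 53 ~ 52, TRUE ~ week)) %>%
  group_by(year, week) %>%
  summarize(deaths = sum(deaths), n_days = sum(n_days) - 7, .groups = 'drop')
toy # one row: week 1 with deaths = 2900 and n_days = 0 (a standard-length week)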
We will also need the total population data for the country, so let’s get it from the same source. In addition to the total population, we can extract the population by sex and age groups. Since the data for 2020 is not yet available, we extrapolate from the last observed values.
# Get the population table
all.pop <- cbs_get_data('7461eng') %>%
  cbs_add_label_columns() %>%
  mutate(pop = TotalPopulation_1 / 1e6, # divide by a million
         year = as.numeric(as.character(Periods_label))) %>%
  filter(year > 1994) %>%
  dplyr::select(Sex_label, Age_label, year, pop)
# Make a function to filter the data
pop.data.prep <- function(sex, age_group){
  pop.temp <- all.pop %>%
    filter(Sex_label == sex, Age_label == age_group) %>%
    mutate(change.p = 1 + (pop - lag(pop)) / lag(pop)) # yearly population growth factor
  pop.temp %>% # extrapolate data for 2020 from the average growth of the last two years
    add_row(year = 2020, pop = last(pop.temp$pop) * mean(tail(pop.temp$change.p, 2))) %>%
    dplyr::select(year, pop)
}
# Apply the function to get the total population
pop <- pop.data.prep(sex = 'Men and women', age_group = 'Total population (all ages)') # all population
# Population per sex
pop.m <- pop.data.prep(sex = 'Men', age_group = 'Total population (all ages)') # men only
pop.w <- pop.data.prep(sex = 'Women', age_group = 'Total population (all ages)') # women only
# Population per age group
pop.0to20 <- pop.data.prep(sex = 'Men and women', age_group = '0 to 20 years')
pop.20to65 <- pop.data.prep(sex = 'Men and women', age_group = '20 to 65 years')
pop.65plus <- pop.data.prep(sex = 'Men and women', age_group = '65 years or older')
pop.65to70 <- pop.data.prep(sex = 'Men and women', age_group = '65 to 70 years')
pop.70to75 <- pop.data.prep(sex = 'Men and women', age_group = '70 to 75 years')
pop.75to80 <- pop.data.prep(sex = 'Men and women', age_group = '75 to 80 years')
# We need to do some work to match the age categories from the mortality dataset
pop.0to65 <- left_join(pop.0to20, pop.20to65, by = 'year') %>%
  mutate(pop = pop.x + pop.y) %>%
  dplyr::select(year, pop)
pop.65to80 <- left_join(pop.65to70, pop.70to75, by = 'year') %>%
  left_join(., pop.75to80, by = 'year') %>%
  mutate(pop.65 = pop.x + pop.y + pop) %>%
  dplyr::select(year, pop.65)
pop.80plus <- left_join(pop.65plus, pop.65to80, by = 'year') %>%
  mutate(pop.80 = pop - pop.65) %>%
  dplyr::select(year, pop.80)
# We can also do the same per sex and age group simultaneously. Maybe later ...
Ideally, we would have population data on a weekly basis to match the mortality data, but this is not available from CBS. Monthly data, however, is available, although not disaggregated by sex and age group. Still, let’s get it in case we want to use it later.
# Get the population per month
month.pop <- cbs_get_data('83474ENG') %>%
  cbs_add_label_columns() %>%
  mutate(pop.m = PopulationAtStartOfSelectedPeriod_1 / 1e6, # divide by a million
         year = as.numeric(as.character(substr(Periods_label, 1, 4))),
         month = as.character(substr(Periods, 7, 8)),
         year.month = paste0(year, '.', month), # create an index to match the mortality data
         filter_n = as.character(substr(Periods, 5, 6))) %>%
  filter(year > 1994, filter_n != 'JJ') %>% # remove yearly totals
  add_row(year = 2020, month = '12', year.month = '2020.12', pop.m = 17.46964) %>% # add December 2020 manually
  dplyr::select(pop.m, year.month)
Now we can merge the mortality and population datasets and calculate the number of deaths per capita (per million of population). We will also add a counter variable for the year and the shares of people between 65 and 80, older than 65, and older than 80 from the total population.
d <- left_join(deaths, pop, by = 'year') %>% # add total population
  left_join(., pop.80plus, by = 'year') %>% # add population above 80
  left_join(., pop.65to80, by = 'year') %>% # add population 65 to 80
  left_join(., month.pop, by = 'year.month') %>% # add monthly population
  mutate(deaths.pc = deaths / pop, # calculate deaths per capita (per million)
         share.80 = pop.80 / pop * 100, # calculate share of 80+
         share.65 = pop.65 / pop * 100, # calculate share of 65 to 80
         share.65plus = share.80 + share.65, # calculate share of 65+
         counter = year - 2009 # a counter variable for the year
  ) %>%
  filter(year < 2021)
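Before moving on, a quick sanity check on the merged dataset never hurts (this is just an inspection step, not part of the analysis itself):
# Quick sanity checks on the merged dataset
head(d) # inspect the first few rows
colSums(is.na(d)) # any missing values introduced by the joins?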
For the rest of the analysis we will work with the weekly data, but let's quickly visualize the annual mortality trends. Plotting the annual average weekly mortality rate (top panel) and the total number of deaths per year (bottom panel) over the period 1995 to 2020, we notice that the trends prior to 2007 and after 2010 are quite different. This is important because, if we want to include a yearly time trend in our predictions for 2020, the older trend prior to 2007 can distort the predictions. (If you are interested in why the mortality rate in the Netherlands has been increasing over the past years, have a look here.)
mean.year <- d %>%
  group_by(year) %>%
  summarize(total.deaths = sum(deaths),
            mean.deaths.pc = mean(deaths.pc),
            pop = mean(pop))
par(mfrow = c(2, 1), mar = c(4, 2, 1, 1), bty = 'n')
plot(x = mean.year$year, y = mean.year$mean.deaths.pc, type = 'l', lwd = 2, col = 'red',
     ylab = '', xlab = 'Average weekly mortality rate per million per year in the Netherlands')
points(x = mean.year$year, y = mean.year$mean.deaths.pc, col = 'red', pch = 21, bg = 'white')
plot(x = mean.year$year, y = mean.year$total.deaths, type = 'l', lwd = 2, col = 'darkblue',
     ylab = '', xlab = 'Total mortality (number of deaths) per year in the Netherlands')
points(x = mean.year$year, y = mean.year$total.deaths, col = 'darkblue', pch = 21, bg = 'white')
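Given this break in the trend, one reasonable option (a suggestion, not the only choice) is to restrict the estimation sample to the more recent period when fitting models that include a yearly trend. Note that the counter variable created above (year - 2009) already starts counting from this period:
# One way to avoid distortion from the pre-2007 trend: fit trend models on 2010+ only
d.recent <- d %>% filter(year >= 2010)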