To grasp the true impact of COVID-19 on our societies, we need to know the effect of the pandemic on mortality. In other words, we need to know how many deaths can be attributed to the virus, directly and indirectly.
It is already popular to visualize mortality in order to gauge the impact of the pandemic in different countries. You might have seen at least some of these graphs and websites: FT, Economist, Our World in Data, CBS, EFTA, CDC, EUROSTAT, and EUROMOMO. But estimating the impact of COVID-19 on mortality is also controversial, with people either misunderstanding or distrusting the way in which the impact is measured and assessed.
In this document, I show step-by-step how we can estimate the impact of COVID-19 on mortality. I build a large number of statistical models that we can use to predict expected mortality in 2020. The complexity of the models ranges from the simplest, based only on weekly averages from past years, to what is currently the state of the art.
But this is not all. I also review the predictive performance of all of these models, so that we know which ones work best. I run the models on publicly available data from the Netherlands, I use only the open-source software R, and I share the code here, so anyone can check, replicate and extend the exercise.
I hope this document will provide some transparency about how expected mortality is and can be estimated from data and how it can be used to identify the effect of the pandemic. Feel free to try different estimation strategies, adapt the code to data from different countries, or use different measures to compare the performance of the models.
Before we begin, let’s get on the same page with regard to our main concepts. To estimate the effect of COVID-19 on mortality, we rely on the concept of excess mortality, which is defined as the observed number of deaths over a period of time minus the number of deaths we expected to observe. Clearly, this implies that we need to estimate the expected mortality that we would have observed in the absence of the pandemic.
How do we estimate expected mortality? Simply put, we extrapolate (predict) from the observed past trends in mortality, possibly adding contemporaneous data (for example, temperature) to aid the predictions. In short, we fit statistical models to data from the past, and we extrapolate from these models for 2020. We can fit many different statistical models to the past data, each making different decisions about what features of the data to use, and how. Some of these models will make better predictions than others, so it is important to compare their predictive performance before we settle on the one or several models we will use to calculate excess mortality.
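To make the idea concrete before we touch any real data, here is the calculation in its simplest form (a toy sketch with made-up numbers, not actual estimates):
# toy example: excess mortality = observed deaths minus expected (predicted) deaths
observed <- c(3100, 3050, 3900) # made-up weekly death counts
expected <- c(3000, 3020, 3010) # made-up model-based predictions for the same weeks
excess <- observed - expected # weekly excess deaths
sum(excess) # total excess deaths over these weeks: 1020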
First, we need to obtain data on mortality. The Dutch statistical institute CBS provides regularly updated data on deaths from all causes in the country. The data is accessible directly from within R using the cbsodataR package. Here is the code to get the relevant mortality dataset. The dataset starts at the beginning of 1995, and the latest available week (at the time of writing) is Week 52 of 2020. We also load the packages we will use for working with the data.
library(tidyverse)
library(cbsodataR)
library(xts)
library(utils)
library(httr)
# Get the mortality dataset
all.deaths <- cbs_get_data('70895ned') %>% # specify the dataset id
cbs_add_label_columns() %>%
mutate (year = as.numeric(as.character(substr(Perioden, 1, 4))), # extract the year
week = as.numeric(as.character(substr(Perioden, 7,8))), # extract the week
year.month = format(strptime(paste(year, week, 1), format = "%Y %W %u"),"%Y.%m"), # make a year.month index
filter_n = as.character(substr(Perioden, 5,6)), # will be used for filtering out
deaths = Overledenen_1, # rename column
n_days = trimws(gsub("\\(", "", substr(Perioden_label, 14, 15))), # this gets the number of days in the week
n_days = as.numeric(as.character(ifelse (nchar(n_days)==0, 7, n_days))), # usually 7, but not always for weeks 0, 1, 52 and 53; rescale
week = case_when(week == 0 ~ 1, week == 53 ~ 52, TRUE ~ week) # recode weeks 0 and 53 to 1 and 52
) %>%
filter(filter_n != 'JJ') # remove yearly totals
# Make function to filter and aggregate the data
data.prep <- function (sex, age_group){
deaths.temp <- all.deaths %>%
filter (Geslacht == sex, LeeftijdOp31December == age_group) %>%
dplyr::select (year, week, year.month, deaths, n_days) %>%
mutate (index = paste0(year, '.', week)) %>%
group_by (index) %>% # this is needed for weeks 1 and 52
summarize(year=mean(year), week=mean(week), deaths = sum(deaths), year.month = first (year.month), n_days = sum(n_days)-7) %>%
dplyr::select (-index) %>%
arrange(year, week)
}
# Apply the function to get all deaths
deaths <- data.prep(sex = 1100, age_group = 10000) # all deaths
# We can also use the function to get deaths for specific sex and age groups
deaths.m <- data.prep(sex = 3000, age_group = 10000) # men only
deaths.w <- data.prep(sex = 4000, age_group = 10000) # women only
deaths.80plus <- data.prep(sex = 1100, age_group = 21700) # all, 80+ deaths
deaths.65to80 <- data.prep(sex = 1100, age_group = 53950) # all, 65 to 80 deaths
deaths.0to65 <- data.prep(sex = 1100, age_group = 41700) # all, 0 to 65 deaths
deaths.m.80plus <- data.prep(sex = 3000, age_group = 21700) # men, 80+ deaths
deaths.m.65to80 <- data.prep(sex = 3000, age_group = 53950) # men, 65 to 80 deaths
deaths.m.0to65 <- data.prep(sex = 3000, age_group = 41700) # men, 0 to 65 deaths
deaths.w.80plus <- data.prep(sex = 4000, age_group = 21700) # women, 80+ deaths
deaths.w.65to80 <- data.prep(sex = 4000, age_group = 53950) # women, 65 to 80 deaths
deaths.w.0to65 <- data.prep(sex = 4000, age_group = 41700) # women, 0 to 65 deaths
As you can see, getting the mortality data per week for the Netherlands is quite straightforward. But if you don’t want to bother with getting the data yourself, you can access the complete dataset used in the analysis here.
There is only one complication that creates a lot of nuisance: some years start with Week 0 rather than Week 1, a few years end with Week 53 rather than Week 52, and some years have incomplete Weeks 1 and 52. We can ignore that (as almost everyone else does), but we can also tackle the complication head-on. To do so, we recode Weeks 0 and 53 to 1 and 52 respectively, then we aggregate the data so that these weeks feature only once in the resulting dataset, and - most importantly - we create a variable that tracks the number of days in the week, which will be the same for all weeks between 2 and 51 but will differ for Weeks 1 and 52. In this way we keep the abnormal starting and ending weeks of the year in the data, but we are able to account for their unequal length when we analyze the data.
We will also need the total population data for the country, so let’s get it from the same source. In addition to the total population, we can extract the population by sex and age groups. Since the data for 2020 is not yet available, we extrapolate from the last observed values.
# Get the population table
all.pop <- cbs_get_data('7461eng') %>%
cbs_add_label_columns() %>%
mutate (pop = TotalPopulation_1 / 1e6, # divide by a million
year = as.numeric(as.character(Periods_label))) %>%
filter (year > 1994) %>%
dplyr::select (Sex_label, Age_label, year, pop)
# Make function to filter data
pop.data.prep <- function (sex, age_group){
pop.temp <- all.pop %>%
filter(Sex_label == sex, Age_label == age_group) %>%
mutate(change.p = 1+(pop-lag (pop))/lag (pop)) # create a population change variable
pop.temp <- pop.temp %>% # extrapolate data for 2020
add_row (year = 2020, pop = last(pop.temp$pop)*mean(tail(pop.temp$change.p,2))) %>% # add data for 2020
dplyr::select (year, pop)
}
# Apply the function to get total population
pop<-pop.data.prep (sex='Men and women', age_group='Total population (all ages)') # all population
# Population per sex
pop.m<-pop.data.prep (sex='Men', age_group='Total population (all ages)') # men only
pop.w<-pop.data.prep (sex='Women', age_group='Total population (all ages)') # women only
# Population per age group
pop.0to20 <- pop.data.prep (sex='Men and women', age_group='0 to 20 years')
pop.20to65 <- pop.data.prep (sex='Men and women', age_group='20 to 65 years')
pop.65plus <- pop.data.prep (sex='Men and women', age_group='65 years or older')
pop.65to70 <- pop.data.prep (sex='Men and women', age_group='65 to 70 years')
pop.70to75 <- pop.data.prep (sex='Men and women', age_group='70 to 75 years')
pop.75to80 <- pop.data.prep (sex='Men and women', age_group='75 to 80 years')
# We need to do some work to match the age categories from the mortality dataset
pop.0to65 <- left_join(pop.0to20, pop.20to65, by = 'year') %>%
mutate (pop = pop.x + pop.y) %>%
dplyr::select (year, pop)
pop.65to80 <- left_join(pop.65to70, pop.70to75, by = 'year') %>%
left_join(., pop.75to80, by = 'year') %>%
mutate (pop.65=pop.x+pop.y+pop) %>%
dplyr::select (year, pop.65)
pop.80plus <- left_join(pop.65plus, pop.65to80, by = 'year') %>%
mutate (pop.80 = pop - pop.65) %>%
dplyr::select (year, pop.80)
# We can also do the same per sex and age group simultaneously. Maybe later ...
Ideally, we would have population data on a weekly basis to match the mortality data, but this is not available from CBS. Monthly data, however, is available, although not disaggregated by sex and age group. Still, let’s get it in case we want to use it later.
# Get population per month
month.pop <- cbs_get_data('83474ENG') %>%
cbs_add_label_columns() %>%
mutate (pop.m = PopulationAtStartOfSelectedPeriod_1 / 1e6, # divide by a million
year = as.numeric(as.character(substr(Periods_label, 1, 4))),
month = as.character(substr(Periods, 7, 8)),
year.month = paste0(year, '.', month), # create index
filter_n = as.character(substr(Periods, 5,6))) %>%
filter (year > 1994, filter_n != 'JJ') %>%
add_row (year = 2020, month = '12', year.month = '2020.12', pop.m = 17.46964) %>%
dplyr::select (pop.m, year.month)
Now we can merge the mortality and population datasets and calculate the number of deaths per capita (per million of population). We will also add a counter variable for the year and the shares of people between 65 and 80, older than 65, and older than 80 from the total population.
d <- left_join (deaths, pop, by = 'year') %>% # add total population
left_join(., pop.80plus, by='year') %>% # add population above 80
left_join(., pop.65to80, by='year') %>% # add population 65 to 80
left_join(., month.pop, by='year.month') %>% # add monthly population
mutate (deaths.pc = deaths / pop, # calculate deaths per capita
share.80 = pop.80/pop*100, # calculate share of 80+
share.65 = pop.65/pop*100, # calculate share of 65 to 80
share.65plus = share.80 + share.65, # calculate share of 65+
counter = year-2009 # a counter variable for the year
) %>%
filter(year < 2021)
For the rest of the analysis we will work with the weekly data, but let’s quickly visualize the annual mortality trends. Plotting the annual average weekly mortality rate (top panel) and the total number of deaths per year (bottom panel) over the period 1995 to 2020, we notice that the trends prior to 2007 and after 2010 are quite different. This is important because if we want to include a yearly time trend in our predictions for 2020, the older trend prior to 2007 can distort the predictions. (If you are interested in why the mortality rate in the Netherlands has been increasing over the past years, have a look here.)
mean.year<-d %>%
group_by(year) %>%
summarize(total.deaths = sum (deaths),
mean.deaths.pc=mean(deaths.pc),
pop = mean(pop)
)
par(mfrow=c(2,1), mar=c(4,2,1,1), bty='n')
plot( x = mean.year$year, y = mean.year$mean.deaths.pc, type='l', lwd=2, col='red', ylab='', xlab='Average weekly mortality rate per million per year in the Netherlands')
points( x = mean.year$year, y = mean.year$mean.deaths.pc, col='red', pch=21, bg='white')
plot( x = mean.year$year, y = mean.year$total.deaths, type='l', lwd=2, col='darkblue', ylab='', xlab='Total mortality (number of deaths) per year in the Netherlands')
points( x = mean.year$year, y = mean.year$total.deaths, col='darkblue', pch=21, bg='white')
Now we are ready to start modeling. In the beginning, we will build simple statistical models that will serve as benchmarks against which to compare the more complex ones. In order to evaluate the predictive performance of the models, we will adopt the following approach: First, we remove the years 2020 and 2019 from the data; then, we fit the models only on data until 2018; then, we extrapolate from these models for 2019; and, finally, we compare these extrapolations (predictions) to the actual observed data from 2019. These are called out-of-sample predictions, because the evaluation data is not part of the sample on which the models are estimated. By comparing the predictions from the models to data that is available but has not been used in estimating the parameters of the models, we can get assessments of the model performance that are both ‘honest’ - because we don’t use the same data to build and evaluate the models - and credible - because we evaluate against real and not against simulated data.
The code below creates two subsets of the data - one to estimate the models on and another one to use for evaluation.
w = 52 # last available week
# Model data: filter data before 2010 and after 2018 out
d.s <- d %>%
filter (week > 0, week <= w) %>%
filter (year > 2009, year < 2019)
# Evaluation data: collect the 2019 data in a separate dataset
d.2019 <- d %>%
filter (week > 0, week <= w) %>%
filter (year == 2019)
To compare the predictive performance of the models, we will use two measures: mean absolute prediction error (MAE) and mean squared prediction error (MSE). Both measures average the differences between the predicted and observed values for each week, but MAE takes the absolute values of the differences while MSE takes their squares. MAE is useful because it provides a direct measure of the wrongly predicted deaths (either too few or too many for the particular week). MSE is useful because it is more closely related to the error that the models try to minimize within the sample. Note that MSE penalizes bigger prediction errors more heavily, because the errors are squared. I like MAE because it is on an easily interpretable scale (number of deaths), while MSE is not. But there can be reasons to prefer MSE as well.
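To make the two measures concrete, here they are as small helper functions (an illustrative sketch; in the code below we simply compute them inline):
# mean absolute prediction error: average absolute difference between predictions and observations
mae <- function(predicted, observed) mean(abs(predicted - observed))
# mean squared prediction error: average squared difference between predictions and observations
mse <- function(predicted, observed) mean((predicted - observed)^2)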
The first model we build is a linear regression model that features only the week, entered as a categorical variable (week dummies). This is equivalent to using the average for each week over the period 2010-2018 to predict the value for that week in 2019. As simple as this is, it is the approach that many newspapers and websites use, including Our World in Data and EFTA. We can have two versions of the model, one with the mortality rate as the dependent variable (also called outcome or response variable) and one with the raw number of deaths as the dependent variable. This model and all that follow also feature the number of days in the week (remember Weeks 1 and 52) as a predictor.
### Model 1a, mortality rate
summary(m1a<-lm(deaths.pc ~ as.character(week) + n_days, data=d.s))
##
## Call:
## lm(formula = deaths.pc ~ as.character(week) + n_days, data = d.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.216 -4.968 -0.027 4.064 53.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 182.87890 3.42797 53.349 < 2e-16 ***
## as.character(week)10 1.50545 4.81681 0.313 0.754787
## as.character(week)11 -6.10880 4.81681 -1.268 0.205428
## as.character(week)12 -10.41508 4.81681 -2.162 0.031171 *
## as.character(week)13 -11.10919 4.81681 -2.306 0.021584 *
## as.character(week)14 -16.82115 4.81681 -3.492 0.000531 ***
## as.character(week)15 -18.32488 4.81681 -3.804 0.000164 ***
## as.character(week)16 -20.42713 4.81681 -4.241 2.75e-05 ***
## as.character(week)17 -22.82594 4.81681 -4.739 2.96e-06 ***
## as.character(week)18 -27.22592 4.81681 -5.652 2.95e-08 ***
## as.character(week)19 -22.37280 4.81681 -4.645 4.58e-06 ***
## as.character(week)2 0.08963 4.81681 0.019 0.985162
## as.character(week)20 -27.32243 4.81681 -5.672 2.65e-08 ***
## as.character(week)21 -27.76133 4.81681 -5.763 1.61e-08 ***
## as.character(week)22 -27.79619 4.81681 -5.771 1.55e-08 ***
## as.character(week)23 -30.45460 4.81681 -6.323 6.65e-10 ***
## as.character(week)24 -32.32116 4.81681 -6.710 6.40e-11 ***
## as.character(week)25 -32.03083 4.81681 -6.650 9.27e-11 ***
## as.character(week)26 -30.50321 4.81681 -6.333 6.27e-10 ***
## as.character(week)27 -24.24861 4.81681 -5.034 7.17e-07 ***
## as.character(week)28 -31.96951 4.81681 -6.637 1.00e-10 ***
## as.character(week)29 -29.87333 4.81681 -6.202 1.35e-09 ***
## as.character(week)3 -3.09495 4.81681 -0.643 0.520883
## as.character(week)30 -30.55710 4.81681 -6.344 5.87e-10 ***
## as.character(week)31 -35.31707 4.81681 -7.332 1.20e-12 ***
## as.character(week)32 -34.44498 4.81681 -7.151 3.92e-12 ***
## as.character(week)33 -34.36092 4.81681 -7.134 4.39e-12 ***
## as.character(week)34 -34.95363 4.81681 -7.257 1.97e-12 ***
## as.character(week)35 -34.06704 4.81681 -7.073 6.50e-12 ***
## as.character(week)36 -32.09631 4.81681 -6.663 8.53e-11 ***
## as.character(week)37 -32.71089 4.81681 -6.791 3.87e-11 ***
## as.character(week)38 -30.79526 4.81681 -6.393 4.38e-10 ***
## as.character(week)39 -29.34699 4.81681 -6.093 2.54e-09 ***
## as.character(week)4 -1.31124 4.81681 -0.272 0.785587
## as.character(week)40 -26.81193 4.81681 -5.566 4.68e-08 ***
## as.character(week)41 -23.88730 4.81681 -4.959 1.03e-06 ***
## as.character(week)42 -25.61764 4.81681 -5.318 1.72e-07 ***
## as.character(week)43 -23.66323 4.81681 -4.913 1.30e-06 ***
## as.character(week)44 -24.25378 4.81681 -5.035 7.13e-07 ***
## as.character(week)45 -23.72909 4.81681 -4.926 1.21e-06 ***
## as.character(week)46 -21.64379 4.81681 -4.493 9.10e-06 ***
## as.character(week)47 -19.18852 4.81681 -3.984 8.01e-05 ***
## as.character(week)48 -16.55134 4.81681 -3.436 0.000650 ***
## as.character(week)49 -13.83111 4.81681 -2.871 0.004296 **
## as.character(week)5 -0.40936 4.81681 -0.085 0.932314
## as.character(week)50 -7.62619 4.81681 -1.583 0.114127
## as.character(week)51 -8.29653 4.81681 -1.722 0.085740 .
## as.character(week)52 -7.41466 4.80315 -1.544 0.123421
## as.character(week)6 0.26819 4.81681 0.056 0.955625
## as.character(week)7 0.57736 4.81681 0.120 0.904649
## as.character(week)8 3.50065 4.81681 0.727 0.467784
## as.character(week)9 0.64345 4.81681 0.134 0.893796
## n_days 23.01131 1.23308 18.662 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.15 on 415 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.6996
## F-statistic: 21.92 on 52 and 415 DF, p-value: < 2.2e-16
As we can see from the output, each week is treated as a separate category, and we get a separate regression coefficient for each week (the intercept is the estimate for Week 1). With all these coefficients, the output is not too pretty, so we will not print it anymore, but you can always call it with summary(). Let’s check that the model is equivalent to taking the weekly averages as predictions.
# it is essentially just the means per week
mean(d.s$deaths.pc[d.s$week==2])
## [1] 182.9685
m1a$coefficients['(Intercept)'] + m1a$coefficients['as.character(week)2']
## (Intercept)
## 182.9685
Yes, it is.
Now we can obtain the model predictions for 2019 and calculate the prediction errors.
# prepare new data for evaluation
predict.m1a.2019<-predict (m1a, newdata=d.2019)
# mean absolute error
mae.m1a.2019<-round (mean(abs(predict.m1a.2019 - d.2019$deaths.pc)), 2)
mae.m1a.2019
## [1] 6.34
# mean squared error
mse.m1a.2019<-round (mean((predict.m1a.2019 - d.2019$deaths.pc)^2), 0)
mse.m1a.2019
## [1] 62
Let’s run the same model but with the raw number of deaths as the dependent variable.
### Model 1b, number of deaths
m1b<-lm(deaths ~ as.character(week) + n_days, data=d.s)
predict.m1b.2019<-predict (m1b, newdata=d.2019) # prepare new data for evaluation
print(mae.m1b.2019<-round (mean(abs(predict.m1b.2019 - d.2019$deaths)), 2)) # mean absolute error
## [1] 164.84
print(mse.m1b.2019<-round (mean((predict.m1b.2019 - d.2019$deaths)^2), 0)) # mean squared error
## [1] 35660
We can see from the output that the MAE of Model 1a (mortality rate) is 6.34. Multiplying this by 52 weeks and 17.28 million (the total population in 2019), we get 5698 wrongly predicted deaths over the year. The equivalent model with raw deaths as the dependent variable (Model 1b) has a MAE of 164.84 and wrongly predicted 8572 deaths. The means of the squared errors (MSE) are not directly comparable between the two models due to the different scales of the dependent variables.
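For transparency, this is how those yearly totals follow from the MAE values (a small sketch using the objects created above; for the mortality-rate model the MAE is per million of population, so we also multiply by the population size):
# yearly total of wrongly predicted deaths implied by the weekly MAE
pop.2019 <- pop$pop[pop$year == 2019] # total population in 2019, in millions (about 17.28)
mae.m1a.2019 * 52 * pop.2019 # Model 1a (mortality rate): about 5698 deaths
mae.m1b.2019 * 52 # Model 1b (raw deaths): about 8572 deaths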
These two models are of course too simple, but they provide a benchmark nonetheless. Next, we are going to add a simple yearly linear trend as a predictor.
### Model 2a, mortality rate
m2a<-lm(deaths.pc ~ as.character(week) + counter + n_days, data=d.s)
predict.m2a.2019<-predict (m2a, newdata=d.2019)
print(mae.m2a.2019<-round (mean(abs(predict.m2a.2019 - d.2019$deaths.pc)), 2) )
## [1] 5.8
print(mse.m2a.2019<-round (mean((predict.m2a.2019 - d.2019$deaths.pc)^2), 0) )
## [1] 50
### Model 2b, number of deaths
m2b<-lm(deaths ~ as.character(week) + counter + n_days, data=d.s)
predict.m2b.2019<-predict (m2b, newdata=d.2019)
print(mae.m2b.2019<-round (mean(abs(predict.m2b.2019 - d.2019$deaths)), 2) )
## [1] 93.77
print(mse.m2b.2019<-round (mean((predict.m2b.2019 - d.2019$deaths)^2), 0))
## [1] 13151
These two models with the time trends perform much better. The coefficients for the linear trends are positive in both models (1.86 and 43.24), meaning that the mortality rate and the number of deaths increase over time. The MAE of the mortality rate model drops to 5.8, which makes for 5212 wrongly predicted deaths over the course of the year. The model of raw deaths performs even better, with 4876 wrongly predicted deaths. The MSEs of both models also improve relative to the models without the yearly time trends.
One way in which we can significantly improve the fit of the models is to include lagged values of the mortality rate and the number of deaths. Indeed, when we look at the autocorrelation (ACF) and partial autocorrelation (PACF) functions, there is very strong evidence for first- and second-order autocorrelation. This means that the values of mortality in the current week are strongly and positively correlated with the values of mortality one and, to a lesser extent, two weeks before.
par(mfrow=c(1,2))
acf(d.s$deaths, main='Number of deaths, 2010-2018', col='darkblue')
pacf(d.s$deaths, main='Number of deaths, 2010-2018', col='darkblue')
m4b.lags<-lm(deaths ~ as.character(week) + share.65 + n_days + lag(deaths) + lag(deaths,2), data=d.s)
Indeed, when we add the first and second lags of the number of deaths to our model (Model 4b.lags), the coefficients of both lags are positive and highly significant (0.3 and 0.24 respectively) and the fit of the model increases significantly as well (adjusted r-squared of 0.85, compared to 0.78 of the same model without the lags).
So shall we keep the lags in our model? Not so fast. Having a model with lags will significantly improve our ability to forecast the number of deaths one and two weeks ahead. But this is not our task. What we need to do is predict the expected pattern for the year ahead without using information from that year that is related to the phenomenon we study. If an external shock, such as COVID-19, starts increasing the number of deaths in week 20, we do not want our predictions to be conditional on the information from that week, because that will partially remove the effect of COVID-19 for weeks 21 and 22, even if it would improve our forecast of how many deaths we will observe in weeks 21 and 22. In fact, it is not even trivial to obtain predictions for 2019 from the model above, given the presence of the lags. Note that using contemporaneous external (exogenous) data for the year we are predicting, such as weather, is fine.
So far, we have treated each week as a categorical variable: in other words, the models included a dummy for each week. In reality, we would expect only gradual changes in mortality between consecutive weeks. The week dummies can result in over-fitting without improving the out-of-sample predictions. So we can try to replace the week dummies with a flexible time trend, for example a sine curve that goes smoothly down from winter until late summer and then up again.
m8b<-lm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + n_days, data=d.s)
predict.m8b.2019<-predict (m8b, newdata=d.2019)
print(mae.m8b.2019<-round(mean(abs(predict.m8b.2019 - d.2019$deaths)), 2))
## [1] 75.29
print(mse.m8b.2019<-round(mean((predict.m8b.2019 - d.2019$deaths)^2), 0))
## [1] 9036
The predictive performance of this model is even better than the current benchmark (Model 4b). Given that it is also based on a more principled approach than the week dummies, it is worth keeping it. The figure below shows the predictions from the sine curve model (in red) versus the observed data for 2019 (in blue) and the predictions from the model with the week dummies (in green).
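One caveat: the definition of Model 4b itself does not appear above. Based on the description of Model 4b.lags (the same model plus the lags), a plausible reconstruction that makes the plotting code below runnable is the week-dummy model with the share of 65-to-80-year-olds added; treat this as an assumption rather than the exact original specification.
# hypothetical reconstruction of Model 4b (week dummies + share of 65-to-80-year-olds)
m4b <- lm(deaths ~ as.character(week) + share.65 + n_days, data = d.s)
predict.m4b.2019 <- predict(m4b, newdata = d.2019)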
par(mfrow=c(1,1), mar=c(4,2,1,1))
plot (x = d.2019$week, y = predict.m8b.2019, ylim=c(min(c(predict.m8b.2019,d.2019$deaths, predict.m4b.2019)), max(c(predict.m8b.2019,d.2019$deaths, predict.m4b.2019))),
type='l', lwd=2, col='red', ylab='', xlab='Number of deaths per week in 2019: predicted (red and green) and observed (blue)')
lines (x = d.2019$week, y = d.2019$deaths, lwd=2, col='darkblue')
lines (x = d.2019$week, y = predict.m4b.2019, lwd=2, col='darkgreen')
points (x = d.2019$week, y = predict.m8b.2019, pch=21, col='red', bg='white')
points (x = d.2019$week, y = d.2019$deaths, pch=21, col='darkblue', bg='white')
points (x = d.2019$week, y = predict.m4b.2019, pch=21, col='darkgreen', bg='white')
All models so far were linear, but both the mortality rate and the number of deaths per week are overdispersed with higher frequencies of relatively large values than expected under a normal distribution, as we can see from the histograms below:
par(mfrow=c(1,2))
hist (d.s$deaths.pc, main='Mortality rate per week', col='darkgrey', ylab='', xlab='')
hist (d.s$deaths, main='Number of deaths per week', col='darkgrey', ylab='', xlab='')
In response, we can consider generalized linear models, which are more appropriate for dependent variables that are not normally distributed. In particular, we fit a negative binomial model, which can handle overdispersion, using the MASS package.
library(MASS)
m8c<-glm.nb(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + n_days, data=d.s)
summary(m8b)
##
## Call:
## lm(formula = deaths ~ sin(2 * pi * week/52) + cos(2 * pi * week/52) +
## share.65 + n_days, data = d.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -344.05 -91.33 -9.76 72.12 953.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1329.102 94.491 14.07 <2e-16 ***
## sin(2 * pi * week/52) 163.720 10.297 15.90 <2e-16 ***
## cos(2 * pi * week/52) 224.990 10.333 21.77 <2e-16 ***
## share.65 110.120 7.285 15.12 <2e-16 ***
## n_days 389.785 18.283 21.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.5 on 463 degrees of freedom
## Multiple R-squared: 0.7637, Adjusted R-squared: 0.7617
## F-statistic: 374.1 on 4 and 463 DF, p-value: < 2.2e-16
predict.m8c.2019<-predict (m8c, type='response', newdata=d.2019)
print(mae.m8c.2019<-round(mean(abs(predict.m8c.2019 - d.2019$deaths)), 2))
## [1] 89.44
print(mse.m8c.2019<-round(mean((predict.m8c.2019 - d.2019$deaths)^2), 0))
## [1] 12005
The predictive performance of the model against the observed 2019 data is pretty good, but no better than our benchmark so far. There still might be an advantage in sticking to the negative binomial model, because it is in principle more appropriate for handling overdispersed count data than a linear model. The disadvantage, however, is that it is more difficult to generate uncertainty estimates about the predictions from the negative binomial model (see below).
Some of the weeks in the 2010-2018 period have registered an abnormal number of deaths due to flu epidemics, heat waves or other causes, and these extreme observations can have undue influence on the models we build and the predictions we get from them. Running the regression diagnostics on Model 8b, we can see that three outlying observations in particular (425, 426, 427) have very large influence on the model.
par(mfrow=c(2,2))
plot(m8b)
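Besides the diagnostic plots, we can list the most influential observations numerically, for example with Cook's distance (a quick sketch; the 4/n cutoff is just a common rule of thumb, not part of the original analysis):
# identify the rows with the largest influence on Model 8b
cooks <- cooks.distance(m8b)
head(sort(cooks, decreasing = TRUE), 5) # the five most influential observations
which(cooks > 4 / nrow(d.s)) # observations above the rule-of-thumb cutoff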
We can remove these three observations, but a more principled approach is to use robust regression. Robust regression weights the observations differently based on how ‘unexpected’ they are: suspected outliers are not entirely excluded from the data, but contribute less to the model estimates. In R, the MASS package includes functions for estimating robust regressions. Note that robust regression is different from using robust standard errors.
m8d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + n_days, data=d.s, psi = psi.huber)
predict.m8d.2019<-predict (m8d, newdata=d.2019, type='response')
print(mae.m8d.2019<-round(mean(abs(predict.m8d.2019 - d.2019$deaths)), 2))
## [1] 72.06
print(mse.m8d.2019<-round(mean((predict.m8d.2019 - d.2019$deaths)^2),0))
## [1] 8900
The model above has the best predictive performance yet, with a MAE of 72.06, which beats our benchmark. The figure below compares the predictions from the robust model (in red) to the ones from the non-robust equivalent Model 8b (in green).
par(mfrow=c(1,1), mar=c(4,2,1,1))
plot (x = d.2019$week, y = predict.m8d.2019, ylim=c(min(c(predict.m8d.2019,d.2019$deaths,predict.m8b.2019)), max(c(predict.m8b.2019,d.2019$deaths,predict.m8d.2019))),
type='l', lwd=2, col='red', ylab='', xlab='Number of deaths per week in 2019: predicted (red and green) and observed (blue)')
lines (x = d.2019$week, y = d.2019$deaths, lwd=2, col='darkblue')
lines (x = d.2019$week, y = predict.m8b.2019, lwd=2, col='darkgreen')
points (x = d.2019$week, y = predict.m8d.2019, pch=21, col='red', bg='white')
points (x = d.2019$week, y = d.2019$deaths, pch=21, col='darkblue', bg='white')
points (x = d.2019$week, y = predict.m8b.2019, pch=21, col='darkgreen', bg='white')
There is a robust version of the negative binomial model as well, but the implementation in R is not very user-friendly if one wants to make predictions from the model or get uncertainty estimates in the form of prediction intervals (see here for details).
Some of the outliers in the data might be due to extreme weather - weeks with very low or very high temperatures. We can add temperature data to the model to check whether it makes any difference. The rnoaa package provides access to open historical weather data from around the world, including many stations in the Netherlands. We will use this package to extract a time series of the minimum and maximum daily temperatures from one Dutch weather station, and we will aggregate the daily values into weekly ones to match the mortality and population data. (You will need to request a free API key to access the weather data.)
### Getting weather data
library(rnoaa)
# set key
# make a function to download per year
get_weather <- function (year, datatype){
temp <-ncdc(datasetid='GHCND', stationid='GHCND:NLM00006235', datatypeid=datatype,
startdate = paste0(year,'-01-01'), enddate = paste0(year,'-12-31'),limit=732)}
# apply the function for all years between 2008 and 2020
for (i in 2008:2020){
assign(paste0('nlw', i), get_weather (year = i, datatype = c('TMIN', 'TMAX')))
}
# merge all yearly datasets
nlw <- ncdc_combine(nlw2008, nlw2009, nlw2010, nlw2011, nlw2012, nlw2013, nlw2014, nlw2015, nlw2016, nlw2017, nlw2018, nlw2019, nlw2020)
# make new variables
nlw <- nlw$data %>%
dplyr::select (date, value, datatype) %>%
mutate (date = substr(date, 1, 10),
temp = value/10) %>%
dplyr::select (-value) %>%
spread(datatype, temp)
# aggregate at week level
nlw.w <- nlw %>%
read.zoo() %>%
apply.weekly(mean, na.rm=T) %>%
transform (year = as.numeric(as.character(strftime(.,format = "%Y"))),
week = as.numeric(as.character(strftime(.,format = "%V"))),
date = as.character(attributes(.)$index))
# manually clean up some week number issues
nlw.w$id <- paste0(nlw.w$year, '.', trimws(nlw.w$week))
nlw.w$id <- ifelse (nlw.w$date =='2017-01-01', '2016.52',nlw.w$id)
nlw.w$id <- ifelse (nlw.w$date =='2012-01-01', '2011.52',nlw.w$id)
nlw.w$id <- ifelse (nlw.w$date =='2011-01-02', '2010.52',nlw.w$id)
nlw.w <- nlw.w[nlw.w$week!=53,]
# transform variables
nlw.w = data.frame(nlw.w) %>%
mutate (temp.max = round(as.numeric(as.character(TMAX)), 2),
temp.min = round(as.numeric(as.character(TMIN)), 2),
t.min = case_when (temp.min>0 ~ 0, TRUE ~ 0-temp.min),
t.max = case_when (temp.max<20 ~ 0, TRUE ~temp.max-20)
) %>%
dplyr::select (id, temp.max, temp.min, t.min, t.max)
Now we can attach the newly-acquired weather data to the rest of the data.
d$id <- paste0(d$year, '.', d$week)
d <- left_join(d, nlw.w, by='id')
# Now is a good moment to save the data
write.csv(d, file='nl_mortality_data.csv')
# Filter again data before 2010 and after 2018 out
d.s <- d %>%
filter (week > 0, week <= w) %>%
filter (year > 2009, year < 2019)
# Collect the 2019 data in a separate dataset
d.2019 <- d %>%
filter (week > 0, week <= w) %>%
filter (year == 2019)
We have to decide how to enter temperature in the models. We would expect that temperature in a normal range does not influence mortality, but that below and above some thresholds it does. To decide where these thresholds are, we can plot the residuals from one of our models (Model 8b) against the weekly minimum and maximum temperatures and check at which points the residuals start to increase.
par(mfrow=c(2,1))
lw1 <- loess(m8b$residuals ~ d.s$temp.min)
plot( d.s$temp.min, m8b$residuals, xlab='Weekly average of daily minimum temperature', ylab='residuals')
j <- order(d.s$temp.min)
lines(d.s$temp.min[j],lw1$fitted[j],col="red",lwd=3)
abline(h=0, col='darkblue')
lw2 <- loess(m8b$residuals ~ d.s$temp.max)
plot( d.s$temp.max, m8b$residuals, xlab='Weekly average of daily maximum temperature', ylab='residuals')
j <- order(d.s$temp.max)
lines(d.s$temp.max[j],lw2$fitted[j],col="red",lwd=3)
abline(h=0, col='darkblue')
As we can see, the residuals start to increase once the minimum temperature drops below 0 degrees (Celsius) and once the maximum temperature goes above 20 degrees. That’s why, in the weather-data code above, we recoded the minimum and maximum temperatures to 0 within the normal 0-to-20-degree range, and inverted the minimum temperatures, so that higher values of t.min and t.max mean a greater distance from the normal range. Now we are ready to add temperature to the models.
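Before we do, here is what that recoding does to a few example weekly temperatures (a small sketch repeating the case_when logic from the weather-data code above):
# example: weekly minimum temperatures of -5 and 3 degrees, maximums of 25 and 15 degrees
example.min <- c(-5, 3)
example.max <- c(25, 15)
case_when(example.min > 0 ~ 0, TRUE ~ 0 - example.min) # gives 5, 0: degrees below the 0-degree threshold
case_when(example.max < 20 ~ 0, TRUE ~ example.max - 20) # gives 5, 0: degrees above the 20-degree threshold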
m9d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + n_days + t.min + t.max, data=d.s, psi = psi.huber)
summary(m9d)
##
## Call: rlm(formula = deaths ~ sin(2 * pi * week/52) + cos(2 * pi * week/52) +
## share.65 + n_days + t.min + t.max, data = d.s, psi = psi.huber)
## Residuals:
## Min 1Q Median 3Q Max
## -304.303 -72.412 -2.175 76.981 1015.209
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 1548.0855 76.3247 20.2829
## sin(2 * pi * week/52) 154.6607 8.4995 18.1964
## cos(2 * pi * week/52) 223.7991 9.0839 24.6368
## share.65 91.1176 5.8853 15.4821
## n_days 388.1784 14.6404 26.5142
## t.min 17.5293 6.9451 2.5240
## t.max 37.9261 6.4541 5.8763
##
## Residual standard error: 112.7 on 461 degrees of freedom
predict.m9d.2019<-predict (m9d, newdata=d.2019)
print(mae.m9d.2019<-round(mean(abs(predict.m9d.2019 - d.2019$deaths)), 2) )
## [1] 61.43
print(mse.m9d.2019<-round(mean((predict.m9d.2019 - d.2019$deaths)^2), 0) )
## [1] 5778
The addition improves the models quite a bit. Both the minimum and maximum temperatures have big and significant effects. The MAE of the predictions drops to 61.43, which makes for only 3194 mis-predicted deaths for the whole year 2019. The negative binomial alternative is only slightly worse:
m9c<-glm.nb(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + n_days + t.min + t.max, data=d.s)
predict.m9c.2019<-predict (m9c, type='response', newdata=d.2019)
print(mae.m9c.2019<-round(mean(abs(predict.m9c.2019 - d.2019$deaths)), 2))
## [1] 63.73
print(mse.m9c.2019<-round(mean((predict.m9c.2019 - d.2019$deaths)^2), 0))
## [1] 5980
To sum up, the best models so far: (1) model the raw number of deaths rather than the mortality rate, (2) include a sine wave to model variation between weeks over the course of the year rather than week dummies, (3) include predictors for temperature below 0 and above 20 degrees and a variable tracking the number of days in the week, (4) include the share of old people (between 65 and 80 works best) rather than a linear yearly trend or the total population, even when the latter is measured on a monthly basis, and (5) are based on either a robust linear or a negative binomial specification. The best model has rather impressive performance: the total number of mis-predicted deaths for 2019 is 3314, which makes for around 2% of all recorded deaths in 2019. Moreover, the best model would predict a total of 152773 deaths, which is very, very close to the actual 151885 deaths that were recorded in 2019 (the difference is less than 1%).
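The annual totals quoted above can be recovered directly from the weekly predictions; a quick sketch, assuming they refer to the negative binomial model with temperature (m9c) estimated above:
# total predicted versus observed deaths in 2019
sum(predict.m9c.2019) # total predicted deaths for 2019
sum(d.2019$deaths) # total observed deaths for 2019
sum(predict.m9c.2019) - sum(d.2019$deaths) # difference between predicted and observed totals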
Of course, there is nothing special about evaluating our out-of-sample predictions on the data from 2019, so we can repeat the steps above but use the data from 2018, 2017, 2016 and 2015 as the prediction evaluation benchmarks. The code below specifies a list of models, re-estimates the models on different subsets of the data, compares the predictive performance of these models against different benchmark years, and summarizes the results (MAE in particular) in a table.
m1b<-lm(deaths ~ as.character(week) + n_days, data=d.s)
m2b<-lm(deaths ~ as.character(week) + counter + n_days, data=d.s)
m3b<-lm(deaths ~ as.character(week) + t.min + t.max + counter + n_days, data=d.s)
m3c<-glm.nb(deaths ~ as.character(week) + t.min + t.max + counter + n_days, data=d.s)
m2d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + counter, data=d.s, psi = psi.huber)
m3d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + pop, data=d.s, psi = psi.huber)
m4d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + share.65, data=d.s, psi = psi.huber)
m5d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + share.80, data=d.s, psi = psi.huber)
m6d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + share.65plus, data=d.s, psi = psi.huber)
m7d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + pop.m, data=d.s, psi = psi.huber)
m4c<-glm.nb(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + share.65, data=d.s)
models <- c('m1b', 'm2b', 'm3b', 'm3c', 'm2d', 'm3d', 'm7d', 'm5d', 'm6d', 'm4d','m4c')
out2 <- data.frame (matrix(NA, nrow = length(models), ncol = 2 + 1*5))
colnames(out2) <- c('Model','Features', paste0('MAE.', 2015:2019))
out2$Model <- models
out2$Features <- c('weeks, linear', 'trend, weeks, linear', 'temp, trend, weeks, linear', 'temp, trend, weeks, negbin',
'temp, trend, sine, robust', 'temp, pop, sine, robust','temp, month.pop, sine, robust',
'temp, share.80+, sine, robust', 'temp, share.65+, sine, robust',
'temp, share.65-80, sine, robust', 'temp, share.65-80, sine, negbin')
for (i in 2015:2019) {
d.model <- d %>%
filter (week > 0, week <= w) %>%
filter (year > (i - 9), year < i)
d.eval <- d %>%
filter (week > 0, week <= w) %>%
filter (year == i)
for (j in 1:length(models)){
out2[j,2+(i-2014)]<-round(mean(abs(predict (update(get(models[j]), data=d.model), type='response', newdata=d.eval) - d.eval$deaths)), 0)
# assign(paste0('mae.', models[j],'.', i), round(mean(abs(predict (update(get(models[j]), data=d.model), type='response', newdata=d.eval) - d.eval$deaths)), 2)) # use this if we want to extract the predictions themselves
}
}
out2$MAE.MEAN<-round(rowMeans(out2[,3:7]), 0)
library(knitr) # provides kable() for printing the table
kable(out2, caption = 'Comparing the predictive performance of the models against 2015 to 2019 data', align=c('l',rep('r',7)))
Model | Features | MAE.2015 | MAE.2016 | MAE.2017 | MAE.2018 | MAE.2019 | MAE.MEAN |
---|---|---|---|---|---|---|---|
m1b | weeks, linear | 207 | 193 | 193 | 218 | 150 | 192 |
m2b | trend, weeks, linear | 135 | 94 | 124 | 171 | 98 | 124 |
m3b | temp, trend, weeks, linear | 130 | 88 | 126 | 171 | 88 | 121 |
m3c | temp, trend, weeks, negbin | 132 | 87 | 126 | 165 | 86 | 119 |
m2d | temp, trend, sine, robust | 143 | 99 | 135 | 180 | 62 | 124 |
m3d | temp, pop, sine, robust | 147 | 102 | 141 | 191 | 74 | 131 |
m7d | temp, month.pop, sine, robust | 148 | 100 | 146 | 196 | 71 | 132 |
m5d | temp, share.80+, sine, robust | 142 | 104 | 133 | 173 | 62 | 123 |
m6d | temp, share.65+, sine, robust | 139 | 98 | 132 | 172 | 60 | 120 |
m4d | temp, share.65-80, sine, robust | 139 | 98 | 132 | 172 | 60 | 120 |
m4c | temp, share.65-80, sine, negbin | 135 | 83 | 127 | 169 | 64 | 116 |
The results are not very clear cut. First, it seems that 2019 was an ‘easy’ year to predict - all models have their best predictive performance in that year, followed by 2016. The prediction errors are considerably bigger in 2015 and 2017, and even more so in 2018. This does not necessarily indicate a problem with the models: it could be that 2018 was indeed exceptional in terms of mortality due to a heavy flu epidemic or other causes. It just means that 2018 was less predictable than 2016 or 2019, given our predictors and modeling approach.
Comparing the relative performance of the different models, there are no huge differences when we look at the averages over all years (last column), and in different years different model specifications come out on top. Model 4c, a negative binomial specification that includes a sine wave, temperature extremes and the share of the 65-to-80 population, has the lowest average MAE overall. But it is the best model only when evaluated against the data from 2016. The robust linear specification with the same predictors is third overall and the best in 2019, as we already noted before. Somewhat surprisingly, a linear model that has a simple linear trend, week dummies and temperature as predictors (Model 3b in the table above) performs very well on average, and it is the best model in 2015. The negative binomial version (Model 3c) is second overall and the best one in 2018. The one model that is clearly much worse than the rest is the linear model with week dummies only, which is the equivalent of taking the weekly averages from the previous years. Let’s note again that this remains a popular approach despite its obvious shortcomings.
The list of models in the table above does not exhaust all possible combinations of predictors and other modeling choices. For example, we can run the same set of models with the mortality rate as the dependent variable or include several sub-population shares in the same model simultaneously. I tried some of these variations and found that working with mortality rates is no better than, though not much worse than, working with the raw number of deaths, and that entering more than one population-based variable is not helpful. But you are free to adapt the code to try different variations yourself.
Before we conclude the overview of possible modeling approaches, let’s try one more thing. What if we had data for some of the early weeks of the year for which we are trying to predict mortality? Could we use this information to improve the predictive accuracy of the model for the rest of the year? We can simulate this situation in the following way: first, we calculate the mean number of deaths in the first weeks (weeks 2 to 10) of each year and attach this variable to the dataset; then we exclude weeks 1 to 10 from the dataset; then we estimate the models on the remaining weeks, adding the new variable capturing mortality in the first weeks of the year; lastly, we predict from the model for 2019, using the observed mean mortality in weeks 2 to 10 of 2019 as an additional predictor. To conclude that adding this variable helps, we would want to see an improvement in the predictive accuracy compared to a model without it.
# get the subsets of the data
d.s2<-d.s %>%
group_by(year) %>%
mutate(mean10 = mean(deaths[2:10])) %>% # add the yearly mean mortality during weeks 2 to 10
filter(week>10) # remove the weeks used for the calculation.
d.2019.2<-d.2019 %>%
mutate(mean10 = mean(deaths[2:10])) %>%
filter(week>10)
# re-estimate one of our models on the new subsetted data
m4e<-glm.nb(deaths ~ sin(2*pi*week/42) + cos(2*pi*week/42) + t.min + t.max + n_days + share.65, data=d.s2)
predict.m4e.2019<-predict (m4e, newdata=d.2019.2, type='response')
print(mae.m4e.2019<-round(mean(abs(predict.m4e.2019 - d.2019.2$deaths)), 2))
## [1] 72.61
# now add the new variable and compare
m4f<-glm.nb(deaths ~ sin(2*pi*week/42) + cos(2*pi*week/42) + t.max + t.min + n_days + share.65 + mean10, data=d.s2)
summary(m4f)
##
## Call:
## glm.nb(formula = deaths ~ sin(2 * pi * week/42) + cos(2 * pi *
## week/42) + t.max + t.min + n_days + share.65 + mean10, data = d.s2,
## init.theta = 874.2434301, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2711 -0.5815 -0.0517 0.5658 6.1952
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.504e+00 2.651e-02 283.071 < 2e-16 ***
## sin(2 * pi * week/42) 7.484e-02 3.187e-03 23.483 < 2e-16 ***
## cos(2 * pi * week/42) 5.531e-03 2.850e-03 1.941 0.05230 .
## t.max 1.449e-02 2.051e-03 7.067 1.59e-12 ***
## t.min 1.432e-02 5.216e-03 2.745 0.00605 **
## n_days 1.139e-01 6.230e-03 18.288 < 2e-16 ***
## share.65 2.919e-02 3.697e-03 7.898 2.85e-15 ***
## mean10 2.465e-07 1.243e-05 0.020 0.98417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(874.2434) family taken to be 1)
##
## Null deviance: 1696.1 on 377 degrees of freedom
## Residual deviance: 373.7 on 370 degrees of freedom
## AIC: 4598.6
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 874.2
## Std. Err.: 83.4
##
## 2 x log-likelihood: -4580.598
predict.m4f.2019<-predict (m4f, type='response', newdata=d.2019.2)
print(mae.m4f.2019<-round(mean(abs(predict.m4f.2019 - d.2019.2$deaths)), 2))
## [1] 72.67
The coefficient for the new variable - the mean mortality observed in the first weeks of the year - is not significant and the model with this variable does not have lower MAE than the one without it. We can conclude that adding this information does not improve our predictions for the year ahead.
Having reviewed a large number of models, we can settle on two contender models - one based on a robust linear regression specification and one based on a negative binomial specification - both having the raw number of deaths as the dependent variable and a sine trend, the share of 65-to-80-year-olds, temperature and the number of days in the week as predictors. What remains is to derive some measure of the uncertainty of our predictions. There are various sources of uncertainty relevant for the predictions, but our focus is on the uncertainty of the predictions conditional on the estimated model (thus ignoring uncertainty about the correct model specification itself).
To measure the uncertainty of the model predictions, we can use so-called prediction intervals, which quantify the uncertainty about individual future observations. These should not be confused with confidence intervals, which measure uncertainty about the mean (expected) value of the outcome conditional on the predictors and which are in general considerably narrower.
For the robust linear regression model, we can easily obtain the prediction intervals just by requesting them from the predict() function, which we have already been using.
m4d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + share.65, data=d.s, psi = psi.huber)
predict.m4d.2019<-predict (m4d, newdata=d.2019, se.fit=TRUE,interval="prediction", level=0.95)
predict.m4d.2019c<-predict (m4d, newdata=d.2019, se.fit=TRUE,interval="confidence", level=0.95)
predict.m4d.2019.mean <- predict.m4d.2019$fit[,1]
predict.m4d.2019.plus <- predict.m4d.2019$fit[,3]
predict.m4d.2019.minus <- predict.m4d.2019$fit[,2]
predict.m4d.2019.c.plus <- predict.m4d.2019c$fit[,3]
predict.m4d.2019.c.minus <- predict.m4d.2019c$fit[,2]
par(mfrow=c(1,1), mar=c(4,2,1,1))
plot (x = d.2019$week, y = predict.m4d.2019.mean, ylim=c(min(c(predict.m4d.2019.minus,d.2019$deaths)), max(c(predict.m4d.2019.plus,d.2019$deaths))),
type='l', lwd=2, col='red', ylab='', xlab='Mortality in 2019: predicted (red) and observed (blue)')
points (x = d.2019$week, y = predict.m4d.2019.mean, pch=21, col='red', bg='white')
lines (x = d.2019$week, y = d.2019$deaths, lwd=2, col='darkblue')
points (x = d.2019$week, y = d.2019$deaths, pch=21, col='darkblue', bg='white')
lines (x = d.2019$week, y = predict.m4d.2019.plus, lwd=1, col='red', lty=2)
lines (x = d.2019$week, y = predict.m4d.2019.minus, lwd=1, col='red', lty=2)
lines (x = d.2019$week, y = predict.m4d.2019.c.plus, lwd=1, col='darkgrey', lty=3)
lines (x = d.2019$week, y = predict.m4d.2019.c.minus, lwd=1, col='darkgrey', lty=3)
As we can see, the prediction intervals are very large. Indeed, they appear too conservative given that none of the observed values in 2019 lie outside the intervals. The gray lines show the confidence intervals just for illustration: these are much narrower, but they are not appropriate for estimating the uncertainty of our predictions. We can check the coverage of the prediction intervals on the data from which the model has been estimated.
predict.m4d.2019.all<-predict (m4d, se.fit=TRUE,interval="prediction", level=0.95)
predict.m4d.2019.all.mean <- predict.m4d.2019.all$fit[,1]
predict.m4d.2019.all.plus <- predict.m4d.2019.all$fit[,3]
predict.m4d.2019.all.minus <- predict.m4d.2019.all$fit[,2]
par(mfrow=c(1,1), mar=c(4,2,1,1), bty='o')
plot (x = 1:468, y = predict.m4d.2019.all.mean, ylim=c(min(c(predict.m4d.2019.all.minus,d.s$deaths)), max(c(predict.m4d.2019.all.plus,d.s$deaths))),
type='l', lwd=2, col='red', ylab='', xlab='Mortality 2010-2018: predicted (red) and observed (blue)')
lines (x = 1:468, y = d.s$deaths, lwd=0.5, col='blue')
points (x = 1:468, y = d.s$deaths, pch=21, col='blue', bg='white', cex=0.75)
lines (x = 1:468, y = predict.m4d.2019.all.plus, lwd=1, col='red', lty=2)
lines (x = 1:468, y = predict.m4d.2019.all.minus, lwd=1, col='red', lty=2)
abline(v=seq(1,468,52), col='grey', lwd=0.5)
In the ‘training’ dataset from which the model has been estimated, the prediction intervals have proper coverage, with occasional values outside the 95% prediction intervals. These are mostly associated with the peaks of flu epidemics during the winter weeks of some years. Since there is nothing in our model that captures the influence of flu epidemics (other than the seasonal cycle and the minimum temperature), the model is not very good at predicting these, and as a result the prediction intervals are quite large. Note that the model rarely errs in the opposite direction (predicting too many deaths in a week): almost all observations outside the 95% prediction intervals are cases of underprediction.
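We can also put a number on that coverage, using the in-sample prediction-interval bounds computed above (a quick sketch):
# empirical coverage of the 95% prediction intervals in the 2010-2018 data
covered <- d.s$deaths >= predict.m4d.2019.all.minus & d.s$deaths <= predict.m4d.2019.all.plus
mean(covered) # share of weeks inside the intervals
sum(d.s$deaths > predict.m4d.2019.all.plus) # weeks underpredicted (observed above the upper bound)
sum(d.s$deaths < predict.m4d.2019.all.minus) # weeks overpredicted (observed below the lower bound)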
Does the negative binomial specification, which is specifically designed for overdispersion in the dependent variable, handle this better? It is only slightly more complicated to get prediction intervals from the negative binomial model.
library(ciTools)
m4c<-glm.nb(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + t.min + t.max + n_days + share.65, data=d.s)
predict.m4c.2019<-add_pi(fit=m4c, df=d.2019) # predictions for 2019
predict.m4c.2019.all<-add_pi(fit=m4c, df=d.s) # in-sample predictions
As we can see, there are no huge differences, other than for the misbehaved Weeks 1 and 52.
par(mfrow=c(1,1), mar=c(4,2,1,1))
plot (x = d.2019$week, y = predict.m4c.2019$pred, ylim=c(min(c(predict.m4c.2019$LPB0.025,d.2019$deaths)), max(c(predict.m4c.2019$UPB0.975,d.2019$deaths))),
type='l', lwd=2, col='red', ylab='', xlab='Mortality in 2019: predicted (red) and observed (blue)')
lines (x = d.2019$week, y = d.2019$deaths, lwd=2, col='darkblue')
points (x = d.2019$week, y = d.2019$deaths, pch=21, col='darkblue', bg='white')
points (x = d.2019$week, y = predict.m4c.2019$pred, pch=21, col='red', bg='white')
lines (x = d.2019$week, y = predict.m4c.2019$LPB0.025, lwd=1, col='red', lty=2)
lines (x = d.2019$week, y = predict.m4c.2019$UPB0.975, lwd=1, col='red', lty=2)
par(mfrow=c(1,1), mar=c(4,2,1,1))
plot (x = 1:468, y = predict.m4c.2019.all$pred, ylim=c(min(c(predict.m4c.2019.all$LPB0.025,d.s$deaths)), max(c(predict.m4c.2019.all$UPB0.975,d.s$deaths))),
type='l', lwd=2, col='red', ylab='', xlab='Mortality 2010-2018: predicted (red) and observed (blue)')
lines (x = 1:468, y = d.s$deaths, lwd=0.5, col='darkblue')
points (x = 1:468, y = d.s$deaths, pch=21, col='darkblue', bg='white')
lines (x = 1:468, y = predict.m4c.2019.all$LPB0.025, lwd=1, col='red', lty=2)
lines (x = 1:468, y = predict.m4c.2019.all$UPB0.975, lwd=1, col='red', lty=2)
abline(v=seq(1,468,52), col='grey', lwd=1)
Finally, we are ready to unleash our models on the 2020 mortality data. We have selected predictors, we have identified both robust linear regression and negative binomial regression as appropriate modeling choices, and we can quantify prediction uncertainty from the models. Let’s do it then!
# Model data
d.model <- d %>%
filter (week > 0, week <= w) %>%
filter (year > 2009, year < 2020)
# Evaluation data from 2020
d.eval <- d %>%
filter (week > 0, week <= w) %>%
filter (year == 2020)
# Robust linear regression
m4d<-rlm(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + t.min + t.max + n_days, data=d.model, psi = psi.huber)
summary(m4d)
##
## Call: rlm(formula = deaths ~ sin(2 * pi * week/52) + cos(2 * pi * week/52) +
## share.65 + t.min + t.max + n_days, data = d.model, psi = psi.huber)
## Residuals:
## Min 1Q Median 3Q Max
## -327.495 -71.589 -2.209 74.649 1005.207
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 1514.2461 65.7462 23.0317
## sin(2 * pi * week/52) 156.4283 7.6833 20.3594
## cos(2 * pi * week/52) 228.4216 8.1860 27.9041
## share.65 93.9494 5.0053 18.7701
## t.min 16.2728 6.5974 2.4665
## t.max 38.5986 5.4539 7.0772
## n_days 392.8631 13.5146 29.0695
##
## Residual standard error: 109.6 on 513 degrees of freedom
predict.m4d.2020<-predict (m4d, newdata=d.eval)
print(mae.m4d.2020<-round(mean(abs(predict.m4d.2020 - d.eval$deaths)), 2))
## [1] 356.98
print(mse.m4d.2020<-round(mean((predict.m4d.2020 - d.eval$deaths)^2), 0))
## [1] 366129
# Negative binomial regression
m4c<-glm.nb(deaths ~ sin(2*pi*week/52) + cos(2*pi*week/52) + share.65 + t.min + t.max + n_days, data=d.model)
predict.m4c.2020<-predict (m4c, type='response', newdata=d.eval)
print(mae.m4c.2020<-round(mean(abs(predict.m4c.2020 - d.eval$deaths)), 2))
## [1] 348.68
print(mse.m4c.2020<-round(mean((predict.m4c.2020 - d.eval$deaths)^2), 0))
## [1] 327988
The MAE of the predictions is 357, which implies about 18,560 ‘mispredicted’ deaths over the 52 weeks of the year, and the MSE is 366129. Both of these measures are way outside the range of what we got when we evaluated our models against any year between 2015 and 2019. Our predictions are about twice as bad as the predictions for 2018, which were the worst ones so far. This means that mortality during 2020 was very, very unusual and unpredictable by models that worked very well for predicting mortality in any year between 2015 and 2019.
Now let’s add prediction intervals and plot. The blue line and dots show the observed mortality during 2020 (as currently measured by the CBS). The red lines show the predictions from the robust linear regression model together with the 95% prediction intervals, and the green lines show the same for the negative binomial model. There are negligible differences between the two model versions, although the prediction intervals for the negative binomial model are slightly larger. Note how many observations are outside of the 95% prediction intervals, and remember how none of the observed values for 2019 were even close to the boundaries of the prediction intervals.
predict.m4d.2020.2<-predict (m4d, newdata=d.eval, se.fit=TRUE, interval="prediction", level=0.95)
predict.m4d.2020.mean <- predict.m4d.2020.2$fit[,1]
predict.m4d.2020.plus <- predict.m4d.2020.2$fit[,3]
predict.m4d.2020.minus <- predict.m4d.2020.2$fit[,2]
predict.m4c.2020.2<-add_pi(fit=m4c, df=d.eval) # predictions for 2020
par(mfrow=c(1,1), mar=c(4,2,1,1))
plot (x = d.eval$week, y = predict.m4d.2020.mean, ylim=c(min(c(predict.m4d.2020.minus,d.eval$deaths)), max(c(predict.m4d.2020.plus,d.eval$deaths))),
type='l', lwd=2, col='red', ylab='', xlab='Mortality in 2020: predicted (red and green) and observed (blue)')
lines (x = d.eval$week, y = d.eval$deaths, lwd=2, col='darkblue')
points (x = d.eval$week, y = d.eval$deaths, pch=16, col='darkblue')
lines (x = d.eval$week, y = predict.m4c.2020.2$pred, lwd=2, col='darkgreen')
lines (x = d.eval$week, y = predict.m4c.2020.2$LPB0.025, lwd=1, col='darkgreen', lty=2)
lines (x = d.eval$week, y = predict.m4c.2020.2$UPB0.975, lwd=1, col='darkgreen', lty=2)
lines (x = d.eval$week, y = predict.m4d.2020.plus, lwd=1, col='red', lty=2)
lines (x = d.eval$week, y = predict.m4d.2020.minus, lwd=1, col='red', lty=2)
To be precise, according to the best models we have, excess mortality in 2020 has been 15613 deaths. This amounts to 10.21% of all predicted deaths during the year. Should all excess mortality be attributed to COVID-19? There have been no wars or natural disasters, there is no evidence of a massive flu epidemic during the year, and the mortality due to the summer heatwave seems to be predicted reasonably well by the model, so it is already taken into account. This leaves COVID-19 as the force behind the excess mortality.
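For reference, here is how these headline figures can be computed from the 2020 predictions (a sketch using the robust linear model; whether the quoted 15613 comes from this model, the negative binomial one, or an average of the two is an assumption here):
# excess mortality in 2020: observed deaths minus predicted (expected) deaths
excess.2020 <- sum(d.eval$deaths) - sum(predict.m4d.2020)
excess.2020 # total excess deaths in 2020
round(100 * excess.2020 / sum(predict.m4d.2020), 2) # excess as a percentage of all predicted deaths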
Coming back to the original goal of this modeling exercise, we can say that our best estimate for the impact of COVID-19 on mortality in the Netherlands in 2020 is an excess of 15613 deaths.
In the course of the evaluation exercise we found out that some modeling decisions made considerable difference to the predictive performance of the models, while others did not, or did so only occasionally. What recommendations about modeling mortality data can we make on the basis of our evaluation exercise?
Let’s note again that taking only the weekly averages from past years to predict mortality in 2020 and identify excess mortality proved to be the worst method in our evaluation exercise. Having only a yearly trend is also inadequate. Remarkably, even the state-of-the-art models and estimates provided by EUROMOMO seem to be based on a model that does not include population numbers or sub-population shares, which we saw provide considerable predictive leverage (see for example here or the model on page 48 here).
It is interesting and important to discuss further how the estimates of excess mortality provided here compare to other estimates and to the official death toll of the pandemic measured from individual death certificates. We can also explore how excess mortality is distributed across different sex and age groups. And we can make much better graphs visualizing the impact of COVID-19 on mortality in 2020. But all this requires a separate post.
For now, we’re done!