Short-term local predictions of COVID-19 in the UK using dynamic supervised machine learning algorithms


Our primary goal was to develop data-driven machine learning models for 1-, 2-, and 3-week-ahead predictions of growth rates in COVID-19 cases (defined as the 1-, 2-, and 3-week growth rates, respectively) at the level of the lower-tier local authority (LTLA) in the UK. In the UK, COVID-19 cases are reported both by date of publication (i.e., the date the case was registered in the reporting system) and by date of specimen collection. There were therefore six prediction targets in our study: the 1-, 2-, and 3-week growth rates by date of publication and those by date of specimen collection (Table 1). We focused on prediction by publication date in the main models, because delayed reporting of cases by specimen collection date could affect the real-time assessment of model performance (i.e., the predictions would be biased downward due to delayed reporting).

Table 1 Prediction Goals.

Data sources

We analyzed the Google Search Trends symptoms dataset [5], the Google Community Mobility Reports [19,20], COVID-19 vaccination coverage, and the number of confirmed COVID-19 cases for the UK [1]. These data were formatted, aggregated from daily to weekly level where necessary, and then linked by week and LTLA. We only considered the time series from June 1, 2020 (defined as week 1) for modeling, as LTLA-level case reporting after that date was relatively consistent and reliable. Modeling work began on May 15, 2021 and has been continuously updated with the latest available data since then; when models were modified, only the versions of the data that were available in real time were used. In this study, we used November 14, 2021 as the reporting cut-off (i.e., data between June 1, 2020 and November 14, 2021 were included for modeling), although our model is regularly updated.
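The weekly aggregation and linkage described above can be sketched as follows. This is a minimal illustration, not the study's code (the analyses were done in R): the function names, the record layout, and the LTLA label `LTLA_A` are hypothetical, and only the week-1 start date comes from the text.

```python
from collections import defaultdict
from datetime import date

WEEK1_START = date(2020, 6, 1)  # week 1 begins on June 1, 2020, as stated in the text


def week_index(day: date) -> int:
    """Return the 1-based study week that contains `day`."""
    return (day - WEEK1_START).days // 7 + 1


def weekly_totals(daily_records):
    """Sum daily (ltla, day, count) records into {(ltla, week): total}."""
    totals = defaultdict(int)
    for ltla, day, count in daily_records:
        totals[(ltla, week_index(day))] += count
    return dict(totals)


def link_by_week_and_ltla(*tables):
    """Inner-join dicts keyed by (ltla, week); keep keys present in every table."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {key: tuple(table[key] for table in tables) for key in sorted(shared)}


# Hypothetical daily case counts for one made-up LTLA.
daily = [
    ("LTLA_A", date(2020, 6, 1), 3),
    ("LTLA_A", date(2020, 6, 7), 2),
    ("LTLA_A", date(2020, 6, 8), 5),
]
cases = weekly_totals(daily)
mobility = {("LTLA_A", 1): -12.5}  # hypothetical weekly mobility value
linked = link_by_week_and_ltla(cases, mobility)
```

Under this indexing, the week of March 1, 2021 is week 40, which matches the first checkpoint label used later in the text.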

Google’s symptom search trends show the relative popularity of symptoms in searches within a geographic area over time [21]. We used the percentage change in symptom searches for each week during the pandemic compared with the pre-pandemic period (the three-year average for the same week in 2017-2019). We included 173 symptoms in the analyses, for which the search trends had a high degree of completeness. These search trends were provided at the upper-tier local authority level and extrapolated to each LTLA.

Google’s mobility dataset records daily population mobility relative to a baseline for six location categories, namely workplaces, residential areas, parks, retail and recreation, grocery and pharmacy, and transit stations [22]. The weekly averages of each of the six mobility metrics for each LTLA were the model inputs. Mobility in the Hackney and City of London LTLAs was averaged, as the two were grouped into one LTLA in other datasets; Cornwall and Isles of Scilly were also combined.

The COVID-19 vaccination coverage dataset records, for each day, the cumulative percentage of the population vaccinated with the first dose and with the second dose. Before the start of the vaccination rollout (December 7, 2020 for the first dose and December 28, 2020 for the second dose), coverage was taken to be zero. We used the weekly maximum cumulative percentage of people vaccinated with the first dose and with the second dose for each LTLA in our models.

Missing values for symptom search trends, mobility, and vaccination coverage were imputed using linear interpolation for each LTLA [23]. Thirteen LTLAs were excluded because the data were insufficient to allow for linear interpolation.
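As a rough illustration of two of the preprocessing steps above, the sketch below computes a percentage change against a three-year baseline and fills gaps in a weekly series by linear interpolation. The function names are hypothetical, and the rule for "insufficient data" (fewer than two observed points, or gaps at either end of the series) is an assumption made for the example, not the study's documented criterion.

```python
def percent_change_vs_baseline(current, prior_years):
    """Percentage change of `current` relative to the mean of `prior_years`
    (e.g., the same week in 2017, 2018, and 2019)."""
    baseline = sum(prior_years) / len(prior_years)
    return 100.0 * (current - baseline) / baseline


def interpolate_linear(series):
    """Fill interior None gaps in a weekly series by linear interpolation.

    Returns None when the series cannot be interpolated under the assumed
    rule: fewer than two observed points, or a gap at either end.
    """
    known = [i for i, value in enumerate(series) if value is not None]
    if len(known) < 2 or known[0] != 0 or known[-1] != len(series) - 1:
        return None
    filled = list(series)
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled
```

An LTLA whose series returns `None` here would be the analogue of an LTLA excluded for insufficient data.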


Model selection algorithm

We developed a dynamic supervised machine learning algorithm based on log-linear regression. The algorithm allows the optimal prediction model to vary over time given the best data available to date, and therefore to reflect the best real-time prediction given all the available data.
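To make the term "log-linear regression" concrete, here is a deliberately simplified sketch with a single predictor, fitted by ordinary least squares on the log of the outcome. The study's models include many predictors plus LTLA dummy variables; the function names and the one-predictor form are assumptions made purely for illustration.

```python
import math


def fit_log_linear(x, y):
    """Least-squares fit of log(y) = a + b * x; returns (a, b). Requires y > 0."""
    log_y = [math.log(value) for value in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_log_y = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_log_y) for xi, li in zip(x, log_y))
    b = sxy / sxx
    a = mean_log_y - b * mean_x
    return a, b


def predict_log_linear(a, b, x_new):
    """Back-transform the linear predictor to the outcome scale."""
    return math.exp(a + b * x_new)
```

Because the model is linear on the log scale, a one-unit change in the predictor multiplies the predicted outcome by `exp(b)`.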

Figure 1 shows the iteration of model selection and assessment. We started with a baseline model [24] that included LTLA (as dummy variables), Google’s six mobility metrics, first- and second-dose vaccination coverage, and eight baseline symptoms from the Google symptom search trends: cough, fever, fatigue, diarrhea, vomiting, shortness of breath, confusion, and chest pain, which were the most relevant to COVID-19 based on existing evidence [25]. Dysgeusia and anosmia, the other two main symptoms of COVID-19 [26], were not included as baseline symptoms because Google search data for these two symptoms were sufficient for modeling in only about 56% of the LTLAs (the two symptoms were included as baseline symptoms in the sensitivity analysis described below). We then selected and assessed the optimal lag combination [15,27,28] between each predictor and the growth rate. Next, starting from the eight baseline symptoms, we applied a forward data-driven method to include additional symptoms in the model, allowing the inclusion of other symptoms that could improve the predictive performance of the model. Finally, we assessed the different predictor combinations (Figure 1; Supplementary Methods and Supplementary Table 1).
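The lag search and the forward inclusion of symptoms can be sketched generically as below. This is a hedged outline, not the published procedure: `score` stands for any error measure to be minimized (such as a 4-week MSE), the exhaustive search over lags of 0 to 3 weeks per predictor and the "stop when nothing improves" rule are assumptions, and all names are hypothetical.

```python
from itertools import product


def best_lag_combination(predictors, score, lags=range(4)):
    """Try every assignment of a lag (in weeks) to each predictor and
    return the {predictor: lag} mapping with the lowest score."""
    combos = (
        dict(zip(predictors, combo))
        for combo in product(lags, repeat=len(predictors))
    )
    return min(combos, key=score)


def forward_select(base, candidates, score):
    """Greedy forward selection: repeatedly add the candidate that lowers
    `score` the most, and stop when no candidate improves it."""
    selected = list(base)
    remaining = [c for c in candidates if c not in selected]
    best = score(selected)
    while remaining:
        trial_scores = {c: score(selected + [c]) for c in remaining}
        winner = min(trial_scores, key=trial_scores.get)
        if trial_scores[winner] >= best:
            break
        best = trial_scores[winner]
        selected.append(winner)
        remaining.remove(winner)
    return selected, best
```

In practice `score` would refit the candidate model and return its error, so each call is expensive; the exhaustive lag search grows as 4 to the power of the number of predictors, which is one reason a real implementation might group predictors that share a lag.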

Fig. 1: Schematic depiction of model selection and assessment.

SE squared error, MSE mean squared error. In each of the assessment steps, the optimal model had the smallest MSE. Xm1(t) to Xm6(t): mobility metrics for the six location categories. Xs1(t) to Xs8(t): search metrics for the eight baseline symptoms. Xv1(t) and Xv2(t): COVID-19 vaccination coverage for the first and second dose. Details are in the Supplementary Methods.

At each step, model performance was assessed by calculating the mean squared error (MSE) of the predictions over the previous four weeks, i.e., the 4-week MSE, with the MSE for each week evaluated separately by fitting the same candidate model (Fig. 1 and Supplementary Methods). The 4-week MSE reflected the average predictive performance of candidate models over the past four weeks (referred to as the 4-week retrospective MSE). Models with the minimum 4-week MSE were considered for inclusion at each step. Separate model selection processes were performed for each of the prediction targets.
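A minimal sketch of the 4-week retrospective MSE and the resulting model choice is given below. It assumes each candidate is represented by a hypothetical callable `predict_for_week(week)` that refits the candidate on data available before that week and returns its prediction; in the study the error is also aggregated across LTLAs, which this single-series toy omits.

```python
def retrospective_mse(predict_for_week, observed, current_week, window=4):
    """Mean squared error over the `window` weeks ending at `current_week`."""
    weeks = range(current_week - window + 1, current_week + 1)
    squared_errors = [(predict_for_week(w) - observed[w]) ** 2 for w in weeks]
    return sum(squared_errors) / window


def select_optimal(candidates, observed, current_week):
    """Return the name of the candidate with the smallest retrospective MSE."""
    return min(
        candidates,
        key=lambda name: retrospective_mse(candidates[name], observed, current_week),
    )


# Hypothetical observed growth rates for study weeks 37 to 40.
observed = {37: 1.0, 38: 1.2, 39: 0.9, 40: 1.1}
candidates = {"flat": lambda week: 1.0, "high": lambda week: 2.0}
chosen = select_optimal(candidates, observed, current_week=40)
```

Rerunning `select_optimal` each week with the newest data is what makes the selection dynamic: the winning candidate can change from one week to the next.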

In addition, we considered naive models as alternative candidates for selection; naive models (which assumed no change in growth rate) carried forward the last available observation for each of the outcomes as the prediction. As with the full models (i.e., models with predictors), we considered lags of zero to three weeks and used the 4-week MSE for naive models (Supplementary Table 2).
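A naive carry-forward prediction might look like the sketch below. How the lag enters the naive model is an assumption here (the observation is taken `lag` weeks before the most recent one available at prediction time); the exact definition is in the Supplementary Methods, and the function name and arguments are hypothetical.

```python
def naive_prediction(growth_by_week, target_week, horizon, lag=0):
    """Predict the growth rate for `target_week` by carrying forward the
    value observed at week `target_week - horizon - lag`."""
    return growth_by_week[target_week - horizon - lag]


# Hypothetical observed growth rates by study week.
growth = {36: 0.9, 37: 1.1, 38: 1.3}
one_week_ahead = naive_prediction(growth, target_week=39, horizon=1)
one_week_ahead_lag2 = naive_prediction(growth, target_week=39, horizon=1, lag=2)
```

Because the function has the same shape as any other predictor, a naive model can compete with the full models under the same 4-week MSE criterion.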

Prospective evaluation of model predictability

After selecting the optimal model based on the 4-week retrospective MSE, we prospectively evaluated its predictive performance by calculating the prediction errors for growth rates in the following 1-3 weeks (for the three prediction horizons), referred to as the future MSE (Supplementary Methods and Supplementary Table 3). As the optimal prediction models changed over time under our modeling framework, we selected a priori eight checkpoints spaced five weeks apart at which to assess predictive performance (we did not assess every week because of the substantial computational time required): year 1/week 40 (the week of March 1, 2021), 1/45 (April 5), 1/50 (May 10), 2/3 (June 14), 2/8 (July 19), 2/13 (August 30), 2/18 (October 4) and 2/23 (November 14). For each checkpoint, we presented the composition of the optimal models, as well as the corresponding future MSE.

Two reference models were used to help evaluate our dynamic optimal models. We considered naive models (with the optimal lag based on the 4-week retrospective MSE) as the first reference model, to understand the extent to which covariate-driven models could outperform models that assume the status quo. As the second reference model, to further demonstrate the advantages of our dynamic model selection approach over a conventional model with a fixed list of predictors, we took the optimal model at the first checkpoint (i.e., year 1/week 40) and fixed its covariates (referred to as the fixed-predictor model); we then compared the future MSEs at the following seven checkpoints (i.e., from year 1/week 45), while allowing the model coefficients to vary.
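The comparison against the two reference models reduces to computing a future MSE per model at each checkpoint and ranking them. The sketch below assumes predictions and later observations are stored as dicts keyed by LTLA; the names and numbers are hypothetical and serve only to show the calculation.

```python
def future_mse(predicted, observed):
    """MSE across LTLAs between predictions made at a checkpoint and the
    growth rates observed afterwards; only LTLAs present in both are used."""
    shared = predicted.keys() & observed.keys()
    return sum((predicted[k] - observed[k]) ** 2 for k in shared) / len(shared)


def rank_models(predictions_by_model, observed):
    """Return (model name, future MSE) pairs sorted from best to worst."""
    scores = {
        name: future_mse(pred, observed)
        for name, pred in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical predictions for two made-up LTLAs at one checkpoint.
observed_later = {"LTLA_A": 1.5, "LTLA_B": 2.0, "LTLA_C": 3.0}
predictions = {
    "dynamic": {"LTLA_A": 1.5, "LTLA_B": 2.5},
    "naive": {"LTLA_A": 1.0, "LTLA_B": 3.0},
}
ranking = rank_models(predictions, observed_later)
```

Repeating this at every checkpoint gives a per-checkpoint view of how often the dynamic model beats the naive and fixed-predictor references.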

Sensitivity analyses

As a sensitivity analysis, the baseline symptoms were expanded to include dysgeusia and anosmia, as well as headache, nasal congestion, and sore throat, which have recently been reported as common symptoms of COVID-19 [17], to assess how predictive accuracy was affected.

Web application

We developed a web application, COVIDPredLTLA, using R Shiny, which presents our best prediction results at the local level in the UK, given all the data available to date. COVIDPredLTLA, officially launched on December 1, 2021, uses real-time data from the above sources and is currently updated twice a week. The application presents the predicted percentage changes (and uncertainties where applicable) in COVID-19 cases in the current week (nowcasts) and one and two weeks ahead (forecasts) compared with the previous week, using the optimal models (which could technically be either naive models or one of the full models), separately for the two reporting dates (date of publication and date of specimen collection) for each LTLA.

Analyses were done with R software (version 4.1.1). We followed the STROBE guidelines for the reporting of observational studies and the EPIFORGE guidelines for the reporting of epidemic forecasting and prediction research. All data in the analyses were population-aggregated data available in the public domain, and therefore ethical approval was not required.

Reporting summary

More information on research design is available in the Nature Research Reporting Summary linked to this article.
