Computing the daily reproduction number of COVID-19 by inverting the renewal equation using a variational technique
Significance
Based on a signal-processing approach, we propose a method to compute the reproduction number Rt, the transmission potential of an epidemic over time. Rt is estimated by minimizing a functional that enforces: 1) the ability to produce an incidence curve it corrected of the weekly periodic bias produced by the weekend effect, obtained from Rt through a renewal equation; and 2) the regularity of Rt. A good agreement is found between our Rt estimate and the one provided by the currently accepted method, EpiEstim, except that our method predicts Rt several days closer to present. We provide the mathematical arguments for this shift.
Abstract
The COVID-19 pandemic has undergone frequent and rapid changes in its local and global infection rates, driven by governmental measures or the emergence of new viral variants. The reproduction number Rt indicates the average number of cases generated by an infected person at time t and is a key indicator of the spread of an epidemic. A timely estimation of Rt is a crucial tool to enable governmental organizations to adapt quickly to these changes and assess the consequences of their policies. The EpiEstim method is the most widely accepted method for estimating Rt. But it estimates Rt with a significant temporal delay. Here, we propose a method, EpiInvert, that shows good agreement with EpiEstim, but that provides estimates of Rt several days in advance. We show that Rt can be estimated by inverting the renewal equation linking Rt with the observed incidence curve of new cases, it. Our signal-processing approach to this problem yields both Rt and a restored it corrected for the weekend effect by applying a deconvolution and denoising procedure. The implementations of the EpiInvert and EpiEstim methods are fully open source and can be run in real time on every country in the world and every US state.
The reproduction number Rt is a key epidemiological parameter evaluating transmission potential of a disease over time. It is defined as the average number of new infections caused by a single infected individual at time t in a partially susceptible population (1). Rt can be computed from the daily observation of the incidence curve it, but requires empirical knowledge of the probability distribution of the delay between two infections (2, 3).
There are two different models for the incidence curve and its corresponding infection delay . In a theoretical model, it would represent the real daily number of new infections, and is sometimes called generation time (4, 5) and represents the probability distribution of the time between infection of a primary case and infections in secondary cases. In practice, neither parameter is easily observable because the infected are rarely detected before the appearance of symptoms, and tests will be negative until the virus has multiplied over several days. What is routinely recorded by health organizations is the number of new detected, incident cases. When dealing with this real incidence curve, is called the serial interval (4, 5). The serial interval is defined as the delay between the onset of symptoms in a primary case and the onset of symptoms in secondary cases (5).
Rt is linked to it and through the renewal equation, first formulated for birthdeath processes in a 1907 note of Alfred Lotka (6). We adopt the Nishiura et al. formulation (7, 8),[1]where tc represents the last time at which it was available, and f0 and f are the minimal and maximal observed times between a primary and a secondary case. The underlying epidemiological assumption of this model is that the time-varying factor Rt causes a constant proportional change in an individuals infectiousness, over the course of their entire infectious period, based on the day on which they were infected. In this case, we refer to Rt as the case reproductive number. According to Cori et al. (ref. 5, which cites ref. 9), It is the average number of secondary cases that a case infected at time step t will eventually infect.
It is important to note that secondary infections are sometimes detected before primary ones, and, therefore, the minimum delay f0 is generally negative (see Fig. 2). Eq. 1 does not yield an explicit expression for Rt. Yet, an easy solution can be found for the version of the renewal equation proposed in Fraser (equation 9 in ref. 9) and Cori et al. (5),[2]
By this equation, Rt is derived at time t from the past incidence values by a simple division, provided that :[3]
The underlying epidemiological assumption of this model is that the time-varying factor of Rt causes a change in the infectiousness only on the day on which transmission occurs.* In this case, we refer to Rt as the instantaneous reproduction number. This Rt estimate, implemented by the EpiEstim software, is highly recommended in a very recent review (10) signed by representatives from 10 different epidemiological laboratories from several continents.
EpiEstim is the standard method to compute a real-time estimation of the reproduction number and of widespread use. In its stochastic formulation, the first member it of Eq. 2 is assumed to be a Poisson variable, and the second member of this equation is interpreted as the expectation of this Poisson variable. This leads to a maximum-likelihood estimation strategy to compute Rt (5, 1114). A detailed description of the EpiEstim method can be found in SI Appendix.
Comparing Eqs. 2 and 1 shows that when applied with the same serial interval and case-incidence curve, the second equation is derived from the first by assuming Rt constant on the serial interval support . Replacing by Rt in Eq. 1 indeed yields Eq. 2. However, it is more precise, when replacing by a constant value, to take into account where the values of the serial interval are concentrated, namely, the expectation (of the probability distribution) of the serial interval. In particular, a more accurate interpretation of the quotient on the right of Eq. 3 would be[4]where is a central value of the probability distribution of the serial interval that could be, for instance, the median or the mean. In the Ma et al. (15) estimate of the serial interval for COVID-19, we have for the median and for the mean. This supports the hypothesis that EpiEstim estimates Rt with an average delay of more than 5 d.
In practice, the way the sliding average of the incidence is calculated causes another delay. Indeed, as illustrated in Fig. 1, the raw data of the incidence curve it can oscillate strongly with a 7-d period. This oscillation has little to do with the Poisson noise used in most aforementioned publications. Government statistics are affected by changes of testing and polling policies and by weekend reporting delays. These recording delays and subsequent rash corrections result in impulse noise and a strong weekly periodic bias observable on the incidence curve (in green) of Fig. 1 (Top).
» data-icon-position= » » data-hide-link-title= »0″>
Illustration of the EpiInvert method, using the online interface (25), on the US incidence curve of new cases. (Top) In green, the raw oscillating curve of incident cases up to September 5, 2021; in blue, the incidence curve after correction of the weekend bias; and in red, the incidence curve simulated from Rt after the inversion of the renewal equation. (Bottom) In black, Rt, the reproduction number estimated by the current EpiEstim method, adopted by most health experts (27), shifted back 8 d. Estimating its value every day guides the health policy of each country. Having Rt slightly smaller than one after a period of positive values of Rt, as is the case for the United States on September 5 means a potential modification of the pandemic trend. In red is the estimation of Rt by the EpiInvert method. This estimate, obtained by compensating the weekend bias and inverting the integral equation, has a temporal shift of about 8 d with respect to EpiEstim. The shadowed areas give the 90% and 95% CIs for the Rt estimation.
To reliably estimate the reproduction number, a regularity constraint on Rt is needed. Cori et al. (5), initiators of the EpiEstim method, use as regularity constraint the assumption that Rt is locally constant in a time window of size ending at time t (usually = 7 d). This results in smoothing the incidence curve with a sliding mean over 7 d. This assumption has two limitations: It causes a significant resolution loss and an additional backward shift in the estimation of Rt, given that Rt is assumed constant in .
In summary, the computation of Rt by Eqs. 1 and 2 raises three challenges:
-
1) The renewal equation [1] involves future values of it, those for .
-
2) Its second form, Eq. 2, used by the standard method estimates Rt with a backward shift of about 5 d.
-
3) Smoothing of the weekend effect causes a 3.5-d shift backward.
These cumulative backward shifts may cause a time delay of up to 8.5 d. We shall give an experimental confirmation of such delays by two independent methods: using a simulator with synthetic ground truths and a thorough study of the incidence curves of 55 countries. The practical meaning of this study is that the value of Rt computed by EpiEstim at time t might refer approximately to .
Here, we address these three issues by proposing a method to invert the renewal equations [1] and [2]. The inversion method developed for Eq. 1 is illustrated in Fig. 1 (Bottom), where the EpiEstim result using the renewal equation [2] (in black) is superposed with the estimate (in red) of Rt by EpiInvert using Eq. 1. After registering both, the black EpiEstim curve stops 8 d before EpiInvert, the red curve. More generally, we found, using the incidence curve of 55 countries, that the median of the temporal shift between the EpiEstim and EpiInvert Rt estimates using the form Eq. 1 of the renewal equation is about 8.24 d and that the median of the root mean squared error (RMSE) between both estimates is just about 0.036.
This result is slightly surprising, given that the interpretation of Rt in both equations is different and that the serial interval used in both equations also is different. In Eq. 2, the serial interval is indeed truncated to preserve the temporal causality of this equation. This excellent 0.036 fit nevertheless suggests that the EpiInvert method, applied to the renewal equation [1], is compatible with the EpiEstim method, but brings an Rt estimate closer to present. This fact will be investigated by simulations in Sections 3 and 4.
One of the key points of the success of EpiInvert is the correction of the weekly bias. For instance, in Fig. 1 (Top), we observe that, in the last months, the registered incidence curve (green) in the United States shows a clear and significant weekly bias, which is the main source of perturbation of the incidence curve. EpiInvert is able to properly correct the weekly bias (blue curve), which allows for a more accurate estimation of Rt closer to present and a restored incidence curve simulated from Rt after the inversion of the renewal equation (red curve).
The general integral Eq. 1 is a functional equation in Rt. Integral equations have been used to estimate Rt: In ref. 16, the authors estimate Rt as the direct deconvolution of a simplified integral equation, where it is expressed in terms of Rt and it in the past, without using the serial interval. Such inverse problems involving noise and a reproducing kernel can be resolved through the TikhonovArsenin (17) variational approach involving a regularization term. This method is widely used to solve integral equations and convolutional equations (18). The solution of the equation is estimated by an energy minimization. The regularity of the solution is obtained by penalizing high values of the derivative of the solution. Our variational formulation includes the correction of the weekly periodic bias, or weekend effect. The standard way to deal with a weekly periodic bias is to smooth the incidence curve by a 7-d sliding mean. This implicitly assumes that the periodic bias is additive. The present study supports the idea that this bias is better dealt with as multiplicative. In the variational framework, the periodic bias is therefore corrected by estimating multiplicative periodic correction factors. This is illustrated in Fig. 1 (Top), where the green oscillatory curve is transformed into the blue filtered curve by the same energy-minimization process that also computes Rt (Bottom, in red) and reconstructs the incidence curve up to present by evaluating the renewal equation using the computed Rt and the filtered incidence curve (Top, in red).
In this work we use two versions of the renewal equation formulation to compute Rt. It is, however, possible to formulate statistical models for Rt that do not take into account the serial interval and the renewal equation. For instance, in ref. 19, the author proposes to use the model:[5]where Zt is an independent and identically distributed sequence of standard normal random variables, is the dispersion of the random walk, and is the coefficient of drift. The model was fit to the provided incidence data by applying Bayesian inference on the parameter and state space with assumed prior distributions.
1. Available Serial Interval Functions for SARS-CoV-2
As we saw, the serial interval in epidemiology refers to the time between successive observed cases in a chain of transmission. Du et al. (20) define it as the time duration between a primary case (infector) developing symptoms and secondary case (infectee) developing symptoms.
Du et al. (20) obtained the distribution of the serial interval by a careful inquiry on 468 pairs of patients, where one was the probable cause of the infection of the other. The serial distribution obtained (20) has a significant number of cases on negative days, meaning that the infectee had developed symptoms up to 10 d before the infector . In addition to this first serial interval, we test a serial interval obtained by Nishiura et al. (21) using 28 cases, which is approximated by a log-normal distribution, and a serial interval obtained by Ma et al. (15) using 689 cases. As proposed by the authors, this serial interval has been approximated by a shifted log-normal to take into account the cases in the negative days. In Fig. 2, we show the profile of the three serial intervals. There is good agreement of the serial intervals obtained by Du et al. (20) and Ma et al. (15). Note that for the Ma et al. serial interval, for Nishiura et al., and for Du et al. The discrete support of is therefore contained in the interval .
» data-icon-position= » » data-hide-link-title= »0″>
Serial intervals used in our experiments: the discrete one proposed by Du et al. (20) (solid bars in blue), the serial interval proposed by Ma et al. (15) (solid bars in orange) and its shifted log-normal approximation (in green), and finally a log-normal approximation of the serial interval proposed by Nishiura et al. (21) (in red).
We are assuming that the serial interval profile does not change across time. As stated in ref. 22 (equation 10), it is nevertheless possible to use a more general form of the renewal equation [1],[6]where is the forward serial interval, which takes into account that the onset of symptoms and transmission potential can jointly depend on the life history of a disease. The forward serial interval measures the time forward from symptom onset of an infector, obtained from a cohort of infectors that developed symptoms at the same time t s. This more general form of the renewal equation is used in ref. 22 to properly link the initial epidemic growth to the reproduction number R0. The variational approach proposed in the present work can be easily extended to compute Rt from it and , provided an estimation of is available.
Transmissibility can also depend on coronavirus lineage. For instance in ref. 23, the authors show that the SARS-CoV-2 variant B.1.1.7 has a 43 to 90% higher reproduction number than preexisting variants. It cannot be ruled out that these new variants have a different serial interval than preexisting ones.
2. Computing Rt by a Variational Method
As explained in the previous section, we aim at solving two versions of the renewal equation[7]
where[8]
F2 corresponds to the case reproductive number formulation (Eq. 1) and F1 to the instantaneous reproduction number formulation (Eq. 2). Both formulations of the renewal equation are valid, and we can apply our methodology to both. As we shall see, this leads to anticipate by several days the estimate of Rt. Eq. 2 is also used in the classic Wallinga Teunis method (4), as shown in SI Appendix. This last method is widely used to compute Rt, retrospectively.
Correcting the Weekend Effect
We must first formulate a compensation for the weekend effect, which in most countries is stationary, strong, and the main cause of discrepancy between it and its expected value . To remove the weekend effect, we estimate periodic multiplicative factors defined by a vector .
The variational framework we propose to estimate Rt is therefore given by the minimization of the energy[9]where denotes the remainder of the Euclidean division of t by 7, t = 0 represents the beginning of the epidemic spread, and tc represents the date of the last available incidence value.
The weekend effect has varied over the course of the pandemic. Hence, for the estimate of q, it is better to use a time interval , where T is fixed in the experiments to T = 56 (8 wk). This 2-mo time interval is long enough to avoid overfitting and small enough to ensure that the testing policy has not changed too much. The optimization of Rt is still performed through the whole time interval . The corrected value amounts to a deterministic attenuation of the weekend effect on it. An obvious objection is that this correction might not be mean-preserving. To preserve the number of accumulated cases in the period of estimation, we therefore add the constraint[10]
to the minimization problem Eq. 9.
In that way, the multiplication by the factor produces a redistribution of the cases it during the period of estimation, but it does not change the global number of cases. In Eq. 9, is the median of it in the interval used to normalize the energy with respect to the size of it. In the experiments, we use = 21. The first term of E is a data-fidelity term, which forces the renewal Eq. 7 to be satisfied as closely as possible. The second term is a classic TikhonovArsenin regularizer of Rt. As in the case of EpiEstim, this method provides a real-time estimate of Rt up to the date, tc, of the last available incidence value. Yet, in contrast with EpiEstim, this method takes advantage for of the knowledge of the incidence curve for . This improves the posterior accuracy of the Rt estimate.
The Regularization Weight
The regularization weight is a dimensionless constant weight fixing the balance between the data-adjustment term and the regularization term.
Boundary Conditions of the Variational Model
Since t = 0 is the beginning of the epidemic spread where the virus runs free, one is led to use an estimate of based on the basic reproduction number R0. (In SI Appendix, we present a basic estimation of R0 from the initial exponential growth rate of the epidemic obtained as in ref. 24). Therefore, to solve Eq. 9, we add the boundary condition . The proposed inversion model provides an estimation of Rt up to the date, tc, of the last available incidence value. Yet, if , the functional Eq. 9 involves a few future values of Rt and it for . These values are unknown at present time tc. We use a basic linear regression using the last seven values of it to extrapolate the values of it beyond tc. We show in SI Appendix that the boundary conditions and the choice of the extrapolation procedure have a minor influence in the estimation of Rt in the last days when minimizing Eq. 9.
All of the experiments described here can be reproduced with the online interface available in ref. 25. This online interface allows one to assess the performance of the method applied to the total world population and to any country and any state in the United States, with the last available data. We detail our daily sources in SI Appendix.
An Empirical CI for Rt
In the absence of a statistical model on the distribution of Rt, no theoretical a priori CI for this estimate can be given. Nevertheless, a realistic CI is obtained by the following procedure: Let us denote by the EpiInvert estimate at time t using the incidence curve up to the date . Therefore, represents the final EpiInvert estimate of Rt using the incidence data up to the last available date tc. As shown below using the real and simulated data, for stabilizes for . We can therefore consider as an approximation of the reproduction number ground truth. We want to provide an empirical CI such that 95% of the times (for ). To define r(t) we use, on the one hand, a measure of the variation of Rt in the last few days given by[11]where in are obtained by linear extrapolation. In our experiments, we use N = 3. On the other hand, we use, supported by results obtained below for real and simulated data, that the error in the estimation of Rt grows linearly when t approaches tc (the last time at which it was available). Combining with a linear function with respect to , we obtain the following expression for r(t):[12]
where B and C are parameters of the estimation and . The advantage of this empirical approach is that the estimation of the CI is adapted to the variation of Rt in the last few days. To compute the empirical CIs, we use the incidence curve of 55 countries. For each country, and for any and , we refer to as the EpiInvert estimate of the reproduction number at time t using the incidence curve up to the time t + k. Using the values of for and the 55 countries, we obtain, in the case of , that using B = 0.24 and C = 0.03, 95% of times (for ). In the same way, for the empirical 90% CI, we obtain B = 0.16 and C = 0.022. If we consider now , then it is observed that stabilizes for k = 3. Using it as ground truth, the obtained empirical CIs for are given by B = 0.04 and C = 0.016 (in the case of 95%) and B = 0.02 and C = 0.009 (in the case of 90%) These empirical intervals are displayed for each t in the online algorithm in ref. 25 and have the aspect of fattened curves above and below Rt.
Efficiency Measure of the Weekly Bias Correction
We estimate the correction of the weekly periodic bias by the efficiency measure[13]
I represents the reduction factor of the RMSE between the incidence curve and its estimate using the renewal equation after correcting the weekend bias. and R are the optimal values for the energy Eq. 9, and R1 denotes the R estimate without correction of the weekly bias. The value of I can be used to assess whether it is worth applying the correction of the weekly periodic bias to a given country in a given time interval.
Estimation of the Temporal Shift between EpiEstim and EpiInvert.
In what follows, we will denote by the EpiEstim estimation of the reproduction number by Cori et al. (5), detailed in SI Appendix. As we have argued above, we expect a significant temporal shift between the EpiInvert estimate of Rt and of the order of 9 d. This expectation is strongly confirmed by the experimental results and can be checked by applying the proposed method to any country using the online interface available in ref. 25. In summary, the time shift between both methods should be a half-week (3.5 d) for and by Eq. 4 of about for . This will be verified experimentally by computing the shift between and Rt yielding the best RMSE between both estimates:[14]
where T = 56 (8 wk) and where we evaluate for noninteger values of k t by linear interpolation.
Summary of the Algorithm Parameters and Options
-
Choice of the serial interval: The default options are the serial intervals obtained by Ma et al. (we use the shifted log-normal approximation), Nishiura et al., and Du et al. The users can also upload their own serial interval;
-
Choice of the renewal equation used, or ;
-
Correction of the weekly periodic bias (option by default)
The regularization weight w is always fixed to five, the value we obtain below, experimentally, by comparing with EpiEstim.
Summary of the Output Displayed in Ref. 25
First, we draw two charts. In the first one, we draw Rt and shifted back days where is defined in Eq. 14. Rt is surrounded by a shaded area that represents the above-defined empirical CIs. In the second chart, we draw the initial incidence curve it in green, the incidence curve after the correction of the weekly periodic bias in blue, and the evaluation of the renewal equation given by in red. For each experiment, we also compute:
-
1) : last available value of the EpiInvert Rt estimate.
-
2) : last available value of the EpiEstim estimate .
-
3) : optimal shift (in days) between R and REpi defined in Eq. 14.
-
4) : RMSE between R and REpi shifted back days (defined in Eq. 14).
-
5) : variability of the original incidence curve, it, given by:[15]
-
6) : variability of the filtered incidence after the correction of the weekly periodic bias.
-
7) I: reduction factor of the RMSE between the incidence curve and its estimate using the renewal equation after the correction of the weekly periodic bias (defined in Eq. 13).
-
8) : the correction coefficients of the weekly periodic bias (q6 corresponds to the tc, the last time at which it was available).
3. Results on Incidence Curves from 55 Countries
To estimate a reference value for the regularization parameter w, we used the incidence data up to July 17, 2021, for the 55 countries showing the larger number of cases. For each country, we performed 30 experiments. Starting with the incidence data up to July 17, in each experiment, we removed the last 10 d from the incidence data used in the previous experiment. In that way, we got a large variety of real epidemic scenarios. We highlight the challenges of interpreting real case report time series, which frequently have an additional layer of convolution/delay relative to the true case incidence time series due, for instance, to the delay between the test and their registration in the official records.
We optimized the RMSE between Rt and shifted back days (defined in Eq. 14). This optimization was performed with respect to w and . The goal was to fix w, the only parameter of the method, so that the result of EpiInvert is as close as possible to EpiEstim in the days where both methods predict Rt. The second goal of this optimization was to estimate the effective time shift between both methods.
In Fig. 3, we show the box plot of the distribution of w for F1 and F2 when w was optimized independently for each experiment to minimize the average error over 56 d between the EpiEstim and the EpiInvert methods. The median of the distribution of w is five for F1 and F2, which indicated that a common value of w could be fixed for all countries. Here, and in all figures to follow, each dot represents the average of all experimental results associated to a country.
» data-icon-position= » » data-hide-link-title= »0″>
Distribution of w for F1 and F2 when the regularization weight w and the delay are optimized independently for each country to minimize the average error between the EpiEstim and the EpiInvert methods on a time lapse of 56 d. France, blue; Japan, green; Peru, cyan; South Africa, magenta; United States, red.
In Fig. 4, we show, for the versions F1 and F2 of the renewal equation, the average error over 56 consecutive days of the error between the EpiEstim and the EpiInvert estimates of Rt for each country. The median of the overall average error is 0.025 for F1 and 0.034 for F2.
» data-icon-position= » » data-hide-link-title= »0″>
Average error between the EpiEstim and the EpiInvert estimates of Rt for each country. In Left, w is fixed, and in Tight, it is the optimal weight per country. France, blue; Japan, green; Peru, cyan; South Africa, magenta; United States, red.