Correct incidence data for yet-to-be-observed fraction of events

Use this function to correct the tail of an incidence time series when events are only recorded after a subsequent observation event. For instance, if the incidence represents people starting to show symptoms of a disease (dates of symptom onset), the data would typically have been collected among individuals whose case was confirmed by a test. In that case, among all symptom-onset events, only those that had time to be confirmed by a test are reported. Thus, close to the present, symptom-onset events are under-reported. To account for this effect, this function divides each incidence value by the probability that an event occurring at that time step has already been observed. Typically, this correction only affects the few most recent data points.
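As a rough illustration of this correction, the sketch below divides a made-up incidence vector by an observation probability. It is not the package's implementation: the incidence values are hypothetical, the gamma parameters are illustrative, and it assumes that the observation probability can be approximated by the delay distribution's CDF evaluated at the number of days an event has had to be reported.

```r
# Hypothetical daily incidence by date of symptom onset (last entry = most recent day).
observed <- c(120, 118, 125, 110, 90, 60, 20)
n <- length(observed)

# Illustrative gamma delay between symptom onset and report.
shape <- 2.7
scale <- 1.6

# Days available for each event to be reported. Assumption: an event on the
# most recent day has had one day to be reported.
days_available <- n:1

# Approximate probability that an event at each time step has been observed.
observation_probability <- pgamma(days_available, shape = shape, scale = scale)

# Correction: divide each incidence value by its observation probability.
corrected <- observed / observation_probability

print(round(data.frame(observed, observation_probability, corrected), 2))
```

Older time steps have an observation probability close to 1 and are left almost unchanged, while the most recent ones are scaled up substantially.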

nowcast(
  incidence_data,
  delay_until_final_report,
  cutoff_observation_probability = 0.33,
  gap_to_present = 0,
  ref_date = NULL,
  time_step = "day",
  ...
)

Arguments

incidence_data	An object containing incidence data through time. It can be either: (1) A list with two elements: a numeric vector named `values`, the incidence recorded on consecutive time steps, and an integer named `index_offset`, the offset, counted in number of time steps, by which the first value in `values` is shifted compared to a reference time step. This parameter allows one to keep track of the date of the first value in `values` without needing to carry a `date` column around. A positive offset means `values` are delayed in the future compared to the reference values. A negative offset means the opposite. (2) A numeric vector. The vector corresponds to the `values` element described above, and `index_offset` is implicitly zero. This means that the first value in `incidence_data` is associated with the reference time step (no shift towards the future or past).
delay_until_final_report	Single delay or list of delays. Each delay can be one of: a list representing a distribution object; a discretized delay distribution vector; a discretized delay distribution matrix; a dataframe containing empirical delay data.
cutoff_observation_probability	Value between 0 and 1. Only data points for time steps whose probability of observing an event is higher than `cutoff_observation_probability` are kept. The few data points with a lower observation probability are trimmed off the tail of the time series.
gap_to_present	Integer. Default value: 0. Number of time steps truncated off the right tail of the raw incidence data. See Details.
ref_date	Date. Optional. Date of the first data entry in `incidence_data`.
time_step	string. Time between two consecutive incidence datapoints. "day", "2 days", "week", "year"... (see `seq.Date` for details)
...	Arguments passed on to `get_matrix_from_empirical_delay_distr`. `min_number_cases`: integer. Minimal number of cases to build the empirical distribution from. If `num_steps_in_a_unit` is `NULL`, for any time step T, the `min_number_cases` records prior to T are used. If fewer than `min_number_cases` delays were recorded before T, then T is ignored and the `min_number_cases` earliest-recorded delays are used. If `num_steps_in_a_unit` is given a value, a similar procedure is applied, except that the `min_number_cases` delays must now be taken over a round number of time units. For example, if `num_steps_in_a_unit = 7` and time steps represent consecutive days, to build the distribution for time step T, we find the smallest number of weeks, starting from T and going into the past, for which at least `min_number_cases` delays were recorded. We then use all the delays recorded during these weeks. Weeks do not necessarily run from Monday to Sunday; they are simply 7 days in a row, e.g. Thursday to Wednesday. Again, if fewer than `min_number_cases` delays were recorded before T, then T is ignored. We then find the minimum number of weeks, starting from the first recorded delay, that contains at least `min_number_cases` delays. `upper_quantile_threshold`: numeric. Between 0 and 1. Argument for internal use. `min_number_cases_fraction`: numeric. Between 0 and 1. If `min_number_cases` is not provided (kept to `NULL`), the number of most-recent cases used to build the instant delay distribution is `min_number_cases_fraction` times the total number of reported delays. `min_min_number_cases`: numeric. Lower bound for the number of cases used to build an instant delay distribution. `fit`: string. One of "gamma" or "none". Specifies the type of fit that is applied to the columns of the delay matrix. `num_steps_in_a_unit`: Optional argument. Number of time steps in a full time unit (e.g. 7 if looking at weeks). If set, the delays used to build a particular delay distribution will span over a round number of such time units. This option is included for comparison with legacy code.

Value

A list with two elements:

A numeric vector named values: the result of the computations on the input data.
An integer named index_offset: the offset, counted in number of time steps, by which the result is shifted compared to an index_offset of 0. This parameter allows one to keep track of the date of the first value in values without needing to carry a date column around. A positive offset means values are delayed in the future compared to the reference values. A negative offset means the opposite. Note that the index_offset of the output of the function call accounts for the (optional) index_offset of the input.

If index_offset is 0 and simplify_output = TRUE, the index_offset is dropped and the values element is returned as a numeric vector.
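The index_offset convention can be illustrated with a short sketch that maps a result back to calendar dates. The result list and the reference date are hypothetical, and daily time steps are assumed.

```r
# Hypothetical reference date and result (not produced by an actual nowcast() call).
ref_date <- as.Date("2020-03-01")
result <- list(values = c(10, 12, 15, 11), index_offset = 2L)

# With daily time steps, the first value sits index_offset days after ref_date.
# A negative offset would place it before ref_date.
dates <- seq(
  from = ref_date + result$index_offset,
  by = "day",
  length.out = length(result$values)
)

dated_result <- data.frame(date = dates, value = result$values)
print(dated_result)
```

Here the first value is associated with 2020-03-03, two days after the reference date.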

Details

The tail of the time series is trimmed to avoid correcting time steps for which the observation probability is too low, as this could produce overly uncertain corrected values. This trimming is tuned via the cutoff_observation_probability argument.
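The trimming step can be sketched as follows. This is an illustration only, not the package's code: the corrected values and observation probabilities are made-up numbers, and the cutoff is the default of 0.33.

```r
# Hypothetical corrected incidence and the matching observation probabilities
# (last entry = most recent time step).
corrected <- c(121, 122, 139, 147, 180, 300, 500)
observation_probability <- c(0.99, 0.97, 0.90, 0.75, 0.50, 0.20, 0.04)

cutoff_observation_probability <- 0.33

# Keep only time steps whose observation probability exceeds the cutoff.
keep <- observation_probability > cutoff_observation_probability
trimmed <- corrected[keep]

print(trimmed)
```

Because the observation probability decreases towards the present, the dropped values are the most recent ones: in this sketch, the last two time steps.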

The gap_to_present argument represents the number of time steps truncated off the right end of the raw data. If no truncation was done, gap_to_present should be kept at its default value of 0. Truncation can be useful when the latest reported numbers are too unreliable, e.g. in a monitoring situation the latest X days of data can be deemed not worth keeping if they are not well consolidated. An alternative to this truncation is to nowcast the observed incidence using this function and a delay distribution representing the consolidation delay. Unlike best-practice nowcasting methods, this function only provides a maximum-likelihood estimate of the actual incidence; it does not quantify the uncertainty around this estimate.
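To see why the gap matters, the sketch below compares observation probabilities with and without a gap. It assumes that a gap simply adds to the time each event has had to be observed and that the probability follows the delay CDF. The gamma parameters and the gap are illustrative, and the package's internal computation may differ.

```r
# Illustrative gamma delay between event and final report.
shape <- 2.7
scale <- 1.6

n_kept <- 5          # time steps remaining after truncation
gap_to_present <- 3  # hypothetical: the 3 most recent days were dropped

# Days available for reporting if the series ended at the present.
days_available_no_gap <- n_kept:1

# With a gap, every remaining event has had gap_to_present extra days.
days_available_gap <- days_available_no_gap + gap_to_present

p_no_gap <- pgamma(days_available_no_gap, shape = shape, scale = scale)
p_gap <- pgamma(days_available_gap, shape = shape, scale = scale)

print(round(data.frame(p_no_gap, p_gap), 2))
```

Under these assumptions, setting gap_to_present correctly raises the observation probabilities, so the remaining data points are corrected less strongly.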

The ref_date argument is only needed if delay_until_final_report is passed as a dataframe of individual delay observations (a.k.a. empirical delay data). In that case, ref_date must correspond to the date of the first time step in incidence_data.

Examples

## Basic usage of nowcast

shape_onset_to_report <- 2.7
scale_onset_to_report <- 1.6
delay_onset_to_report <- list(name = "gamma",
                              shape = shape_onset_to_report,
                              scale = scale_onset_to_report)

corrected_incidence_data_1 <- nowcast(
  incidence_data = HK_incidence_data$onset_incidence,
  delay_until_final_report = delay_onset_to_report
)


## Advanced usage of nowcast
# Only keep time steps whose probability of being observed is greater
# than 25%. Here, the delay between symptom onset and report is given as
# empirical delay data, so the date of the first entry in incidence_data
# must be specified via ref_date.

corrected_incidence_data_2 <- nowcast(
  incidence_data = HK_incidence_data$onset_incidence,
  delay_until_final_report = HK_delay_data,
  ref_date = HK_incidence_data$date[1],
  cutoff_observation_probability = 0.25
)