Using malclimsim to infer parameters of a climate-driven dynamical transmission model
Source:vignettes/using-malclimsim.Rmd
using-malclimsim.Rmd
Overview
The climate-driven dynamical transmission model used in this package is an adaptation from the one constructed by Ukawuba and Shaman in 2022 (https://doi.org/10.1371/journal.pcbi.1010161). The original paper calibrated the model using data at the provincial and district levels (district-level analysis found in the supplement).
A few key characteristics of the model are that only humans are modeled explicitly but climate-mosquito dynamics are incorporated through the addition equations relating rainfall and temperature to the entomological innoculation rate. It is a compartmental model with 10 total compartments across two age groups, splitting the population into those who are Susceptible to disease, those who have been Exposed and are infected by malaria parasites, those who have Symptomatic Infection but untreated, those who have been Treated, and those who have Asymptomatic infection.
Currently, the model assumes a year is 360 days with twelve 30 day months when using monthly data and assumes 365 day years with 7 day weeks when using weekly data. There are two age groups being modeled - those under 5 years old and those who are 5 years and older. Only one intervention is being considered here - seasonal malaria chemoprevention.
This vignette will go through the process of defining the inputs for, and simulating from, a predefined model of malaria transmission. Afterwords, a number parameters of this model will be inferred using the mcstate package. Lastly, diagnostics will be examined and relevant quantities calculated.
#detach("package:malclimsim", unload = TRUE)
#install.packages("odin.dust", repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
#install.packages("mcstate", repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
#devtools::install_github("https://github.com/SwissTPH/malclimsim", ref = "master")
# testing
library(malclimsim)
Setting Up Model Inputs
Climate data downloading and processing
The first step to using the model is to specify the region (latitude and longitude) in which you want your model to simulate from. Additionally, the years used in the analysis must be specified.
# Years used for the analysis
years_clim <- 2017:2023
start_date <- ymd("2018-01-01") # Start date of the analysis
end_date <- ymd("2023-12-31") # End date of the analysis
years_analysis <- year(start_date) : year(end_date)
print("this is a test")
# Latitude and longitude where climate data (rainfall and temperature) is to be saved
# Rainfall data is from CHIRPS and temperature data is from ERA5
lat <- 8.34
lon <- 17.77
Next, “save_climate_data” will download the raw data from the specified coordinates into the directory given by ‘path_to_data’, where the data can be loaded from for future uses. The temperature data comes from ‘ERA5-Land’ (https://cds.climate.copernicus.eu/datasets/derived-era5-land-daily-statistics?tab=overview) and the rainfall data comes from ‘CHIRTSdaily’ (https://www.chc.ucsb.edu/data/chirtsdaily). To get access to the CHIRPS data (and run the below function), one must register with the European Centre for Medium-Range Weather Forecasts API - instructions found here - https://bluegreen-labs.github.io/ecmwfr/.
# Saving the rainfall data locally based on specified path
save_climate_data(lon = lon, lat = lat, years = years_clim,
path_to_data = path_to_data,
rain_file_name = rain_file_name, temp_file_name = temp_file_name)
Now the data is processed using the ‘process_climate_data’ function. A few things are done here. First, as the temperature data is in the form of monthly averages and the model requires an input for each day, smoothing splines are used to fit a smooth function to the data. This function is then used to estimate daily temperatures. A few more things are done with the rainfall data. First, the cumulative monthly rainfall was calculated. Then, a smoothing spine was fit to this monthly data. Afterwards, daily estimated were estimated from the fit. Lastly, the daily estimates were standardized to have a mean of 0 and a variance of 1.
For more details, look at the code found in “climate_processing_functions.R”.
# Reading in climate data saved by `save_climate_data`
temp_path <- paste0(path_to_data, temp_file_name)
rain_path <- paste0(path_to_data, rain_file_name)
# Processing climate data to be used in the model
# `months_30_days` determines if years are 360 or 365 days - if using weekly data, 365
# days must be used and if using monthly data, 360 days must be used
met_365 <- process_climate_data(lon, lat, years = years_clim, temp_path = temp_path,
rain_path = rain_path, months_30_days = FALSE)
Seasonal malaria chemoprevention data downloading and processing
Next we will load the SMC data, which gives information about when each round of SMC was given and the coverage for that round. As of now, for the functions that process this SMC data, the format (column names) must be as shown below.
data(smc_data_raw)
head(smc_data_raw, n = 5)
smc_data <- load_clean_smc_data(data = smc_data_raw)
head(smc_data, n = 5)
One now needs to define the inputs related to seasonal malaria chemoprevention (SMC). There are three vectors that are needed which should all the same length as the rainfall and temperature vectors. One vector consists of 1s and 0s that define the start of a round of SMC. The second vector is the coverage of SMC for that round. And the last vector is the result of a decay function applied to the coverage over time. The total effect of SMC at a given time point is defined as the product (component-wise multiplication) of these vectors at this time.
To go from the clean dataset, where we have rounds given in monthly intervals, to daily intervals required by the model, we call the `smc_schedule_from_data’ function.
smc_schedule <- smc_schedule_from_data(
smc_cov = smc_data,
months_30_days = FALSE,
years = years_analysis
)
head(smc_schedule, n = 200)
Ensuring alignment of model inputs
As mentioned before, it is necessary that the climate and SMC inputs begin on the same date - otherwise a given day will not have the climate and SMC information actually corresponding to that day. To ensure that this is the case, we have a helper function that checks this condition and another that trims them accordingly. Furthermore, as we have a lag on climate, we need the climate vectors to have data before the date when SMC starts. The following code helps to ensure everything is aligned.
met_365 <- impute_climate_to_end_date(met_365, max(smc_schedule$dates)) # climate data only available until December 2023
# Specify lag
clim_smc_lag <- 180
# Validate continuity, span, and alignment
# validate_smc_climate_alignment(smc_schedule, met_365, clim_smc_lag)
# Trim & align
aligned <- lag_and_trim_smc_climate(smc_schedule, met_365, clim_smc_lag)
smc_schedule <- aligned$smc
met_365 <- aligned$climate
validate_smc_climate_alignment(smc_schedule, met_365, clim_smc_lag)
Simulating from the Malaria Model
One last model input is the population size of the region being studied. In the case of Chad, there is much uncertainty about the true population size so we will also include in the model a population scaling factor. This means that the results are not so sensitivity to the chosen population size value.
N <- 150000
Now, the climate model is loaded using “load_model” which takes as an input the name of the model. Here it is called “model_new_R_with_FOI”. In principle, one could adjust the model or add new models which could then be loaded and simulated from. The model itself is written in ‘odin’, a domain-specific language (DSL) created for facilitating the writing of efficient state-space models. More information can be found in the original paper by FitzJohn et al (10.12688/wellcomeopenres.16466.2).
malaria_model <- load_model("model_new_R_with_FOI") # Load the deterministic climate model
One last object to define before running the model itself. This input is a named list where each name corresponds to a parameter within the model and the values can be specified based on prior knowledge or by fitting to some observed data. Many of the parameter values here come from the paper by Ukawuba and Shaman where the model was fit to district-level data in Rwanda.
Found below is a table with information about each parameter (taken directly from Ukawuba), as well as the corresponding name in the paper by Ukawuba, as the naming convention differs slightly for some parameters (still need to define).
# Extract rainfall and temperature data
rain <- met_365$anom # Standardized rolling mean of rainfall
temp <- met_365$temp # Temperature data
# Extract key SMC schedule information
SMC <- smc_schedule$SMC # Indicator for days when an SMC round started (1s and 0s)
decay <- smc_schedule$decay # Efficacy decay of SMC over time
cov <- smc_schedule$cov # SMC coverage over time
# Define parameter inputs for the malaria model simulation
param_inputs <- list(
# Rate and Demographic parameters
mu_TS = 1/30, mu_IR = 1/5, mu_RS = 1/195,
mu_EI = 1/10, delta_b = 47/(1000*365), delta_d = 47/(1000*365),
delta_a = 1/(5 * 365), N = N, percAdult = 0.81,
# Immunity and Reporting Parameters
pi_s_1 = 0.75, c_s = 0.12,
qR = 0.24,
# Immunity and Reporting Parameters
fT_C = 0.28,
# Population Scaling factor
s = 9,
# Growth rates
r_C_0 = 1.000079,
r_A_0 = 1.000108,
# Time-varying Inputs
decay = decay,
SMC = SMC,
cov_SMC = cov,
c_R_D = rain, temp = temp,
# Climate Parameters
b = 3,
R_opt = 0,
alpha = 1,
sigma_LT = 4, sigma_RT = 4,
k1 = 1.5, lag_R = 30, lag_T = 15,
# Likelihood function parameters
size_1 = 15,
# SMC parameters
eff_SMC = 0.8,
# Misc
clim_SMC_lag = clim_smc_lag
)
Before doing inference, let’s first take a look at what the model output looks like using the parameter values defined above. We first simulate weekly malaria cases using the `data_sim’ function (this function is used a lot behind the scenes throughout the fitting and inference diagnostics process). Afterwards, we’ll plot the weekly malaria cases given by the model from 2018 to the end of 2023. Note that the default parameter values above are “selected” - it is worth changing values and seeing how it changes the model outputs!
# Run the climate-malaria model simulation
results <- data_sim(
malaria_model, param_inputs = param_inputs, start_date, end_date,
month = FALSE, round = FALSE, save = FALSE, month_unequal_days = TRUE
)
plot_time_series(results = results, plot_title = "Simulated Weekly Malaria Cases (2018 to 2023)",
select_incidence = "<5", incidence_y_label = "Weekly Malaria Cases")
Model Inference
Loading observed data
The observed data needs to have columns named in the same way shown below. If the data is monthly, the `date_ymd’ column should correspond to the first of the month in year, month, day format.
head(obs_cases, n = 3)
Weighting observed data
In our example, SMC was deployed each year except for 2019. This means the inference procedure, which aims to maximize the posterior probability globally, naturally gives more weight to years when SMC was deployed (simply because there are more of them). The consequence is a tendency to have worse model predictions for the year that SMC was stopped. As we are ultimately interested in comparing scenarios where SMC was and was not given, comparable performance across SMC status is preferred to better global predictions.
Towards this goal, we take a simple approach and weight the non SMC observations by how many times more SMC observations there are (e.g. if across the dataset, SMC was given for 5 years and not given for 1 year, then the weight for each observation in the years SMC was given would be 1, and the weight would be 5 for the year SMC was not given).
Defining model parameters to estimate
# Listing the model parameters to be estimated
params_to_estimate <- c(lag_R = "lag_R", lag_T = "lag_T",
sigma_LT = "sigma_LT", sigma_RT = "sigma_RT",
k1 = "k1", size_1 = "size_1",
eff_SMC = "eff_SMC", s = "s",
b = "b")
Defining parameters of the Random-walk Metropolis Algorithm
Inference is done in the Bayesian context, meaning we are targeting the posterior distribution of the parameter set given the observed data, which is proportional to the likelihood and prior probabilities. We simulate from this posterior distribution using a popular Markov chain Monte Carlo algorithm called ‘Random Walk Metropolis’. This is implemented in the ‘mcstate’ package - more information can be found in the original paper outlining the package (Fitz et al 2021).
Random-Walk Metropolis has different parameters that must be specified prior to running the algorithm. To start, we need to decide the number of steps to run the algorithm for. Additionally, we must decide how many chains to run. One also needs to select where these chains start. Furthermore, we have to define the covariance matrix of the proposal distribution (here the proposal distribution is a multivariate normal). This choice of a covariance matrix greatly impacts the efficiency of the algorithm, especially when the parameters being estimated have a complex dependency structure. Therefore, it’s important to choose one carefully. Luckily, mcstate has build in functionality for an “adaptive” proposal which learns the optimal proposal distribution during the MCMC procedure (Spencer SEF 2021). More details can be found here - https://mrc-ide.github.io/mcstate/reference/adaptive_proposal_control.html.
In our case, we do inference in three stages. First, we start with a relatively uninformed proposal distribution with a high level of adaptation. Afterwards, we use this first stage to inform a second stage where we update the proposal distribution and the starting values of the chain. This second stage has less strong adaptation and the idea is to continue until some sort of “convergence” is reached. Lastly, the algorithm is run with no adaptation so that the algorithm represents valid draws from the posterior distribution.
params_default <- create_mcmc_params(stage = "stage1")
adaptive_params_1 <- params_default$adaptive_params
control_params_1 <- params_default$control_params
control_params_1$n_steps = 500
# Defining starting values for chains
start_values_1 <- create_start_values(params_to_estimate, control_params_1,
model = malaria_model, param_inputs = param_inputs,
random = TRUE, seed = NULL)
# Proposal matrix
proposal_matrix_1 <- create_proposal_matrix(params_to_estimate = params_to_estimate,
model = malaria_model, param_inputs = param_inputs)
param_names <- rownames(proposal_matrix_1)
Defining aspects of the inference procedure not related to MCMC
Here we choose if we are using monthly or weekly data, which age groups we are fitting to, and whether or not we should account for population growth (this is currently being included as part of the fitting procedure and not within the model itself.
inf_config <- make_obs_config(
use_monthly = FALSE,
age_group = c("u5"),
include_pop_growth = TRUE
)
# Small wrapper around inf_run() + diagnostics + saving
run_stage <- function(proposal_matrix, start_values, control_params,
adaptive_params, out_dir, out_file, rerun_n = 1000,
rerun_random = TRUE, seed = 24) {
# Infer with all the shared args baked in
results <- inf_run(model = malaria_model, param_inputs = param_inputs,
control_params = control_params,
params_to_estimate = params_to_estimate,
proposal_matrix = proposal_matrix,
adaptive_params = adaptive_params,
start_values = start_values,
noise = FALSE, seed = seed,
month_unequal_days = FALSE,
dates = c(start_date, end_date),
synthetic = FALSE, incidence_df = obs_cases,
save_trajectories = FALSE, rerun_n = rerun_n,
rerun_random = rerun_random, param_priors = NULL,
n_years_warmup = 3, obs_config = inf_config)
# Save to disk
saveRDS(results, file = paste0(out_dir, out_file))
# Return the results object
invisible(results)
}
Running Inference
mcmc_results_out_dir = "C:/Users/putnni/Documents/git-hub-repositories/vignette-storage/malclimsim/mcmc-results/"
mcmc_stage_1 <- run_stage(proposal_matrix_1, start_values_1,
control_params = control_params_1,
adaptive_params = adaptive_params_1,
out_dir = mcmc_results_out_dir,
out_file = "mcmc_results_stage_1.rds")
mcmc_stage_1 <- readRDS(paste0(mcmc_results_out_dir, "mcmc_results_stage_1.rds"))
inf_params_stage_2 <- update_inf_stage(results_obj = mcmc_stage_1,
proposal_matrix = proposal_matrix_1,
param_names = rownames(proposal_matrix_1),
S_prev = 3000, draw_n = ncol(start_values_1),
shrink = 0.95, stage = "stage2", n_steps = 1000)
rm(mcmc_stage_1)
mcmc_stage_2 <- run_stage(proposal_matrix = inf_params_stage_2$proposal_matrix,
start_values = inf_params_stage_2$start_values,
control_params = inf_params_stage_2$control_params,
adaptive_params = inf_params_stage_2$adaptive_params,
out_dir = mcmc_results_out_dir,
out_file = "mcmc_results_stage_2.rds")
mcmc_stage_2 <- readRDS(paste0(mcmc_results_out_dir, "mcmc_results_stage_2.rds"))
inf_params_stage_3 <- update_inf_stage(results_obj = mcmc_stage_2,
proposal_matrix = inf_params_stage_2$proposal_matrix,
param_names = rownames(proposal_matrix_1),
S_prev = 3000, draw_n = 3,
shrink = 0.8, stage = "noadapt", n_steps = 1000)
rm(mcmc_stage_2)
mcmc_stage_3 <- run_stage(proposal_matrix = inf_params_stage_3$proposal_matrix,
start_values = inf_params_stage_3$start_values,
control_params = inf_params_stage_3$control_params,
adaptive_params = inf_params_stage_3$adaptive_params,
out_dir = mcmc_results_out_dir,
out_file = "mcmc_results_stage_3.rds")
Inference Diagnostics
Examining trace plots
mcmc_trace <- MCMC_diag(mcmc_stage_3, params = "trace")
Gelman-rubin test
MCMC_diag(mcmc_stage_3, params = "gelman")
Correlation between parameters
plot_corr(mcmc_stage_3)
Assessing Model Fit
Sampling from the posterior distribution
We will now plot the values predicted by the model to the observed values to assess how well our model is recreating the data (aka a posterior predictive check). To do this, we first sample a certain number of parameter sets from the posterior distribution.
# Sample posterior draws from MCMC
n_samples <- 200
param_samples <- sample_mcmc_steps(mcmc_stage_3$coda_pars, n_samples)
Defining population growth input
As population growth was used as part of the inference procedure, we also need to include population growth when comparing the model to the observed data.
# Create covariate matrix (population growth)
r_df <- get_population_scaling(
n = nrow(obs_cases),
month = FALSE,
growth_rate_C = param_inputs$r_C_0,
growth_rate_A = param_inputs$r_A_0
)
pop_growth_df <- data.frame(
date_ymd = obs_cases$date_ymd,
r_C = r_df$r_C
)
# Define transformation for SMC-adjusted incidence
mu_transform_C <- function(inc_df, param_inputs) {
inc_df$inc_C * inc_df$r_C
}
head(pop_growth_df)
Simulating cases using the best parameters
We start by extracting which parameter sets maximized the posterior distribution and simulate the weekly number of cases using these values
# Extract MAP (max posterior) parameter set
max_posterior_params <- extract_max_posterior_params(mcmc_stage_3)
param_inputs_best <- update_param_list(param_inputs, max_posterior_params)
# Simulate cases using the best parameter set (deterministic, no noise)
sim_best <- data_sim(
model = malaria_model,
param_inputs = param_inputs_best,
start_date = start_date,
end_date = end_date,
noise = FALSE,
mu_transform_C = mu_transform_C,
mu_transform_A = NULL,
covariate_matrix = pop_growth_df,
month = FALSE,
save = FALSE
)
sim_best_long <- tidyr::pivot_longer(sim_best, -date_ymd, names_to = "variable", values_to = "value")
Constructing the uncertainty (prediction) interval
Now, we simulate using the different parameter sets sampled from the posterior distribution. We also add observation noise coming from a negative binomial distribution. This provides an interval that gives the range of weekly malaria cases we’d be likely to observe.
# Simulate cases using different parameter sets coming from MCMC run
simulations <- run_simulations_from_samples(
model = malaria_model,
param_inputs = param_inputs_best,
param_samples = aram_samples,
start_date = start_date,
end_date = end_date,
prewarm_years = 2,
mu_transform_C = mu_transform_C,
mu_transform_A = NULL,
covariate_matrix = pop_growth_df,
noise = TRUE,
month = FALSE
)
The simulations are then summarized and we plot the resulting posterior predictive check.
# Summarize the simulations
ci_data <- summarize_simulation_ci(simulations, variables = "inc_C_transformed", ci_level = 0.95)
# Format the simulated and observed data for plotting
ppc_data <- prepare_ppc_data(
ci_data = ci_data,
best_df = sim_best_long,
obs_df = obs_cases,
sim_var = "inc_C_transformed",
obs_var = "inc_C"
)
# Plot the posterior predictive check
ppc_plot <- plot_ppc(
ppc_data = ppc_data,
ci_data = ci_data,
title = "Posterior Predictive Check",
xlab = "",
ylab = "Weekly Number of Cases",
show_obs = TRUE,
show_best = TRUE,
show_ribbon = TRUE
)
print(ppc_plot)