`collectResults()` is a powerful function, and the aim of this vignette is to explain its options in more detail.
## Output database
`collectResults()` creates the following database tables:

- `experiments`: Which experiments are stored in the DB?
- `scenarios`: Which scenarios are available per experiment?
- `placeholders`: Which placeholders are used per scenario?
- `scenarios_metadata`: What metadata is available per scenario?
- `results`: OpenMalaria output (per scenario).
Please note that all data are identified by experiment and possibly by scenario.
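To get a feel for the layout, the database can be inspected with DBI; a quick sketch, assuming the `RSQLite` backend and a database file named `results.sqlite` (the file name is an assumption):

```r
## A sketch; the database file name is an assumption.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "results.sqlite")
dbListTables(con)
#> [1] "experiments" "placeholders" "results" "scenarios" "scenarios_metadata"
dbDisconnect(con)
```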
The table containing the simulation output (`results`) can be renamed via the `resultsName` argument, and its layout can be adjusted as needed via `resultsCols`. `resultsCols` needs to be a list in which the `names` entry defines the column names and the `types` entry the SQLite column types.
The creation of an index can be controlled via the `indexOn` argument, which accepts a list of vectors. Each vector is a pair of the SQLite table name and the column name on which the index should be created. A combined example is sketched below.
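A minimal sketch showing how these options fit together; `expDir` and `dbName` are assumed argument names for illustration, and the column layout is hypothetical (it has to match the output of your aggregation function, see below):

```r
## A sketch; expDir/dbName and the column layout are assumptions.
collectResults(
  expDir = "myExperiment",
  dbName = "results",
  ## Store the simulation output in a table called "prevalence" ...
  resultsName = "prevalence",
  ## ... whose columns and SQLite types are defined here.
  resultsCols = list(
    names = c("scenario_id", "survey", "value"),
    types = c("INTEGER", "INTEGER", "NUMERIC")
  ),
  ## Index the scenario_id column of the prevalence table.
  indexOn = list(c("prevalence", "scenario_id"))
)
```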
## fileFun, readFun and aggrFun
`collectResults()` performs three steps during result collection:

- Determine which OpenMalaria output files should be read
- Read these files
- Perform the aggregation

Each of these steps can be modified by the user, changing the behavior of `collectResults()`. Users can write their own functions, with almost no limitations, and pass them to `collectResults()`'s dedicated arguments.
### fileFun
`fileFun`'s job is to return a vector of scenario XML file names. By default, the `file` column of the scenario data frame is loaded from the cache. Usually this would be a function which filters the scenarios data frame in order to determine the files, as in the sketch below.
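A hypothetical example, assuming the cached scenarios can be read with a helper like `readScenarios()` and that the scenario data frame has a `setting` column (both assumptions for illustration):

```r
## A sketch; readScenarios() and the "setting" column are assumptions.
myFileFun <- function(setting) {
  scens <- readScenarios()
  ## Keep only the files of scenarios matching the requested setting.
  scens[scens[["setting"]] == setting, "file"]
}
```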
### readFun
`readFun` is called on each output file found and has to return a data frame, which is then used by `aggrFun` or put into the database. `readOutputFile()` should be sufficient for most uses. In case you want to write your own implementation, make sure that the first argument the function accepts is the file name and that it has R's ellipsis `...` in its arguments. This is necessary so the scenario ID can be passed automatically and used by your function via `scenID`.
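A minimal sketch of a custom `readFun`. The four-column, tab-separated layout is an assumption about the OpenMalaria output files, and `scenID` is assumed to arrive as a named argument via `...`:

```r
## A sketch; the column layout of the output file is an assumption.
myReadFun <- function(file, ...) {
  ## scenID is passed automatically by collectResults() via `...`.
  scenID <- list(...)[["scenID"]]
  out <- utils::read.table(
    file, sep = "\t",
    col.names = c("survey", "group", "measure", "value")
  )
  out[["scenario_id"]] <- scenID
  out
}
```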
### aggrFun
`aggrFun` is probably the function most users will want to adjust to their needs, and we do not provide any default. It needs to accept the data frame generated by `readFun` as its first argument and should return the final data frame which will be put into the database. This data frame must not have a column named `experiment_id` and can contain a column named `scenario_id`. Furthermore, the output needs to match the layout defined in `resultsCols`.
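A hypothetical `aggrFun`, building on the `myReadFun` sketch above and assuming the data frame it produces:

```r
## A sketch; assumes the columns produced by myReadFun above.
myAggrFun <- function(df) {
  ## Sum the value column per scenario and survey.
  agg <- stats::aggregate(value ~ scenario_id + survey, data = df, FUN = sum)
  ## The result must match the layout defined in resultsCols,
  ## e.g. names = c("scenario_id", "survey", "value").
  agg[, c("scenario_id", "survey", "value")]
}
```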
## Function arguments
The arguments for the above three functions have to be passed as a (named) list. If you need to pass unevaluated arguments, consider using `quote()` and `bquote()`. `collectResults()` will try to remove all quoting before the arguments are passed to the functions.
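Putting the sketches above together; the `fileFunArgs` argument name is an assumption for how the named list is passed, and an unevaluated argument would be wrapped with `quote()` or `bquote()` inside that list:

```r
## A sketch; the fileFunArgs argument name is an assumption.
collectResults(
  expDir = "myExperiment",
  fileFun = myFileFun,
  ## Arguments are passed as a (named) list.
  fileFunArgs = list(setting = "high_transmission"),
  readFun = myReadFun,
  aggrFun = myAggrFun
)
```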
## Parallelization and strategy
Users can control the number of CPU cores used for the calculations via the `ncores` and `ncoresDT` arguments. Similar to the scenario generation and simulation steps, `ncores` determines how many R cluster nodes will be launched to perform the calculations. `ncoresDT` sets the number of threads available to `data.table`, which we use extensively. By default this is set to one in order to avoid nested parallelization.
The `strategy` argument accepts `"serial"` and `"batch"` as input values and defines how `collectResults()` processes the files. With `"batch"`, an R cluster is launched to read all files in parallel into one single data table. The aggregation function is then applied and the whole output batch is added to the database. In our experience this is the fastest approach when using SQLite, but it can require a lot of memory, depending on how many scenarios are aggregated.
"serial"
means that we will launch an R cluster and each
node will read a file, apply the aggregation function and send the
result to the database. With SQLite, this process has been slower than
the "batch"
stragety but also less memory intensive. We
expect it to be quite performant when a database system like PostgresSQL
is used which allows multiple simultaneous write connections.
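A sketch of a parallel collection run; the argument values are purely illustrative:

```r
## A sketch; four cluster nodes, one data.table thread per node.
collectResults(
  expDir = "myExperiment",
  ncores = 4,
  ncoresDT = 1,
  strategy = "batch"
)
```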