`collectResults()` is a powerful function, and the aim of this vignette is to explain its options in more detail.
## Output database
`collectResults()` creates the following database tables:

- `experiments`: Which experiments are stored in the DB?
- `scenarios`: Which scenarios are available per experiment?
- `placeholders`: Which placeholders are used per scenario?
- `scenarios_metadata`: What metadata is available per scenario?
- `results`: OpenMalaria output (per scenario).
Please note that all data are identified by experiment and possibly by scenario.
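To get a feel for the layout, the database can be inspected with DBI; a quick sketch, assuming the `RSQLite` backend and a database file named `results.sqlite` (the file name is an assumption):

```r
## A sketch; the database file name is an assumption.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "results.sqlite")
dbListTables(con)
#> [1] "experiments" "placeholders" "results" "scenarios" "scenarios_metadata"
dbDisconnect(con)
```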
The table containing the simulation output (`results`) can be renamed via the `resultsName` argument, and its layout can be adjusted as needed via `resultsCols`. `resultsCols` needs to be a list in which the `names` entry defines the column names and the `types` entry the SQLite column types.
The creation of an index can be controlled via the `indexOn` argument, which accepts a list of vectors. Each vector is a pair of the SQLite table name and the column name on which the index should be created. A combined example is sketched below.
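A minimal sketch showing how these options fit together; `expDir` and `dbName` are assumed argument names for illustration, and the column layout is hypothetical (it has to match the output of your aggregation function, see below):

```r
## A sketch; expDir/dbName and the column layout are assumptions.
collectResults(
  expDir = "myExperiment",
  dbName = "results",
  ## Store the simulation output in a table called "prevalence" ...
  resultsName = "prevalence",
  ## ... whose columns and SQLite types are defined here.
  resultsCols = list(
    names = c("scenario_id", "survey", "value"),
    types = c("INTEGER", "INTEGER", "NUMERIC")
  ),
  ## Index the scenario_id column of the prevalence table.
  indexOn = list(c("prevalence", "scenario_id"))
)
```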
## fileFun, readFun and aggrFun
`collectResults()` performs three steps during result collection:

- Determine which OpenMalaria output files should be read
- Read these files
- Perform the aggregation

Each of these steps can be modified by the user, changing the behavior of `collectResults()`. Users can write their own functions, with almost no limitations, and pass them to `collectResults()`'s dedicated arguments.
### fileFun
`fileFun`'s job is to return a vector of scenario XML file names. By default, the `file` column of the scenario data frame is loaded from the cache. Usually this would be a function which filters the scenarios data frame in order to determine the files, as in the sketch below.
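A hypothetical example, assuming the cached scenarios can be read with a helper like `readScenarios()` and that the scenario data frame has a `setting` column (both assumptions for illustration):

```r
## A sketch; readScenarios() and the "setting" column are assumptions.
myFileFun <- function(setting) {
  scens <- readScenarios()
  ## Keep only the files of scenarios matching the requested setting.
  scens[scens[["setting"]] == setting, "file"]
}
```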
### readFun
`readFun` is called on each output file found and has to return a data frame, which is then used by `aggrFun` or put into the database. `readOutputFile()` should be sufficient for most uses. In case you want to write your own implementation, make sure that the first argument the function accepts is the file name and that it has R's ellipsis `...` in its arguments. This is necessary so the scenario ID can be passed automatically and used by your function via `scenID`.
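A minimal sketch of a custom `readFun`. The four-column, tab-separated layout is an assumption about the OpenMalaria output files, and `scenID` is assumed to arrive as a named argument via `...`:

```r
## A sketch; the column layout of the output file is an assumption.
myReadFun <- function(file, ...) {
  ## scenID is passed automatically by collectResults() via `...`.
  scenID <- list(...)[["scenID"]]
  out <- utils::read.table(
    file, sep = "\t",
    col.names = c("survey", "group", "measure", "value")
  )
  out[["scenario_id"]] <- scenID
  out
}
```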
### aggrFun
`aggrFun` is probably the function most users will want to adjust to their needs, and we do not provide any default. It needs to accept the data frame generated by `readFun` as its first argument and should return the final data frame which will be put into the database. This data frame must not have a column named `experiment_id` and can contain a column named `scenario_id`. Furthermore, the output needs to match the layout defined in `resultsCols`.
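A hypothetical `aggrFun`, building on the `myReadFun` sketch above and assuming the data frame it produces:

```r
## A sketch; assumes the columns produced by myReadFun above.
myAggrFun <- function(df) {
  ## Sum the value column per scenario and survey.
  agg <- stats::aggregate(value ~ scenario_id + survey, data = df, FUN = sum)
  ## The result must match the layout defined in resultsCols,
  ## e.g. names = c("scenario_id", "survey", "value").
  agg[, c("scenario_id", "survey", "value")]
}
```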
## Function arguments
The arguments for the above three functions have to be passed as a (named) list. If you need to pass unevaluated arguments, consider using `quote()` and `bquote()`. `collectResults()` will try to remove all quoting before the arguments are passed to the functions.
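Putting the sketches above together; the `fileFunArgs` argument name is an assumption for how the named list is passed, and an unevaluated argument would be wrapped with `quote()` or `bquote()` inside that list:

```r
## A sketch; the fileFunArgs argument name is an assumption.
collectResults(
  expDir = "myExperiment",
  fileFun = myFileFun,
  ## Arguments are passed as a (named) list.
  fileFunArgs = list(setting = "high_transmission"),
  readFun = myReadFun,
  aggrFun = myAggrFun
)
```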
## Parallelization and strategy
Users can control the number of CPU cores used for the calculations via the `ncores` and `ncoresDT` arguments. Similar to the scenario generation and simulation steps, `ncores` determines how many R cluster nodes will be launched to perform the calculations. `ncoresDT` sets the number of threads available to `data.table`, which we use extensively. By default this is set to one in order to avoid nested parallelization.
The `strategy` argument accepts `"serial"` and `"batch"` as input values and defines how `collectResults()` processes the files. With `"batch"`, an R cluster is launched to read all files in parallel into one single data table. The aggregation function is then applied and the whole output batch is added to the database. In our experience this is the fastest approach when using SQLite, but it can require a lot of memory, depending on how many scenarios are aggregated.
"serial"
means that we will launch an R cluster and each
node will read a file, apply the aggregation function and send the
result to the database. With SQLite, this process has been slower than
the "batch"
stragety but also less memory intensive. We
expect it to be quite performant when a database system like PostgresSQL
is used which allows multiple simultaneous write connections.
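A sketch of a parallel collection run; the argument values are purely illustrative:

```r
## A sketch; four cluster nodes, one data.table thread per node.
collectResults(
  expDir = "myExperiment",
  ncores = 4,
  ncoresDT = 1,
  strategy = "batch"
)
```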