Enriches a dataset with harmonized taxonomic information by joining a compatibility table and generating synthetic entries where names are missing. Every observation is thereby mapped at the species level, even when only complex-level data is available, so that records at different taxonomic resolutions are not over- or under-represented in downstream statistical models.

augment_with_taxonomy(data, compat)

Arguments

data

A `data.frame` containing the dataset to be augmented. Must include a `survey` column, and optionally `species` and/or `complex` columns.

compat

A `data.frame` or tibble providing the taxonomic mapping for each `survey` identifier, typically with `survey`, `species`, and `complex` columns.
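
As an illustration of the expected input shapes, the following base R snippet builds a toy `data` and `compat` pair. All survey identifiers, taxon names, and the `count` column are invented for this example. Using `""` to mark a missing `species` is an assumption; the documentation only states that convention for `complex`.

```r
# Toy inputs; every value here is invented for illustration.
data <- data.frame(
  survey = c("S1", "S1", "S2", "S3"),
  count  = c(12, 3, 7, 5),              # hypothetical measurement column
  stringsAsFactors = FALSE
)

compat <- data.frame(
  survey  = c("S1", "S2", "S3"),
  species = c("sp_a", "", "sp_c"),      # "" assumed to mark a missing species
  complex = c("cx_1", "cx_2", ""),      # "" marks a missing complex
  stringsAsFactors = FALSE
)

# The only column stated as mandatory for `data` is `survey`.
stopifnot("survey" %in% names(data))

# Surveys present in `data` but absent from the mapping table.
unmapped <- setdiff(data$survey, compat$survey)
```

Checking `unmapped` before calling the function is optional, but it makes surveys without a taxonomic mapping visible early.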

Value

A named list with two elements:

`data`

The augmented dataset, with harmonized species/complex names and corresponding numeric IDs.

`species_complex`

A reference tibble mapping unique species and complex combinations to numeric codes.

Details

The following steps are applied to ensure taxonomic consistency:

  1. Taxonomic Join: The `compat` table is joined to the input dataset via the `survey` column.

  2. Synthetic Complex Names: If `complex` is missing (`""`) for a given species, a synthetic complex is created as `"[species]_complex"`.

  3. Synthetic Species Names: If an entry is missing `species` but has a valid `complex`, a pseudo-species is generated with the label `"unlabel_[complex]"` to ensure all rows are treated as species-level data.

  4. Numeric Identifiers:

    • `speciesNb`: numeric ID for each (real or synthetic) species.

    • `complexNb`: numeric ID for each complex, offset by 1 to avoid overlap.

Together, these steps bring all records to a common taxonomic granularity, so that records are neither over- nor under-represented when model parameters (e.g., intercepts such as `beta0`) are estimated.
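
Steps 1 to 4 above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the function's actual implementation: the toy values are invented, `is_missing()` is a hypothetical helper, missing names are assumed to be `""` or `NA`, and the numeric IDs are plain 1-based factor codes that do not reproduce the offset applied to `complexNb`.

```r
# Toy inputs (invented values); "" stands for a missing name.
data <- data.frame(survey = c("S1", "S2", "S3", "S1"),
                   count  = c(4, 9, 2, 6),
                   stringsAsFactors = FALSE)
compat <- data.frame(survey  = c("S1", "S2", "S3"),
                     species = c("sp_a", "sp_b", ""),
                     complex = c("cx_1", "", "cx_2"),
                     stringsAsFactors = FALSE)

# Step 1: left join on `survey`, keeping every row of `data`.
out <- merge(data, compat, by = "survey", all.x = TRUE, sort = FALSE)

# Hypothetical helper: treat NA (unmatched survey) and "" as missing.
is_missing <- function(x) is.na(x) | x == ""

# Step 2: a species without a complex gets "<species>_complex".
no_complex <- is_missing(out$complex) & !is_missing(out$species)
out$complex[no_complex] <- paste0(out$species[no_complex], "_complex")

# Step 3: a complex without a species gets "unlabel_<complex>".
no_species <- is_missing(out$species) & !is_missing(out$complex)
out$species[no_species] <- paste0("unlabel_", out$complex[no_species])

# Step 4: numeric IDs as simple factor codes (assumption: the real
# function may number them differently, e.g. with an offset).
out$speciesNb <- as.integer(factor(out$species))
out$complexNb <- as.integer(factor(out$complex))

# Reference table of unique species/complex combinations.
species_complex <- unique(
  out[, c("species", "complex", "speciesNb", "complexNb")]
)
```

After these steps, every row of `out` carries both a species and a complex label, whether real or synthetic, which is the property the function relies on to treat all records as species-level data.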