Augment Dataset with Species-Complex Taxonomic Harmonization — augment_with

Enriches a dataset with harmonized taxonomic information by joining a compatibility table and generating synthetic species or complex names where one of the two is missing. Every observation is thereby mapped to the species level, even when only complex-level data is available, so that records are not over- or under-represented in downstream statistical models.

Usage

augment_with_taxonomy(data, compat)

Arguments

data: A `data.frame` containing the dataset to be augmented. Must include a `survey` column, and optionally `species` and/or `complex` columns.
compat: A `data.frame` or tibble providing the taxonomic mapping for each `survey` identifier, typically with `survey`, `species`, and `complex` columns.

Value

A named list with two elements:

`data`: The augmented dataset, with harmonized species/complex names and corresponding numeric IDs.
`species_complex`: A reference tibble mapping unique species and complex combinations to numeric codes.

Details

The following steps are applied to ensure taxonomic consistency:

Taxonomic Join: The `compat` table is joined to the input dataset via the `survey` column.
Synthetic Complex Names: If `complex` is missing (the empty string `""`) for a row that has a species, a synthetic complex is created as `"[species]_complex"`, where `[species]` stands for that row's species name.
Synthetic Species Names: If an entry is missing `species` but has a valid `complex`, a pseudo-species is generated with the label `"unlabel_[complex]"` to ensure all rows are treated as species-level data.
Numeric Identifiers:
- `speciesNb`: numeric ID for each (real or synthetic) species.
- `complexNb`: numeric ID for each complex, offset by 1 to avoid overlap.

These steps bring every record to the same taxonomic granularity, which prevents records from being over- or under-represented when model parameters (e.g., intercepts such as `beta0`) are estimated.
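
The steps listed above can be illustrated with a minimal base R sketch. It is not the package's implementation: `sketch_augment()`, `is_blank()`, and the toy `obs`/`compat` tables are hypothetical names introduced here. The sketch assumes that `data` carries no `species` or `complex` columns of its own, that `NA` counts as missing alongside `""`, and that rows with neither a species nor a complex are left untouched. The exact offset rule for `complexNb` is not spelled out above, so plain 1-based codes are used for both IDs.

```r
# Hypothetical helper: TRUE where a value is NA or the empty string.
is_blank <- function(x) is.na(x) | x == ""

# Hypothetical sketch of the documented steps (not the real function).
sketch_augment <- function(data, compat) {
  # Taxonomic join: left join on `survey`, keeping every row of `data`.
  out <- merge(data, compat, by = "survey", all.x = TRUE, sort = FALSE)
  out$species <- as.character(out$species)
  out$complex <- as.character(out$complex)

  # Record what is missing BEFORE filling anything in, so a synthetic
  # name is never built from another synthetic name.
  no_species <- is_blank(out$species)
  no_complex <- is_blank(out$complex)

  # Synthetic complex names: "[species]_complex".
  fill_complex <- no_complex & !no_species
  out$complex[fill_complex] <- paste0(out$species[fill_complex], "_complex")

  # Synthetic species names: "unlabel_[complex]".
  fill_species <- no_species & !no_complex
  out$species[fill_species] <- paste0("unlabel_", out$complex[fill_species])

  # Numeric identifiers (assumption: simple 1-based factor codes).
  out$speciesNb <- as.integer(factor(out$species))
  out$complexNb <- as.integer(factor(out$complex))

  # Reference table of unique species/complex combinations.
  species_complex <- unique(
    out[, c("species", "speciesNb", "complex", "complexNb")]
  )
  rownames(species_complex) <- NULL

  list(data = out, species_complex = species_complex)
}

# Toy input: s2 lacks a complex, s3 lacks a species.
obs <- data.frame(survey = c("s1", "s2", "s3"), count = c(4, 7, 2))
compat <- data.frame(
  survey  = c("s1", "s2", "s3"),
  species = c("sp_a", "sp_b", ""),
  complex = c("cx_1", "", "cx_2"),
  stringsAsFactors = FALSE
)

res <- sketch_augment(obs, compat)
res$data
res$species_complex
```

Computing the two missing-value masks before any filling is a deliberate choice in this sketch: it keeps the two synthetic-name rules independent of the order in which they are applied.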