--- title: "Taxonomy" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Taxonomy} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%", echo = TRUE, warning = FALSE, eval = T ) ``` ```{r, echo = FALSE, eval = TRUE, include = FALSE, messages = FALSE} library(bdc) ``` #### **Introduction** The combination of large datasets from several sources requires careful harmonization of taxon names. Non-standardized, incorrect, ambiguous, or synonymous taxonomic names can lead to **unreliable results and conclusions**. The *bdc* package includes tools to help standardize **major taxonomic groups' names** (e.g., animals and plants) by comparing scientific names against **one out of 10 taxonomic databases**. The taxonomic harmonization process borrows heavily from [Norman et al. (2020); taxadb package)](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13440), which contains functions that allow **querying millions of taxonomic names** in a fast, automated, and consistent way using high-quality locally stored taxonomic databases. The functions used to harmonize names (*bdc_clean_name* and *bdc_query_names_taxadb*) **contain several additions to the *taxadb* package**, including tools for: 1. Cleaning and parsing scientific names; 2. Resolving misspelled names or variant spellings using a fuzzy matching application; 3. Converting nomenclatural synonyms to current accepted name; 4. Flagging ambiguous results; 5. Reporting the results. The taxonomic harmonization is based on one taxonomic database chosen by the user. *taxadb* makes available the following taxonomic sources: | Abbreviation | Taxonomic database | |--------------|-----------------------------------------------------------| | col | Catalogue of Life | | fb | FishBase | | itis | Integrated Taxonomic Information System | | iucn | International Union for Conservation of Nature's Red List | | gbif | Global Biodiversity Information Facility | | ncbi | National Center for Biotechnology Information | | ott | OpenTree Taxonomy | | tpl | The Plant List | | slb | SeaLifeBase | | wd | Wikidata | ⚠️**IMPORTANT:** Note that some taxonomic databases may be *momentarily unavailable* in taxadb. Check ?bdc_query_names_taxadb for a list of available taxonomic databases. #### **Installation** Check [**here**](https://brunobrr.github.io/bdc/#installation) how to install the bdc package. #### **Reading the database** Read the database created in the [**Pre-filter**](https://brunobrr.github.io/bdc/articles/prefilter.html) module of the *bdc* package. It is also possible to read any datasets containing the **required** fields to run the package (more details [here](https://brunobrr.github.io/bdc/articles/integrate_datasets.html)). 
```{r echo=TRUE, eval=FALSE}
database <-
  readr::read_csv(here::here("Output/Intermediate/01_prefilter_database.csv"))
```

```{r echo=FALSE, eval=TRUE}
database <-
  readr::read_csv(
    system.file(
      "extdata/outpus_vignettes/01_prefilter_database.csv",
      package = "bdc"
    ),
    show_col_types = FALSE
  )
```

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE}
DT::datatable(
  database[1:15, ],
  class = 'stripe',
  extensions = 'FixedColumns',
  rownames = FALSE,
  options = list(
    pageLength = 3,
    dom = 'Bfrtip',
    scrollX = TRUE,
    fixedColumns = list(leftColumns = 2)
  )
)
```

#### **Clean and parse species names**

Improperly formatted scientific names usually cannot be matched with valid names. To solve this issue, we developed *bdc_clean_names*, which **unifies the writing style of scientific names**. This optimizes taxonomic queries by increasing the probability of finding matching names. The function is used to:

1. Flag and remove family names pre-pended to species names;
2. Flag and remove qualifiers denoting uncertain or provisional status of taxonomic identification (e.g., confer, species, affinis, among others);
3. Flag and remove infraspecific terms (e.g., variety [var.], subspecies [subsp.], forma [f.], and their spelling variations);
4. Standardize names, i.e., capitalize only the first letter of the genus name and remove extra whitespace;
5. Parse names, i.e., separate author, date, and annotations from the taxon name.

```{r eval=FALSE, echo=TRUE}
parse_names <-
  bdc_clean_names(sci_names = database$scientificName, save_outputs = FALSE)

#> >> Family names prepended to scientific names were flagged and removed from 0 records.
#> >> Terms denoting taxonomic uncertainty were flagged and removed from 1 records.
#> >> Other issues, capitalizing the first letter of the generic name, replacing empty names by NA, and removing extra spaces, were flagged and corrected or removed from 0 records.
#> >> Infraspecific terms were flagged and removed from 1 records.
#>
#> >> Scientific names were cleaned and parsed. Check the results in 'Output/Check/02_clean_names.csv'.
```

```{r echo=FALSE, eval=TRUE}
parse_names <-
  readr::read_csv(
    system.file(
      "extdata/outpus_vignettes/example_clean_names.csv",
      package = "bdc"
    ),
    show_col_types = FALSE
  )
```

An example of *bdc_clean_names* output.

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE}
DT::datatable(
  parse_names[60:75, ],
  rownames = FALSE,
  class = 'stripe',
  extensions = 'FixedColumns',
  options = list(
    pageLength = 5,
    dom = 'Bfrtip',
    scrollX = TRUE,
    fixedColumns = list(leftColumns = 2)
  )
)
```

**Let's merge the parsed names with the complete database.** As the rows of 'parse_names' and 'database' are in the same order, the parsed names can be appended directly to the database. Only the columns "names_clean" and ".uncer_terms" will be used in the downstream analyses. But don't worry, you can check the full results of the name-parsing process in "Output/Check/02_clean_names.csv".

```{r}
parse_names <-
  parse_names %>%
  dplyr::select(.uncer_terms, names_clean)

database <- dplyr::bind_cols(database, parse_names)
```

#### **Names harmonization**

The taxonomic harmonization is based upon **one** of the taxonomic authorities previously mentioned. It starts by creating a local database: the *taxadb* package downloads, extracts, and imports the taxonomic database chosen by the user. The download may take some time, depending on the internet connection.
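If you prefer to set up the local taxonomic database ahead of running the queries (for example, on a connection that drops frequently), *taxadb* can do this step in advance. A minimal sketch, assuming you will later query the "gbif" database; the choice of provider is up to you:

```{r eval=FALSE}
# Download, extract, and import the GBIF taxonomic backbone into taxadb's
# local storage. A later call to bdc_query_names_taxadb() with db = "gbif"
# can then reuse this local copy instead of downloading it on the fly.
taxadb::td_create(provider = "gbif")
```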
⚠️**IMPORTANT:** If you have problems downloading a database, please consider removing previous versions of the taxonomic databases using `fs::dir_delete(taxadb:::taxadb_dir())`.

```{r eval=FALSE}
query_names <- bdc_query_names_taxadb(
  sci_name            = database$names_clean,
  replace_synonyms    = TRUE,     # replace synonyms by accepted names?
  suggest_names       = TRUE,     # try to find a candidate name for misspelled names?
  suggestion_distance = 0.9,      # distance between the searched and suggested names
  db                  = "gbif",   # taxonomic database
  rank_name           = "Plantae",# a taxonomic rank
  rank                = "kingdom",# name of the taxonomic rank
  parallel            = FALSE,    # should parallel processing be used?
  ncores              = 2,        # number of cores to be used in the parallelization process
  export_accepted     = FALSE     # save names linked to multiple accepted names?
)

#> A total of 0 NA was/were found in sci_name.
#>
#> 115 names queried in 3.1 minutes
```

```{r echo=FALSE, eval=TRUE}
query_names <-
  readr::read_csv(
    system.file(
      "extdata/outpus_vignettes/example_query_names.csv",
      package = "bdc"
    ),
    show_col_types = FALSE
  )

DT::datatable(
  query_names[1:20, ],
  class = 'stripe',
  extensions = 'FixedColumns',
  rownames = FALSE,
  options = list(
    pageLength = 5,
    dom = 'Bfrtip',
    scrollX = TRUE,
    fixedColumns = list(leftColumns = 2)
  )
)
```

Next, the results of the taxonomic harmonization process are merged with the original database. Before that, let's rename the column containing the original scientific names to **"verbatim_scientificName"**. From now on, **"scientificName" corresponds to the verified names** (resulting from the name harmonization process). As the columns "original_search" (in "query_names") and "names_clean" contain the same information, only the former will be kept.

```{r}
database <-
  database %>%
  dplyr::rename(verbatim_scientificName = scientificName) %>%
  dplyr::select(-names_clean) %>%
  dplyr::bind_cols(., query_names)
```

#### **Report**

The report is based on the column **notes**, which contains the results of the name harmonization process. The notes can be grouped into two categories: accepted names, and names with a taxonomic issue or warning that need further inspection. Accepted names are returned as "valid" in the column "Description". The report can be automatically saved if `save_report = TRUE`.

```{r eval=FALSE}
report <-
  bdc_create_report(
    data = database,
    database_id = "database_id",
    workflow_step = "taxonomy",
    save_report = FALSE
  )

report
```

```{r echo=FALSE, eval=TRUE}
report <-
  readr::read_csv(
    system.file("extdata/outpus_vignettes/02_Report_taxonomy.csv", package = "bdc"),
    show_col_types = FALSE
  )
```

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE}
DT::datatable(
  report,
  class = 'stripe',
  extensions = 'FixedColumns',
  rownames = FALSE,
  options = list(
    # pageLength = 5,
    dom = 'Bfrtip',
    scrollX = TRUE,
    fixedColumns = list(leftColumns = 2)
  )
)
```

#### **Unresolved names**

It is also possible to filter out records with a taxonomic status other than "accepted". Such records may potentially be resolved manually.

```{r eval=FALSE}
unresolved_names <-
  bdc_filter_out_names(
    data = database,
    col_name = "notes",
    taxonomic_status = "accepted",
    opposite = TRUE
  )
```

Save the table containing unresolved names.

```{r eval=FALSE}
unresolved_names %>%
  readr::write_csv(., here::here("Output/Check/02_unresolved_names.csv"))
```

#### **Filtering the database**

It is possible to remove records with unresolved or invalid names to get a 'clean' database.
However, to ensure that all records are evaluated by all the data quality tests (i.e., the tests of the taxonomic, spatial, and temporal modules of the package), potentially erroneous or suspect records are only removed in the final module of the package.

```{r}
# output <-
#   bdc_filter_out_names(
#     data = database,
#     col_name = "notes",
#     taxonomic_status = "accepted",
#     opposite = FALSE
#   )
```

#### **Saving the database**

You can use [qs::qsave()]{.underline} instead of readr::write_csv() to save a large database in a compressed format.

```{r eval=FALSE}
# use qs::qsave() to save the database in a compressed format and then qs::qread() to load it back
database %>%
  readr::write_csv(., here::here("Output", "Intermediate", "02_taxonomy_database.csv"))
```
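For large databases, the compressed *qs* format mentioned above can substantially reduce file size and read/write time. A minimal sketch of that alternative (the output path mirrors the CSV example and is only a suggestion):

```{r eval=FALSE}
# Save the harmonized database in qs's compressed binary format...
qs::qsave(database, here::here("Output", "Intermediate", "02_taxonomy_database.qs"))

# ...and reload it later, e.g., at the start of the next module
database <- qs::qread(here::here("Output", "Intermediate", "02_taxonomy_database.qs"))
```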