---
title: "Standardization and integration of different datasets"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Standardization and integration of different datasets}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  echo = TRUE,
  warning = FALSE,
  eval = TRUE
)
```

```{r, echo = FALSE, eval = TRUE, include = FALSE, messages = FALSE}
library(bdc)
```

#### **Introduction**

The first step of the *bdc* package handles the harmonization of heterogeneous datasets into a standard format simply and efficiently. How is this accomplished? Basically, by replacing the headers of the original datasets with standardized terms. To do so, you have to fill out a **configuration table** indicating which field names (i.e., column headers) of each original dataset match a list of **Darwin Core standard terms**. Once standardized, the datasets are integrated into a standardized database containing the minimum set of terms required for sharing biodiversity data and metadata across a wide variety of biodiversity applications ([Simple Darwin Core standards](https://dwc.tdwg.org/terms/)).

------------------------------------------------------------------------

We demonstrate the package's usefulness using records of terrestrial plant species occurring in Brazil obtained from nine data aggregators (e.g., GBIF, SpeciesLink, SiBBr, and iDigBio, among others). The **example datasets** are available for download; please click on "Download all" to obtain all datasets. Next, copy the path on your computer where the datasets were saved. This path will be used in the configuration table (i.e., metadata) to read the datasets and create an **integrated and standardized database**.
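To make the mapping concrete, the sketch below shows what a single row of such a configuration table could look like. It is purely illustrative: the dataset label, file path, and original column names are hypothetical, and column names such as `datasetName` are assumptions here; the exact set of columns to fill in is defined by the example *DatabaseInfo.csv* shown in the next section.

```{r eval=FALSE}
# Illustrative only: one row of a configuration table mapping the headers of
# an original dataset to Darwin Core terms. Original headers are given as
# character strings; terms with no counterpart in the file are set to NA.
data.frame(
  datasetName      = "MY_HERBARIUM",                               # label identifying the dataset (hypothetical)
  fileName         = "C:/Users/myname/Documents/my_herbarium.csv", # full path to the original .csv file
  scientificName   = "species_name",                               # original column holding species names
  decimalLatitude  = "lat",                                        # original column holding latitude
  decimalLongitude = "long",                                       # original column holding longitude
  occurrenceID     = NA                                            # no matching column in the original file
)
```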
⚠️**IMPORTANT:**

- Original datasets must be formatted in comma-separated format (**.csv**);

- When filling out the configuration table, please provide the **exact** name of each column of the original dataset and the **full path** of each original dataset;

- Column names of the original datasets without a corresponding Darwin Core term must be filled in as **NA** (not available);

- The function is **adjustable** so that you can insert other fields in the configuration table according to your needs. In such cases, we strongly recommend that the added terms follow the Darwin Core standards.

#### **Installation**

Check [**here**](https://brunobrr.github.io/bdc/#installation) how to install the *bdc* package.

#### **Read the configuration table**

Read an example of the **configuration table**. You can download the table by clicking on the "CSV" button. Next, indicate in the configuration table the path to the folder containing the example datasets.

```{r echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE}
metadata <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo.csv", package = "bdc"),
                  show_col_types = FALSE)
```
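As an optional sanity check of a filled-in table, the sketch below counts how many columns of each original dataset were mapped to a standard term (i.e., are not `NA`). It assumes the table contains the `datasetName` and `fileName` columns used in the example *DatabaseInfo.csv*; adjust the excluded columns if your table differs.

```{r eval=FALSE}
# Optional check (sketch): number of original columns mapped to a standard
# term for each dataset; unmapped terms appear as NA in the table
mapped_cols <- setdiff(names(metadata), c("datasetName", "fileName"))
data.frame(
  dataset  = metadata$datasetName,
  n_mapped = rowSums(!is.na(metadata[, mapped_cols]))
)
```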
```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE}
DT::datatable(
  metadata,
  class = 'stripe',
  extensions = c('FixedColumns', 'Buttons'),
  rownames = FALSE,
  options = list(
    dom = 'Bfrtip',
    scrollX = TRUE,
    pageLength = 9,
    buttons = c('copy', 'csv', 'print'),
    fixedColumns = list(leftColumns = 2),
    editable = 'cell'
  )
)
```

**Changing the path to the folder containing the datasets.**

```{r eval=FALSE}
# Path to the folder containing the example datasets. For instance:
path <- "C:/Users/myname/Documents/myproject/input_files/"

# In the configuration table, replace the GitHub path with the path to the
# folder on your computer containing the example datasets
metadata$fileName <-
  gsub(pattern = "https://raw.githubusercontent.com/brunobrr/bdc/master/inst/extdata/input_files/",
       replacement = path,
       x = metadata$fileName)
```
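After updating the paths, it is worth confirming that every file listed in `fileName` can actually be found on disk, since a mistyped path is a common source of errors at this step. A minimal check, assuming `fileName` now holds local paths rather than URLs:

```{r eval=FALSE}
# All entries should be TRUE once fileName points to local .csv files
file.exists(metadata$fileName)
```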
#### **Standardization and integration of datasets**

Now, let's merge the different datasets into a standardized database. Note that the standardized database integrating all datasets can be saved in the folder **"Output/Intermediate" as "00_merged_database"** if `save_database = TRUE`. The database is saved with a **"csv"** or **"qs"** extension, "qs" being a helpful format for quickly saving and reading large databases. "qs" files can be read using the function "qread" from the "qs" package.

```{r message=FALSE, warning=FALSE, eval=FALSE}
database <-
  bdc_standardize_datasets(metadata = metadata,
                           format = "csv",
                           overwrite = TRUE,
                           save_database = TRUE)
#> Standardizing AT_EPIPHYTES file
#> Standardizing BIEN file
#> Standardizing DRYFLOR file
#> Standardizing GBIF file
#> Standardizing ICMBIO file
#> Standardizing IDIGBIO file
#> Standardizing NEOTROPTREE file
#> Standardizing SIBBR file
#> Standardizing SPECIESLINK file
#>
#> C:/Users/Bruno R. Ribeiro/Desktop/bdc/Output/Intermediate/00_merged_database.csv was created
```

```{r echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
database <-
  readr::read_csv(system.file("extdata/outpus_vignettes/00_merged_database.csv", package = "bdc"),
                  show_col_types = FALSE)
```

An example of a standardized database containing the required fields to run the *bdc* package.

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE}
DT::datatable(
  database[1:15, ],
  class = 'stripe',
  extensions = 'FixedColumns',
  rownames = FALSE,
  options = list(
    pageLength = 5,
    dom = 'Bfrtip',
    scrollX = TRUE,
    fixedColumns = list(leftColumns = 2)
  )
)
```
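If you choose `format = "qs"` instead, the merged database can be read back with `qread()` from the "qs" package. A minimal sketch, assuming the default output folder mentioned above:

```{r eval=FALSE}
# Read a merged database saved in "qs" format (path assumed from the
# default "Output/Intermediate" folder used above)
database <- qs::qread("Output/Intermediate/00_merged_database.qs")
```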
⚠️**IMPORTANT:**

The standardized database contains information on species taxonomy, geolocation, date of collection, and other relevant contextual information. Each field is classified into three categories according to its importance for running the functions: i) **required**, i.e., the minimum information necessary to run the function; ii) **recommended**, i.e., not mandatory but containing important details on species records; and iii) **additional**, i.e., information potentially useful for detailed data analyses.

Below are listed the specifications of each field of the configuration table:

- **Field**: Name of the fields in *DatabaseInfo.csv* to be filled in.

- **Category**: Classification of each field in *DatabaseInfo.csv*: i) *required (RQ)*, i.e., the minimum information necessary to run the function; ii) *recommended (RE)*, i.e., not mandatory but containing important details on species records; and iii) *additional (AD)*, i.e., information potentially useful for detailed data analyses. As general guidance, be careful to include all *required* fields and supply as many recommended and additional fields as possible.

- **Description**: Description of the content of the specified field in the original database.

- **Type**: Data type of the specified field in the original database.

- **Example**: An example of the content of the specified field in the original database.
```{r echo=TRUE, message=FALSE, warning=FALSE, eval=TRUE}
config_description <-
  readr::read_csv(system.file("extdata/Config/DatabaseInfo_description.csv", package = "bdc"),
                  show_col_types = FALSE)
```

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE}
DT::datatable(
  config_description,
  rownames = FALSE,
  class = 'stripe',
  extensions = c('FixedColumns', 'Buttons'),
  options = list(
    dom = 'Bfrtip',
    scrollX = TRUE,
    pageLength = 9,
    buttons = c('copy', 'csv', 'print'),
    fixedColumns = list(leftColumns = 2),
    editable = 'cell'
  )
)
```
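To quickly list only the fields you cannot omit, the description table can be filtered by category. A small sketch, assuming the `Category` column stores the codes *RQ*, *RE*, and *AD* described above (check the values in your copy of *DatabaseInfo_description.csv*):

```{r eval=FALSE}
# Fields classified as required (RQ): the minimum set to fill in before
# running bdc_standardize_datasets()
subset(config_description, Category == "RQ")
```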