Title: | Biodiversity Data Cleaning |
---|---|
Description: | It brings together several aspects of biodiversity data-cleaning in one place. 'bdc' is organized in thematic modules related to different biodiversity dimensions, including 1) Merge datasets: standardization and integration of different datasets; 2) Pre-filter: flagging and removal of invalid or non-interpretable information, followed by data amendments; 3) Taxonomy: cleaning, parsing, and harmonization of scientific names from several taxonomic groups against taxonomic databases locally stored through the application of exact and partial matching algorithms; 4) Space: flagging of erroneous, suspect, and low-precision geographic coordinates; and 5) Time: flagging and, whenever possible, correction of inconsistent collection date. In addition, it contains features to visualize, document, and report data quality – which is essential for making data quality assessment transparent and reproducible. The reference for the methodology is Bruno et al. (2022) <doi:10.1111/2041-210X.13868>. |
Authors: | Bruno Ribeiro [aut, cre] , Santiago Velazco [aut] , Karlo Guidoni-Martins [aut] , Geiziane Tessarolo [aut] , Lucas Jardim [aut] , Steven Bachman [ctb] , Rafael Loyola [ctb] |
Maintainer: | Bruno Ribeiro <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.5 |
Built: | 2024-10-30 20:57:41 UTC |
Source: | https://github.com/brunobrr/bdc |
This function flags records with an informed basis of records (i.e., the records type, for example, a specimen, a human observation, or a fossil specimen) not interpretable, which does not comply with Darwin Core vocabulary, or unreliable or unsuitable for specific analyses.
bdc_basisOfRecords_notStandard( data, basisOfRecord = "basisOfRecord", names_to_keep = "all" )
bdc_basisOfRecords_notStandard( data, basisOfRecord = "basisOfRecord", names_to_keep = "all" )
data |
data.frame. Containing information about the basis of records. |
basisOfRecord |
character string. The column name with information about basis of records. Default = "basisOfRecord". |
names_to_keep |
character string. Elements of the column BasisOfRecords to keep. Default is "all", which considers a selected list of recommended standard Darwin Core classes (and their spelling variations, see details). By default, records missing (i.e., NA) or with "unknown" information about basis of records are kept. |
Users are encourage to select the set of basis of records classes to keep. Default = c("Event","HUMAN_OBSERVATION", "HumanObservation", "LIVING_SPECIMEN", "LivingSpecimen", "MACHINE_OBSERVATION", "MachineObservation", "MATERIAL_SAMPLE", "O", "Occurrence", "MaterialSample", "OBSERVATION", "Preserved Specimen", "PRESERVED_SPECIMEN", "preservedspecimen Specimen", "Preservedspecimen", "PreservedSpecimen", "preservedspecimen", "S", "Specimen", "Taxon", "UNKNOWN", "", NA)
A data.frame containing the column ".basisOfRecords_notStandard" .Compliant (TRUE) if 'basisOfRecord' is standard; otherwise "FALSE".
Other prefilter:
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_empty()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_outOfRange()
,
bdc_coordinates_transposed()
,
bdc_country_standardized()
,
bdc_scientificName_empty()
x <- data.frame(basisOfRecord = c( "FOSSIL_SPECIMEN", "UNKNOWN", "RON", NA, "Specimen", "PRESERVED_SPECIMEN" )) bdc_basisOfRecords_notStandard( data = x, basisOfRecord = "basisOfRecord", names_to_keep = "all" )
x <- data.frame(basisOfRecord = c( "FOSSIL_SPECIMEN", "UNKNOWN", "RON", NA, "Specimen", "PRESERVED_SPECIMEN" )) bdc_basisOfRecords_notStandard( data = x, basisOfRecord = "basisOfRecord", names_to_keep = "all" )
This function is composed of a series of name-checking routines for cleaning and parsing scientific names; i.e., unify writing style. It removes 1) family names of animals or plants pre-pended to species names, 2) qualifiers denoting the uncertain or provisional status of taxonomic identification (e.g., confer, species, affinis), and 3) infraspecific terms, for example, variety (var.), subspecies (subsp), forma (f.), and their spelling variations. It also includes applications to 4) standardize names, i.e., capitalize only the first letter of the genus name and remove extra whitespaces), and 5) parse names, i.e., separate author, date, annotations from taxon name.
bdc_clean_names(sci_names, save_outputs = FALSE)
bdc_clean_names(sci_names, save_outputs = FALSE)
sci_names |
character string. Containing scientific names. |
save_outputs |
logical. Should the outputs be saved? Default = FALSE. |
Terms denoting uncertainty or provisional status of taxonomic identification as well as infraspecific terms were obtained from Sigoviniet al. (2016; doi: 10.1111/2041-210X.12594). More details about the names parse process can be found in gnparser. Note: GNparser is not automatically installed. Please see guidelines to install gnparser.
A five-column data.frame including
scientificName: original names supplied
.uncer_terms: indicates the presence of taxonomic uncertainty terms
.infraesp_names: indicates the presence of infraspecific terms
name_clean: scientific names resulting from the cleaning and parsing processes
quality: an index indicating the quality of parsing process. It ranges from 0 to 4, being 1 no problem detected, 4 serious problems detected; a value of 0 indicates no interpretable name that was not parsed).
If save_outputs == TRUE, a data.frame containing all tests of the cleaning names process and the results of the parsing names process is saved in "Output/Check/02_parse_names.csv".
Other taxonomy:
bdc_filter_out_names()
,
bdc_query_names_taxadb()
## Not run: scientificName <- c( "Fridericia bahiensis (Schauer ex. DC.) L.G.Lohmann", "Peltophorum dubium (Spreng.) Taub. (Griseb.) Barneby", "Gymnanthes edwalliana (Pax & K.Hoffm.) Laurenio-Melo & M.F.Sales", "LEGUMINOSAE Senna aff. organensis (Glaz. ex Harms) H.S.Irwin & Barneby" ) bdc_clean_names(scientificName, save_outputs = FALSE) ## End(Not run)
## Not run: scientificName <- c( "Fridericia bahiensis (Schauer ex. DC.) L.G.Lohmann", "Peltophorum dubium (Spreng.) Taub. (Griseb.) Barneby", "Gymnanthes edwalliana (Pax & K.Hoffm.) Laurenio-Melo & M.F.Sales", "LEGUMINOSAE Senna aff. organensis (Glaz. ex Harms) H.S.Irwin & Barneby" ) bdc_clean_names(scientificName, save_outputs = FALSE) ## End(Not run)
This function flags geographic coordinates within a reference country. A spatial buffer can be added to the reference country to ensure that records in mangroves, marshes, estuaries, and records with low coordinate precision are not flagged as invalid.
bdc_coordinates_country_inconsistent( data, country_name, country = "country_suggested", lat = "decimalLatitude", lon = "decimalLongitude", dist = 0.1 )
bdc_coordinates_country_inconsistent( data, country_name, country = "country_suggested", lat = "decimalLatitude", lon = "decimalLongitude", dist = 0.1 )
data |
data.frame. Containing longitude and latitude. Coordinates must be expressed in decimal degrees and WGS84. |
country_name |
character string. Name of the country or countries to be considered. |
country |
character string. The column name with the country assignment of each record. It is recommended use a column with corrected and homogenized country names. Default = "country_suggested". |
lat |
character string. The column name with the latitude coordinates. Default = “decimallatitude”. |
lon |
character string. The column name with the longitude coordinates. Default = “decimallongitude”. |
dist |
numeric. The distance in decimal degrees used to created a buffer around the country. Default = 0.1 (~11 km at the equator). |
Multiple countries can be informed, but they are tested separately. The distance reported in the argument 'dist' is used to create a buffer around the reference country. Records within the reference country or at a specified distance from the coastline of the reference country (i.e., records within the buffer) are flagged as valid (TRUE). Note that records within the buffer but in other countries are flagged as invalid (FALSE). Records with invalid (e.g., NA or empty) and out-of-range coordinates are not tested and returned as TRUE.
A data.frame containing the column '.coordinates_country_inconsistent'. Compliant (TRUE) if coordinates fall within the boundaries plus a specified distance (if 'dist' is supplied) of 'country_name'; otherwise "FALSE".
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_empty()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_outOfRange()
,
bdc_coordinates_transposed()
,
bdc_country_standardized()
,
bdc_scientificName_empty()
## Not run: x <- data.frame( country = c("Brazil", "Brazil", "Bolivia", "Argentina", "Peru"), decimalLongitude = c(-40.6003, -39.6, -77.689288, NA, -76.352930), decimalLatitude = c(-19.9358, -13.016667, -20.5243, -35.345940, -11.851872) ) bdc_coordinates_country_inconsistent( data = x, country_name = c("Brazil", "Peru", "Argentina"), country = "country", lon = "decimalLongitude", lat = "decimalLatitude", dist = 0.1 ) ## End(Not run)
## Not run: x <- data.frame( country = c("Brazil", "Brazil", "Bolivia", "Argentina", "Peru"), decimalLongitude = c(-40.6003, -39.6, -77.689288, NA, -76.352930), decimalLatitude = c(-19.9358, -13.016667, -20.5243, -35.345940, -11.851872) ) bdc_coordinates_country_inconsistent( data = x, country_name = c("Brazil", "Peru", "Argentina"), country = "country", lon = "decimalLongitude", lat = "decimalLatitude", dist = 0.1 ) ## End(Not run)
This function flags records missing latitude or longitude coordinates.
bdc_coordinates_empty(data, lat = "decimalLatitude", lon = "decimalLongitude")
bdc_coordinates_empty(data, lat = "decimalLatitude", lon = "decimalLongitude")
data |
data.frame. Containing geographical coordinates. |
lat |
character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
This test identifies records missing geographic coordinates (i.e., empty or not applicable NA longitude or latitude)
A data.frame containing the column ".coordinates_empty". Compliant (TRUE) if 'lat' and 'lon' are not empty; otherwise "FALSE".
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_outOfRange()
,
bdc_coordinates_transposed()
,
bdc_country_standardized()
,
bdc_scientificName_empty()
x <- data.frame( decimalLatitude = c(19.9358, -13.016667, NA, ""), decimalLongitude = c(-40.6003, -39.6, -20.5243, NA) ) bdc_coordinates_empty( data = x, lat = "decimalLatitude", lon = "decimalLongitude" )
x <- data.frame( decimalLatitude = c(19.9358, -13.016667, NA, ""), decimalLongitude = c(-40.6003, -39.6, -20.5243, NA) ) bdc_coordinates_empty( data = x, lat = "decimalLatitude", lon = "decimalLongitude" )
This function Identifies records whose coordinates can potentially be extracted from locality information.
bdc_coordinates_from_locality( data, lat = "decimalLatitude", lon = "decimalLongitude", locality = "locality", save_outputs = FALSE )
bdc_coordinates_from_locality( data, lat = "decimalLatitude", lon = "decimalLongitude", locality = "locality", save_outputs = FALSE )
data |
data.frame. Containing geographical coordinates and the column "locality'. |
lat |
character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
locality |
character string. The column name with locality information. Default = "locality". |
save_outputs |
logical. Should a table containing transposed coordinates saved for further inspection? Default = FALSE. |
According to DarwinCore terminology, locality refers to "the specific description of the place" where an organism was recorded.
A data.frame containing records missing or with invalid coordinates but with potentially useful locality information. When save_outputs = FALSE the data.frame is saved in Output/Check/01_coordinates_from_locality.csv
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_empty()
,
bdc_coordinates_outOfRange()
,
bdc_coordinates_transposed()
,
bdc_country_standardized()
,
bdc_scientificName_empty()
x <- data.frame( lat = c(NA, NA, ""), lon = c("", NA, NA), locality = c("PARAGUAY: ALTO PARAGUAY: CO.; 64KM W PUERTO SASTRE", "Parque Estadual da Serra de Caldas Novas, Goias, Brazil", "Parque Nacional Iguazu")) bdc_coordinates_from_locality( data = x, lat = "lat", lon = "lon", locality = "locality", save_outputs = FALSE)
x <- data.frame( lat = c(NA, NA, ""), lon = c("", NA, NA), locality = c("PARAGUAY: ALTO PARAGUAY: CO.; 64KM W PUERTO SASTRE", "Parque Estadual da Serra de Caldas Novas, Goias, Brazil", "Parque Nacional Iguazu")) bdc_coordinates_from_locality( data = x, lat = "lat", lon = "lon", locality = "locality", save_outputs = FALSE)
This function identifies records with out-of-range coordinates (not between -90 and 90 for latitude; between -180 and 180 for longitude).
bdc_coordinates_outOfRange( data, lat = "decimalLatitude", lon = "decimalLongitude" )
bdc_coordinates_outOfRange( data, lat = "decimalLatitude", lon = "decimalLongitude" )
data |
data.frame. Containing geographical coordinates. Coordinates must be expressed in decimal degrees and WGS84. |
lat |
character string. The column name with latitude in decimal degree and in WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude in decimal degree and in WGS84. Default = "decimalLongitude". |
A data.frame containing the column ".coordinates_outOfRange". Compliant (TRUE) if 'lat' and 'lon' are not out-of-range; otherwise "FALSE".
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_empty()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_transposed()
,
bdc_country_standardized()
,
bdc_scientificName_empty()
x <- data.frame( decimalLatitude = c(-185.111, -43.34, "", -21.8069444), decimalLongitude = c(-45.4, -39.6, -20.5243, -440.9055555) ) bdc_coordinates_outOfRange( data = x, lat = "decimalLatitude", lon = "decimalLongitude" )
x <- data.frame( decimalLatitude = c(-185.111, -43.34, "", -21.8069444), decimalLongitude = c(-45.4, -39.6, -20.5243, -440.9055555) ) bdc_coordinates_outOfRange( data = x, lat = "decimalLatitude", lon = "decimalLongitude" )
This function flags records with a coordinate precision below a specified number of decimal places. Coordinates with one, two, or three decimal places present a precision of~11.1 km, ~1.1 km, and ~111 m at the equator, respectively.
bdc_coordinates_precision( data, lat = "decimalLatitude", lon = "decimalLongitude", ndec = c(0, 1, 2) )
bdc_coordinates_precision( data, lat = "decimalLatitude", lon = "decimalLongitude", ndec = c(0, 1, 2) )
data |
data.frame. A data.frame containing geographic coordinates. |
lat |
character string. The column with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
ndec |
numeric. The minimum number of decimal places that the coordinates should have to be considered valid. Default = 2. |
A data.frame with logical values indicating whether values are equal or higher than the specified minimum decimal number (ndec). Coordinates flagged as FALSE in .rou column are considered imprecise.
x <- data.frame( lat = c(-21.34, 23.567, 16.798, -10.468), lon = c(-55.38, -13.897, 30.8, 90.675) ) bdc_coordinates_precision( data = x, lat = "lat", lon = "lon", ndec = 3 )
x <- data.frame( lat = c(-21.34, 23.567, 16.798, -10.468), lon = c(-55.38, -13.897, 30.8, 90.675) ) bdc_coordinates_precision( data = x, lat = "lat", lon = "lon", ndec = 3 )
This function flags and corrects records when latitude and longitude appear to be transposed.
bdc_coordinates_transposed( data, id = "database_id", sci_names = "scientificName", lat = "decimalLatitude", lon = "decimalLongitude", country = "country", countryCode = "countryCode", border_buffer = 0.2, save_outputs = FALSE )
bdc_coordinates_transposed( data, id = "database_id", sci_names = "scientificName", lat = "decimalLatitude", lon = "decimalLongitude", country = "country", countryCode = "countryCode", border_buffer = 0.2, save_outputs = FALSE )
data |
data.frame. Containing a unique identifier for each record, geographical coordinates, and country names. Coordinates must be expressed in decimal degrees and WGS84. |
id |
character string. The column name with a unique record identifier. Default = "database_id". |
sci_names |
character string. The column name with species scientific name. Default = "scientificName". |
lat |
character string. The column name with latitude. Coordinates must be expressed in decimal degrees and WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude. Coordinates must be expressed in decimal degrees and WGS84. Default = "decimalLongitude". |
country |
character string. The column name with the country assignment of each record. Default = "country". |
countryCode |
character string. The column name with an ISO-2 country code. |
border_buffer |
numeric >= 0. A distance in decimal degrees used to created a buffer around the country. Records within a given country and at a specified distance from the border will be not be corrected. Default = 0.2 (~22 km at the equator). |
save_outputs |
logical. Should a table containing transposed coordinates saved for further inspection? Default = FALSE. |
This test identifies transposed coordinates resulted from mismatches
between the country informed for a record and coordinates. Transposed
coordinates often fall outside of the indicated country (i.e., in other
countries or in the sea). Different coordinate transformations are
performed to correct country/coordinates mismatches. Importantly, verbatim
coordinates are replaced by the corrected ones in the returned database. A
database containing verbatim and corrected coordinates is created in
"Output/Check/01_coordinates_transposed.csv" if save_outputs == TRUE. The
columns "country" and "countryCode" can be retrieved by using the function
bdc_country_standardized
.
A data.frame containing the column "coordinates_transposed" indicating if verbatim coordinates were not transposed (TRUE). Otherwise records are flagged as (FALSE) and, in this case, verbatim coordinates are replaced by corrected coordinates.
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_empty()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_outOfRange()
,
bdc_country_standardized()
,
bdc_scientificName_empty()
## Not run: id <- c(1, 2, 3, 4) scientificName <- c( "Rhinella major", "Scinax ruber", "Siparuna guianensis", "Psychotria vellosiana" ) decimalLatitude <- c(63.43333, -14.43333, -41.90000, -46.69778) decimalLongitude <- c(-17.90000, -67.91667, -13.25000, -13.82444) country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil") x <- data.frame( id, scientificName, decimalLatitude, decimalLongitude, country ) # Get country code x <- bdc_country_standardized(data = x, country = "country") bdc_coordinates_transposed( data = x, id = "id", sci_names = "scientificName", lat = "decimalLatitude", lon = "decimalLongitude", country = "country_suggested", countryCode = "countryCode", border_buffer = 0.2, save_outputs = FALSE ) ## End(Not run)
## Not run: id <- c(1, 2, 3, 4) scientificName <- c( "Rhinella major", "Scinax ruber", "Siparuna guianensis", "Psychotria vellosiana" ) decimalLatitude <- c(63.43333, -14.43333, -41.90000, -46.69778) decimalLongitude <- c(-17.90000, -67.91667, -13.25000, -13.82444) country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil") x <- data.frame( id, scientificName, decimalLatitude, decimalLongitude, country ) # Get country code x <- bdc_country_standardized(data = x, country = "country") bdc_coordinates_transposed( data = x, id = "id", sci_names = "scientificName", lat = "decimalLatitude", lon = "decimalLongitude", country = "country_suggested", countryCode = "countryCode", border_buffer = 0.2, save_outputs = FALSE ) ## End(Not run)
Country names derived from valid geographic coordinates are added to records missing country names.
bdc_country_from_coordinates( data, lat = "decimalLatitude", lon = "decimalLongitude", country = "country" )
bdc_country_from_coordinates( data, lat = "decimalLatitude", lon = "decimalLongitude", country = "country" )
data |
data.frame. Containing geographical coordinates and country names. |
lat |
character string. The column name with latitude in decimal degrees and WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude in decimal degrees and WGS84. Default = "decimalLongitude". |
country |
character string. The column name with the country assignment of each record. Default = "country". If no column name is provided a new column "country" is created. |
This function assigns a country name for records missing such information. Country names are extracted from valid geographic coordinates using a high-quality map of the world (rnaturalearth package). No country name is added to records whose coordinates are in the sea.
A tibble containing country names for records missing such information.
## Not run: x <- data.frame( decimalLatitude = c(-22.9834, -39.857030, -17.06811, -46.69778), decimalLongitude = c(-69.095, -68.443588, 37.438108, -13.82444), country = c("", NA, NA, "Brazil")) bdc_country_from_coordinates( data = x, lat = "decimalLatitude", lon = "decimalLongitude", country = "country" ) ## End(Not run)
## Not run: x <- data.frame( decimalLatitude = c(-22.9834, -39.857030, -17.06811, -46.69778), decimalLongitude = c(-69.095, -68.443588, 37.438108, -13.82444), country = c("", NA, NA, "Brazil")) bdc_country_from_coordinates( data = x, lat = "decimalLatitude", lon = "decimalLongitude", country = "country" ) ## End(Not run)
This function standardizes country names and adds a new column to the database containing two-letter country codes (ISO 3166-1 alpha-2).
bdc_country_standardized(data, country = "country")
bdc_country_standardized(data, country = "country")
data |
data.frame. Containing country names |
country |
character string. The column name with the country assignment of each record. Default = "country". |
Country names are standardized using an exact matching against a list of country names in several languages from International Organization for Standardization. If any unmatched names remain, a fuzzy matching algorithm is used to find potential candidates for each misspelled countries names.
A data.frame containing two columns: country_suggested (standardized country names) and country_code (two-letter country codes; more details in World Countries, International Organization for Standardization).
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_empty()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_outOfRange()
,
bdc_coordinates_transposed()
,
bdc_scientificName_empty()
## Not run: country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil", "BREZIL") x <- data.frame(country) bdc_country_standardized( data = x, country = "country" ) ## End(Not run)
## Not run: country <- c("BOLIVIA", "bolivia", "Brasil", "Brazil", "BREZIL") x <- data.frame(country) bdc_country_standardized( data = x, country = "country" ) ## End(Not run)
Creates figures (i.e., bar plots, maps, and histograms) reporting the results of data quality tests implemented in the bdc package.
bdc_create_figures( data, database_id = "database_id", workflow_step = NULL, save_figures = FALSE )
bdc_create_figures( data, database_id = "database_id", workflow_step = NULL, save_figures = FALSE )
data |
data.frame. Containing the results of data quality tests; that is, columns starting wit ".". |
database_id |
character string. The column name with a unique record identifier. Default = "database_id". |
workflow_step |
character string. Name of the workflow step. Options available are "prefilter", "space", and "time". |
save_figures |
logical. Should the figures be saved for further inspection? Default = FALSE. |
This function creates figures based on the results of data quality tests implemented. A pre-defined list of test names is used for creating figures depending on the name of the workflow step informed. Figures are saved in "Output/Figures" if save_figures == TRUE.
List containing figures showing the results of data quality test implemented in one module of bdc. When save_figures = TRUE, figures are also saved locally in a png format.
## Not run: database_id <- c("GBIF_01", "GBIF_02", "GBIF_03", "FISH_04", "FISH_05") lat <- c(-19.93580, -13.01667, -22.34161, -6.75000, -15.15806) lon <- c(-40.60030, -39.60000, -49.61017, -35.63330, -39.52861) .scientificName_emptys <- c(TRUE, TRUE, TRUE, FALSE, FALSE) .coordinates_empty <- c(TRUE, TRUE, TRUE, TRUE, TRUE) .invalid_basis_of_records <- c(TRUE, FALSE, TRUE, FALSE, TRUE) .summary <- c(TRUE, FALSE, TRUE, FALSE, FALSE) x <- data.frame( database_id, lat, lon, .scientificName_emptys, .coordinates_empty, .invalid_basis_of_records, .summary ) figures <- bdc_create_figures( data = x, database_id = "database_id", workflow_step = "prefilter", save_figures = FALSE ) ## End(Not run)
## Not run: database_id <- c("GBIF_01", "GBIF_02", "GBIF_03", "FISH_04", "FISH_05") lat <- c(-19.93580, -13.01667, -22.34161, -6.75000, -15.15806) lon <- c(-40.60030, -39.60000, -49.61017, -35.63330, -39.52861) .scientificName_emptys <- c(TRUE, TRUE, TRUE, FALSE, FALSE) .coordinates_empty <- c(TRUE, TRUE, TRUE, TRUE, TRUE) .invalid_basis_of_records <- c(TRUE, FALSE, TRUE, FALSE, TRUE) .summary <- c(TRUE, FALSE, TRUE, FALSE, FALSE) x <- data.frame( database_id, lat, lon, .scientificName_emptys, .coordinates_empty, .invalid_basis_of_records, .summary ) figures <- bdc_create_figures( data = x, database_id = "database_id", workflow_step = "prefilter", save_figures = FALSE ) ## End(Not run)
Create a report summarizing the results of data quality tests
bdc_create_report( data, database_id = "database_id", workflow_step, save_report = FALSE )
bdc_create_report( data, database_id = "database_id", workflow_step, save_report = FALSE )
data |
data.frame. Containing a unique identifier for each record and the results of data quality tests. |
database_id |
character string. The column name with a unique record identifier.Default = "database_id". |
workflow_step |
character string containing the following options("prefilter", "taxonomy", "space" or "time"). |
save_report |
logical. Should the report be saved for further inspection? Default = FALSE. |
A data.frame containing a report summarizing the results of data quality assessment.
## Not run: database_id <- c("test_1", "test_2", "test_3", "test_4", "test_5") .missing_names <- c(TRUE, TRUE, TRUE, FALSE, FALSE) .missing_coordinates <- c(TRUE, FALSE, FALSE, TRUE, FALSE) .basisOfRecords_notStandard <- c(TRUE, TRUE, FALSE, TRUE, TRUE) .summary <- c(TRUE, FALSE, FALSE, FALSE, FALSE) x <- data.frame( database_id, .missing_names, .missing_coordinates, .basisOfRecords_notStandard, .summary ) report <- bdc_create_report( data = x, database_id = "database_id", workflow_step = "prefilter", save_report = FALSE ) ## End(Not run)
## Not run: database_id <- c("test_1", "test_2", "test_3", "test_4", "test_5") .missing_names <- c(TRUE, TRUE, TRUE, FALSE, FALSE) .missing_coordinates <- c(TRUE, FALSE, FALSE, TRUE, FALSE) .basisOfRecords_notStandard <- c(TRUE, TRUE, FALSE, TRUE, TRUE) .summary <- c(TRUE, FALSE, FALSE, FALSE, FALSE) x <- data.frame( database_id, .missing_names, .missing_coordinates, .basisOfRecords_notStandard, .summary ) report <- bdc_create_report( data = x, database_id = "database_id", workflow_step = "prefilter", save_report = FALSE ) ## End(Not run)
This function identifies records missing information on an event date (i.e., when a record was collected or observed).
bdc_eventDate_empty(data, eventDate = "eventDate")
bdc_eventDate_empty(data, eventDate = "eventDate")
data |
A data frame containing column with event date information. |
eventDate |
Numeric or date. The column with event date information. |
This test identifies records missing event date information (i.e., empty or not applicable NA).
A data.frame containing the column ".eventDate_empty". Compliant (TRUE) if 'eventDate' is not empty; otherwise "FALSE".
Other time:
bdc_year_from_eventDate()
,
bdc_year_outOfRange()
collection_date <- c( NA, "31/12/2015", "2013-06-13T00:00:00Z", "2013-06-20", "", "2013", "0001-01-00" ) x <- data.frame(collection_date) bdc_eventDate_empty(data = x, eventDate = "collection_date")
collection_date <- c( NA, "31/12/2015", "2013-06-13T00:00:00Z", "2013-06-20", "", "2013", "0001-01-00" ) x <- data.frame(collection_date) bdc_eventDate_empty(data = x, eventDate = "collection_date")
This function filters out columns containing the results of data quality tests (i.e., columns starting with '.') or other columns specified.
bdc_filter_out_flags(data, col_to_remove = "all")
bdc_filter_out_flags(data, col_to_remove = "all")
data |
data.frame. Containing columns to be removed. |
col_to_remove |
logical. Which columns should be removed? Default = "all", which means that all columns containing the results of data quality tests are removed. |
A data.frame without columns specified in 'col_to_remove'.
x <- data.frame( database_id = c("test_1", "test_2", "test_3", "test_4", "test_5"), kindom = c("Plantae", "Plantae", "Animalia", "Animalia", "Plantae"), .bdc_scientificName_empty = c(TRUE, TRUE, TRUE, FALSE, FALSE), .bdc_coordinates_empty = c(TRUE, FALSE, FALSE, FALSE, FALSE), .bdc_coordinates_outOfRange = c(TRUE, FALSE, FALSE, FALSE, FALSE), .summary = c(TRUE, FALSE, FALSE, FALSE, FALSE) ) bdc_filter_out_flags( data = x, col_to_remove = "all" )
x <- data.frame( database_id = c("test_1", "test_2", "test_3", "test_4", "test_5"), kindom = c("Plantae", "Plantae", "Animalia", "Animalia", "Plantae"), .bdc_scientificName_empty = c(TRUE, TRUE, TRUE, FALSE, FALSE), .bdc_coordinates_empty = c(TRUE, FALSE, FALSE, FALSE, FALSE), .bdc_coordinates_outOfRange = c(TRUE, FALSE, FALSE, FALSE, FALSE), .summary = c(TRUE, FALSE, FALSE, FALSE, FALSE) ) bdc_filter_out_flags( data = x, col_to_remove = "all" )
This function is useful for selecting records according to their taxonomic status. By default, only records with accepted scientific names are returned.
bdc_filter_out_names( data, col_name = "notes", taxonomic_status = "accepted", opposite = FALSE )
bdc_filter_out_names( data, col_name = "notes", taxonomic_status = "accepted", opposite = FALSE )
data |
data.frame. Containing the column "notes" with information on the taxonomic status of scientific names. |
col_name |
character string. The column name containing notes about the taxonomic status of a name. Default = "notes". |
taxonomic_status |
character string. Taxonomic status of a name. Default = "accepted". |
opposite |
logical. Should taxonomic status different from those listed in 'taxonomic_status' be returned? Default = FALSE |
By default, only records with accepted scientific names are kept in
the database. Such records are listed in the column 'taxonomic_status' as
"accepted", "accepted | replaceSynonym", "accepted | wasMisspelled" or
"accepted | wasMisspelled | replaceSynonym". It is also possible to
customize the list of taxonomic notes to be kept in the argument
'taxonomic_status'. See 'notes' in the data.frame resulted from the function
bdc_create_report
. If 'opposite' is TRUE, records with notes
different from names listed in 'taxonomic_status' are returned.
A data.frame filtered out according to names listed in 'taxonomic_status'.
Other taxonomy:
bdc_clean_names()
,
bdc_query_names_taxadb()
df_notes <- data.frame( notes = c( "notFound", "accepted", "accepted | replaceSynonym", "accepted | wasMisspelled", "accepted | wasMisspelled | replaceSynonym", "multipleAccepted", "heterotypic synonym" ) ) bdc_filter_out_names( data = df_notes, taxonomic_status = "accepted", col_name = "notes", opposite = FALSE )
df_notes <- data.frame( notes = c( "notFound", "accepted", "accepted | replaceSynonym", "accepted | wasMisspelled", "accepted | wasMisspelled | replaceSynonym", "multipleAccepted", "heterotypic synonym" ) ) bdc_filter_out_names( data = df_notes, taxonomic_status = "accepted", col_name = "notes", opposite = FALSE )
Harmonizing taxon names against local stored taxonomic databases
bdc_query_names_taxadb( sci_name, replace_synonyms = TRUE, suggest_names = TRUE, suggestion_distance = 0.9, db = "gbif", rank_name = NULL, rank = NULL, parallel = FALSE, ncores = 2, export_accepted = FALSE )
bdc_query_names_taxadb( sci_name, replace_synonyms = TRUE, suggest_names = TRUE, suggestion_distance = 0.9, db = "gbif", rank_name = NULL, rank = NULL, parallel = FALSE, ncores = 2, export_accepted = FALSE )
sci_name |
character string. Containing scientific names to be queried. |
replace_synonyms |
logical. Should synonyms be replaced by accepted names? Default = TRUE. |
suggest_names |
logical. Tries to find potential candidate names for misspelled names not resolved by an exact match. Default = TRUE. |
suggestion_distance |
numeric. A threshold value determining the acceptable orthographical distance between searched and candidate names. Names with matching distance value lower threshold informed are returned as NA. Default = 0.9. |
db |
character string. The name of the taxonomic database to be used in harmonizing taxon names. Default = "gbif". Use "all" to install all available taxonomic databases automatically. |
rank_name |
character string. Taxonomic rank name (e.g. "Plantae", "Animalia", "Aves", "Carnivora". Default = NULL. |
rank |
character string. A taxonomic rank used to filter the taxonomic database. Options available are: "kingdom", "phylum", "class", "order", "family", and "genus". |
parallel |
logical. Should a parallelization process be used? Default=FALSE |
ncores |
numeric. The number of cores to run in parallel. |
export_accepted |
logical. Should a table containing records with names linked to multiple accepted names saved for further inspection. Default = FALSE. |
The taxonomic harmonization is based upon one taxonomic authority database. The lastest version of each database is used to perform queries, but note that only older versions are available for some taxonomic databases. The database version is shown in parenthesis. Note that some databases are momentary unavailable in taxadb.
itis: Integrated Taxonomic Information System (v. 2022)
ncbi: National Center for Biotechnology Information (v. 2022)
col: Catalogue of Life (v. 2022)
tpl: The Plant List (v. 2019)
gbif: Global Biodiversity Information Facility (v. 2022)
fb: FishBase (v. 2019)
slb: SeaLifeBase (unavailable)
wd: Wikidata (unavailable)
ott: OpenTree Taxonomy (v. 2021)
iucn: International Union for Conservation of Nature (v. 2019)
The bdc_query_names_taxadb processes as this:
Creation of a local taxonomic database
This is a one-time setup used to download, extract, and import the taxonomic databases specified in the argument "db". The downloading process may take a few minutes depending on your connection and database size. By default, the "gbif" database following a Darwin Core schema is installed. (see ?taxadb::td_create for details).
Taxonomic harmonization
The taxonomic harmonization is divided into two distinct phases according to the matching type to be undertaken.
Exact matching
Firstly, the algorithm attempts to find an exact matching for each original scientific name supplied using the function "filter_name" from taxadb package. If an exact matching cannot be found, names are returned as Not Available (NA). Also, it is possible that a scientific name match multiple accepted names. In such cases, the "bdc_clean_duplicates" function is used to flag and remove names with multiple accepted names.
Information on higher taxa (e.g., kingdom or phylum) can be used to disambiguate names linked to multiple accepted names. For example, the genus "Casearia" is present in both Animalia and Plantae kingdoms. When handling names of Plantae, it would be helpful to get rid of names belonging to the Animalia to avoid flagging "Caseria" as having multiple accepted names. Following Norman et al. (2020), such cases are left to be fixed by the user. If "export_accepted" = TRUE a database containing a list of all records with names linked to multiple accepted names is saved in the "Output" folder.
Fuzzy matching
Fuzzy matching will be applied when "suggest_names" is TRUE and only for names not resolved by an exact match. In such cases, a fuzzy matching algorithm processes name-matching queries to find a potential matching candidate from the specified taxonomic database. Fuzzy matching identifies probable names (here identified as suggested names) for original names via a measure of orthographic similarity (i.e., distance). Orthographic distance is calculated by optimal string alignment (restricted Damerau-Levenshtein distance) that counts the number of deletions, insertions, substitutions, and adjacent characters' transpositions. It ranges from 0 to 1, being 1 an indicative of a perfect match. A threshold distance, i.e. the lower value of match acceptable, can be informed by user (in the "suggest_distance" argument). If the distance of a candidate name is equal or higher than the distance informed by user, the candidate name is returned as suggested name. Otherwise, names are returned as NA.
To increase the probability of finding a potential match candidate and to save time, two steps are taken before conducting fuzzy matching. First, if supplied, information on higher taxon (e.g., kingdom, family) is used to filter the taxonomic database. This step removes matching ambiguity by avoiding matching names from unrelated taxonomic ranks (e.g., match a plant species against a taxonomic database containing animal names) and decreases the number of names in the taxonomic database used to calculate the matching distance. Then, the taxonomic database is filtered according to a set of firsts letters of all input names. This process reduces the number of names in the taxonomic database to which each original name should be compared When a same suggested name is returned for different input names, a warning is returned asking users to check whether the suggested name is valid.
Report
The name harmonization processes' quality can be accessed in the column "notes" placed in the table resulting from the name harmonization process. The column "notes" contains assertions on the name harmonization process based on Carvalho (2017). The notes can be grouped in two categories: accepted names and those with a taxonomic issue or warning, needing further inspections. Accepted names can be returned as "accepted" (valid accepted name), "replaceSynonym" (a synonym replaced by an accepted name), "wasMisspelled" (original name was misspelled), "wasMisspelled | replaceSynonym" (misspelled synonym replaced by an accepted name), and "synonym" (original names is a synonym without accepted names in the database). Similarly, the following notes are used to flag taxonomic issues: "notFound" (no matching name found), "multipleAccepted" (name with multiple accepted names), "noAcceptedName" (no accepted name found), and ambiguous synonyms such as "heterotypic synonym", "homotypic synonym", and "pro-parte synonym". Ambiguous synonyms, names that have been published more than once describing different species, have more than one accepted name and cannot be resolved. Such cases are flagged and left to be determined by the user.
This function returns data.frame containing the results of the taxonomic harmonization process. The database is returned in the same order of sci_name.
Other taxonomy:
bdc_clean_names()
,
bdc_filter_out_names()
if (interactive()) { sci_name <- c( "Polystachya estrellensis", "Tachigali rubiginosa", "Oxalis rhombeo ovata", "Axonopus canescens", "Prosopis", "Haematococcus salinus", "Monas pulvisculus", "Cryptomonas lenticulari", "Poincianella pyramidalis", "Hymenophyllum polyanthos" ) names_harmonization <- bdc_query_names_taxadb( sci_name, replace_synonyms = TRUE, suggest_names = TRUE, suggestion_distance = 0.9, db = "gbif", parallel = TRUE, ncores = 2, export_accepted = FALSE ) }
if (interactive()) { sci_name <- c( "Polystachya estrellensis", "Tachigali rubiginosa", "Oxalis rhombeo ovata", "Axonopus canescens", "Prosopis", "Haematococcus salinus", "Monas pulvisculus", "Cryptomonas lenticulari", "Poincianella pyramidalis", "Hymenophyllum polyanthos" ) names_harmonization <- bdc_query_names_taxadb( sci_name, replace_synonyms = TRUE, suggest_names = TRUE, suggestion_distance = 0.9, db = "gbif", parallel = TRUE, ncores = 2, export_accepted = FALSE ) }
Creates a map of points using ggplot2 useful for inspecting the results of tests implemented in the bdc package.
bdc_quickmap( data, lat = "decimalLatitude", lon = "decimalLongitude", col_to_map = "red", size = 1 )
bdc_quickmap( data, lat = "decimalLatitude", lon = "decimalLongitude", col_to_map = "red", size = 1 )
data |
data.frame. Containing geographical coordinates. Coordinates must be expressed in decimal degree and in WGS84. |
lat |
character string. The column name with latitude. Coordinates must be expressed in decimal degree and in WGS84. Default = "decimalLatitude". |
lon |
character string. The column with longitude. Coordinates must be expressed in decimal degree and in WGS84. Default = "decimalLongitude". |
col_to_map |
character string. Defining the column or color used to map. It can be a color name (e.g., "red") or the name of a column of data. Default = "blue" |
size |
numeric. The size of the points. |
Only records with valid coordinates can be plotted. Records missing or containing invalid coordinates are removed prior creating the map.
A map of points created using ggplot2.
decimalLatitude <- c(19.9358, -13.016667, -19.935800) decimalLongitude <- c(-40.6003, -39.6, -40.60030) .coordinates_out_country <- c(FALSE, TRUE, TRUE) x <- data.frame(decimalLatitude, decimalLongitude, .coordinates_out_country) bdc_quickmap( data = x, lat = "decimalLatitude", lon = "decimalLongitude", col_to_map = ".coordinates_out_country", size = 1 )
decimalLatitude <- c(19.9358, -13.016667, -19.935800) decimalLongitude <- c(-40.6003, -39.6, -40.60030) .coordinates_out_country <- c(FALSE, TRUE, TRUE) x <- data.frame(decimalLatitude, decimalLongitude, .coordinates_out_country) bdc_quickmap( data = x, lat = "decimalLatitude", lon = "decimalLongitude", col_to_map = ".coordinates_out_country", size = 1 )
Flags records with empty or not interpretable scientific names.
bdc_scientificName_empty(data, sci_names = "scientificName")
bdc_scientificName_empty(data, sci_names = "scientificName")
data |
data.frame. Containing the species scientific names. |
sci_names |
character string. The column name with the species scientific name. Default = "scientificName". |
This test identifies records missing scientific names (i.e., empty or not applicable NA names)
A data.frame containing the column ".scientificName_empty". Compliant (TRUE) if 'sci_names' is not empty; otherwise "FALSE".
Other prefilter:
bdc_basisOfRecords_notStandard()
,
bdc_coordinates_country_inconsistent()
,
bdc_coordinates_empty()
,
bdc_coordinates_from_locality()
,
bdc_coordinates_outOfRange()
,
bdc_coordinates_transposed()
,
bdc_country_standardized()
x <- data.frame(scientificName = c("Ocotea odorifera", NA, "Panthera onca", "")) bdc_scientificName_empty(data = x, sci_names = "scientificName")
x <- data.frame(scientificName = c("Ocotea odorifera", NA, "Panthera onca", "")) bdc_scientificName_empty(data = x, sci_names = "scientificName")
This function's main goal is to merge and standardize different datasets into a new dataset with column names following the Darwin Core terminology. All the process is based on a metadata file provided by the user.
bdc_standardize_datasets( metadata, format = "csv", overwrite = FALSE, save_database = FALSE )
bdc_standardize_datasets( metadata, format = "csv", overwrite = FALSE, save_database = FALSE )
metadata |
A data frame with metadata containing information about the name, path, and columns of the original data set which need to be renamed. See @details. |
format |
a character setting the output file type. Option available are "csv" and "qs" (recommenced to save large datasets). Default == "csv". |
overwrite |
A logical vector indicating whether the final merged dataset should be overwritten. The default is FALSE. |
save_database |
logical. Should the standardized database be locally saved? Default = FALSE. |
bdc_standardize_datasets()
facilitate the standardization of datasets with
different column names by converting them into a new dataset following the
Darwin Core terminology. The standardization process relies on a metadata
file containing the name, path, and columns that need to be renamed. The
metadata file can be constructed using built-in functions (e.g.,
data.frame()
) or storing the information in a CSV file and importing it
into R. Regardless of the method chosen, the data frame with metadata needs
to contain the following column names (this is a list of required column
names; for a comprehensive list of column names following Darwin Core
terminology, see
here
datasetName
: A short name identifying the dataset (e.g., GBIF)
fileName
: The relative path containing the name of the input dataset
(e.g., Input_files/GBIF.csv)
scientificName
: Name of the column in the original database presenting
the taxon scientific names with or without authorship information, depending
on the format of the source dataset (e.g., Myrcia acuminata)
decimalLatitude
: Name of the column in the original database presenting
the geographic latitude in decimal degrees (e.g., -6.370833)
decimalLongitude
: Name of the column in the original database presenting
the geographic longitude in decimal degrees (e.g., -3.25500)
A merged data.frame with column names following Darwin Core terminology.
## Not run: metadata <- readr::read_csv(system.file("extdata/Config/DatabaseInfo.csv", package = "bdc")) db_standardized <- bdc_standardize_datasets( metadata = metadata, format = "csv", overwrite = TRUE, save_database = FALSE) ## End(Not run)
## Not run: metadata <- readr::read_csv(system.file("extdata/Config/DatabaseInfo.csv", package = "bdc")) db_standardized <- bdc_standardize_datasets( metadata = metadata, format = "csv", overwrite = TRUE, save_database = FALSE) ## End(Not run)
This function creates or updates the column ".summary" summarizing the results of data quality tests (i.e., columns starting with "."). Records that have failed in at least one test are flagged for further inspection (i.e., flagged as "FALSE") in the ".summary" column.
bdc_summary_col(data)
bdc_summary_col(data)
data |
data.frame. Containing the results of data quality tests (i.e., columns starting with "."). |
If existing, the column ".summary" will be removed and then updated considering all test names available in the supplied database.
A data.frame containing a new or an updated column ".summary".
.missing_names <- c(TRUE, TRUE, TRUE, FALSE, FALSE) .missing_coordinates <- c(TRUE, FALSE, FALSE, TRUE, FALSE) x <- data.frame(.missing_names, .missing_coordinates) bdc_summary_col(data = x)
.missing_names <- c(TRUE, TRUE, TRUE, FALSE, FALSE) .missing_coordinates <- c(TRUE, FALSE, FALSE, TRUE, FALSE) x <- data.frame(.missing_names, .missing_coordinates) bdc_summary_col(data = x)
This function extracts a four-digit year from unambiguously interpretable collecting dates.
bdc_year_from_eventDate(data, eventDate = "eventDate")
bdc_year_from_eventDate(data, eventDate = "eventDate")
data |
A data frame containing a column with event date information. |
eventDate |
Numeric or date. The column with event date information. |
A data.frame containing the column "year". Year information is returned only if "eventDate" can be unambiguously interpretable from "eventDate". Years in the future (e.g., 2050) are returned as NA as well as years before 1600, which is the lower limit for collecting dates of biological specimens.
Other time:
bdc_eventDate_empty()
,
bdc_year_outOfRange()
collection_date <- c( NA, "31/12/2015", "2013-06-13T00:00:00Z", "2019-05-20", "", "2013", "0001-01-00", "20", "1200" ) x <- data.frame(collection_date) bdc_year_from_eventDate(data = x, eventDate = "collection_date")
collection_date <- c( NA, "31/12/2015", "2013-06-13T00:00:00Z", "2019-05-20", "", "2013", "0001-01-00", "20", "1200" ) x <- data.frame(collection_date) bdc_year_from_eventDate(data = x, eventDate = "collection_date")
This function identifies records out-of-range collecting year (e.g., in the future) or old records collected before a year informed in 'year_threshold'.
bdc_year_outOfRange(data, eventDate, year_threshold = 1900)
bdc_year_outOfRange(data, eventDate, year_threshold = 1900)
data |
A data frame containing a column with event date information. |
eventDate |
numeric or date. The column containing event date information. |
year_threshold |
numeric. A four-digit year threshold used to flag old (potentially invalid) records. Default = 1900 |
Following the "VALIDATION:YEAR_OUTOFRANGE" Biodiversity data quality group, the results of this test are time-dependent. While the user may provide a lower limit to the year, the upper limit is defined based on the year when the test is run. Lower limits can be used to flag old, often imprecise, records. For example, records collected before GPS advent (1980). If 'year_threshold' is not provided, the lower limit to the year is by default 1600, a lower limit for collecting dates of biological specimens. Records with empty or NA 'eventDate' are not tested and returned as NA.
A data.frame containing the column ".year_outOfRange". Compliant (TRUE) if 'eventDate' is not out-of-range; otherwise "FALSE".
Other time:
bdc_eventDate_empty()
,
bdc_year_from_eventDate()
collection_date <- c( NA, "31/12/2029", "2013-06-13T00:00:00Z", "2013-06-20", "", "2013", 1650, "0001-01-00" ) x <- data.frame(collection_date) bdc_year_outOfRange( data = x, eventDate = "collection_date", year_threshold = 1900)
collection_date <- c( NA, "31/12/2029", "2013-06-13T00:00:00Z", "2013-06-20", "", "2013", 1650, "0001-01-00" ) x <- data.frame(collection_date) bdc_year_outOfRange( data = x, eventDate = "collection_date", year_threshold = 1900)