--- title: "Space" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Space} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%", echo = TRUE, warning = FALSE, eval = T ) ``` ```{r, echo = FALSE, eval = TRUE, include = FALSE, messages = FALSE} library(bdc) ``` #### **Introduction** To avoid reinventing the wheel, the proposed data-cleaning workflow uses functions for other existing packages. In this module, we used one test of the *bdc* and other tests of the *R* package [CoordinateCleaner](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13152) **to flag potentially erroneous, suspect, or imprecise geographical coordinates** based on geographic gazetteers and metadata. It includes a series of tests for identifying records assigned to capitals, provinces, and country centroids, coordinates in urban areas, around biodiversity institutions, or GBIF headquarters. It also contains tests to flag coordinates below a determined precision (e.g., 100 km), zero or equal coordinates, and duplicated records (i.e., equal taxa name and coordinates). Note that we do not use the "seas" test to remove records in the ocean because such records we previously removed in the pre-filter module of the package (more details [here](https://brunobrr.github.io/bdc/articles/prefilter.html)). ⚠️**IMPORTANT:** The results of the VALIDATION test used to flag data quality are appended in separate fields in this database and retrieved as TRUE (✅ ok) or FALSE (❌check carefully). #### **Installation** Check [**here**](https://brunobrr.github.io/bdc/#installation) how to install the bdc package. #### Reading the database Reading the database created in the [**taxonomy**](https://brunobrr.github.io/bdc/articles/taxonomy.html) module the *bdc* package. It is also possible to read any datasets containing the **required** fields to run the function (more details [here](https://brunobrr.github.io/bdc/articles/integrate_datasets.html)). ```{r echo=TRUE, eval=FALSE} database <- readr::read_csv(here::here("Output/Intermediate/02_taxonomy_database.csv")) ``` ```{r echo=FALSE, eval=TRUE} database <- readr::read_csv(system.file("extdata/outpus_vignettes/02_taxonomy_database.csv", package = "bdc"), show_col_types = FALSE) ``` #### **Flagging common spatial issues** This function identifyes records with a coordinate precision below a specified number of decimal places. For example, the precision of a coordinate with 1 decimal place is 11.132 km at the equator, i.e., the scale of a large city. ```{r eval=FALSE} check_space <- bdc_coordinates_precision( data = database, lon = "decimalLongitude", lat = "decimalLatitude", ndec = c(0, 1) # number of decimals to be tested ) #> bdc_coordinates_precision: #> Flagged 2 records #> One column was added to the database. ``` Next, we will flag common spatial issues using functions of the package [CoordinateCleaner](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13152). ```{r eval=FALSE} check_space <- CoordinateCleaner::clean_coordinates( x = check_space, lon = "decimalLongitude", lat = "decimalLatitude", species = "scientificName", countries = , tests = c( "capitals", # records within 2km around country and province centroids "centroids", # records within 1km of capitals centroids "duplicates", # duplicated records "equal", # records with equal coordinates "gbif", # records within 1 degree (~111km) of GBIF headsquare "institutions", # records within 100m of zoo and herbaria "outliers", # outliers "zeros", # records with coordinates 0,0 "urban" # records within urban areas ), capitals_rad = 2000, centroids_rad = 1000, centroids_detail = "both", # test both country and province centroids inst_rad = 100, # remove zoo and herbaria within 100m outliers_method = "quantile", outliers_mtp = 5, outliers_td = 1000, outliers_size = 10, range_rad = 0, zeros_rad = 0.5, capitals_ref = NULL, centroids_ref = NULL, country_ref = NULL, country_refcol = "countryCode", inst_ref = NULL, range_ref = NULL, # seas_ref = continent_border, # seas_scale = 110, urban_ref = NULL, value = "spatialvalid" # result of tests are appended in separate columns ) #> Testing coordinate validity #> Flagged 0 records. #> Testing equal lat/lon #> Flagged 0 records. #> Testing zero coordinates #> Flagged 0 records. #> Testing country capitals #> Flagged 0 records. #> Testing country centroids #> Flagged 0 records. #> Testing urban areas #> Downloading urban areas via rnaturalearth #> trying URL 'http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/cultural/ne_50m_urban_areas.zip' #> Content type 'application/zip' length 452644 bytes (442 KB) #> downloaded 442 KB #> #> OGR data source with driver: ESRI Shapefile #> Source: "C:\Users\Bruno R. Ribeiro\AppData\Local\Temp\RtmpCo3Lyf", layer: "ne_50m_urban_areas" #> with 2143 features #> It has 4 fields #> Integer64 fields read as strings: scalerank #> Flagged 6 records. #> Testing geographic outliers #> Testing GBIF headquarters, flagging records around Copenhagen #> Flagged 0 records. #> Testing biodiversity institutions #> Flagged 0 records. #> Testing duplicates #> Flagged 0 records. #> Flagged 6 of 115 records, EQ = 0.05. ``` Here we update the column named **.summary** summing up the results of all tests. This column is **FALSE** if any test was flagged as "FALSE" (❌check carefully; potentially invalid or suspect record). ```{r eval=FALSE} check_space <- bdc_summary_col(data = check_space) #> Column '.summary' already exist. It will be updated #> #> bdc_summary_col: #> Flagged 9 records. #> One column was added to the database. ``` ```{r echo=FALSE, eval=TRUE} check_space <- readr::read_csv(system.file("extdata/outpus_vignettes/03_space_database.csv", package = "bdc"), show_col_types = FALSE) ``` ```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE} DT::datatable( check_space[1:15,], class = 'stripe', extensions = 'FixedColumns', rownames = FALSE, options = list( pageLength = 5, dom = 'Bfrtip', scrollX = TRUE, fixedColumns = list(leftColumns = 2) ) ) ``` #### **Mapping spatial errors** It is possible to map a column containing the results of **one spatial test each time**, for example records in country centroids (column ".cen"). Besides, we can use the column **".summary"** to map all records flagged as potentially problematic (i.e., FALSE). ```{r eval=FALSE} check_space %>% dplyr::filter(.summary == FALSE) %>% # map only records flagged as FALSE bdc_quickmap( data = ., lon = "decimalLongitude", lat = "decimalLatitude", col_to_map = ".summary", size = 0.9 ) ``` ![Records flagged as potentially problematic](https://raw.githubusercontent.com/brunobrr/bdc/master/vignettes/images/map_summary_space_vignette.png) #### **Report** Creating a report summarizing the results of all tests. The report can be automatically saved if `save_report = TRUE.` ```{r eval=FALSE} report <- bdc_create_report(data = check_space, database_id = "database_id", workflow_step = "space", save_report = FALSE) report ``` ```{r echo=FALSE, eval=TRUE} report <- readr::read_csv( system.file("extdata/outpus_vignettes/03_Report_space.csv", package = "bdc"), show_col_types = FALSE ) ``` ```{r echo=FALSE, message=FALSE, warning=FALSE, eval=TRUE} DT::datatable( report, class = 'stripe', extensions = 'FixedColumns', rownames = FALSE, options = list( # pageLength = 5, dom = 'Bfrtip', scrollX = TRUE, fixedColumns = list(leftColumns = 2) ) ) ``` #### **Figures** Here we create figures (bar plots, maps, and histograms) to make the interpretation of the results of data quality tests easier. Figures can be automatically saved if `save_figures = TRUE.` ```{r, eval=FALSE} figures <- bdc_create_figures(data = check_space, database_id = "database_id", workflow_step = "space", save_figures = TRUE) # Check figures using figures$.rou ``` ![Rounded coordinates](https://raw.githubusercontent.com/brunobrr/bdc/master/vignettes/images/space_.rou_BAR.png) ![Records within urban areas](https://raw.githubusercontent.com/brunobrr/bdc/master/vignettes/images/space_.urb_MAP.png) ![Summary of all tests](https://raw.githubusercontent.com/brunobrr/bdc/master/vignettes/images/space_summary_all_tests_BAR.png) #### **Filtering the database** It is possible to remove flagged records (potentially problematic ones) to get a 'clean' database (i.e., without test columns starting with "."). However, to ensure that all records will be evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal module of the package), potentially erroneous or suspect records will be removed in the final module of the package. ```{r} # output <- # check_space %>% # dplyr::filter(.summary == TRUE) %>% # bdc_filter_out_flags(data = ., col_to_remove = "all") ``` #### **Saving the database** You can use [qs::qsave()]{.underline} instead of write_csv to save a large database in a compressed format. ```{r eval=FALSE} # use qs::qsave() to save the database in a compressed format and then qs:qread() to load the database check_space %>% readr::write_csv(., here::here("Output", "Intermediate", "03_space_database.csv")) ```