| Title: | Prepare, Manipulate and Check Data to Comply with Darwin Core Standard |
|---|---|
| Description: | Helps users standardise data to the Darwin Core Standard, a global data standard to store, document, and share biodiversity data like species occurrence records. The package provides tools to manipulate data to conform with, and check validity against, the Darwin Core Standard. Using 'corella' allows users to verify that their data can be used to build 'Darwin Core Archives' using the 'galaxias' package. |
| Authors: | Dax Kellie [aut, cre], Shandiya Balasubramanium [aut], Martin Westgate [aut] |
| Maintainer: | Dax Kellie <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.4 |
| Built: | 2026-05-15 09:32:02 UTC |
| Source: | https://github.com/AtlasOfLivingAustralia/corella |
When creating a Darwin Core Archive, several fields have a vocabulary of acceptable values. These functions provide a vector of terms that can be used to fill or validate those fields.
    basisOfRecord_values()
    countryCode_values()
A vector of accepted values for that use case.
occurrence_terms() or event_terms() for valid Darwin Core
terms (i.e. column names).
    # See all valid basis of record values
    basisOfRecord_values()
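As a rough illustration of how such a vocabulary vector might be used for validation, the base R sketch below checks a column against a list of accepted values. The `accepted` vector is a shortened, hypothetical stand-in; in practice it would come from basisOfRecord_values().

```r
# Hypothetical, shortened stand-in for the vector returned by basisOfRecord_values()
accepted <- c("humanObservation", "machineObservation", "preservedSpecimen")

# Example column to validate (illustrative values only)
basis <- c("humanObservation", "preservedSpecimen", "photo")

# Flag entries that are not part of the accepted vocabulary
is_valid <- basis %in% accepted
invalid_values <- unique(basis[!is_valid])

invalid_values
```

The same pattern should apply to countryCode_values() for the countryCode field.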
Run a test suite of checks to test whether a data.frame or tibble
conforms to Darwin Core Standard.
While most users will only want to call suggest_workflow(),
the underlying check functions are exported for detailed work, or for
debugging. This function is useful for users experienced with
Darwin Core Standard or for final dataset checks.
    check_dataset(.df)
.df |
A tibble against which checks should be run |
check_dataset() is modelled after devtools::test(). It runs a
series of checks, then supplies a summary of passed/failed checks and
error messages.
Checks run by check_dataset() are the same as those run automatically by
the various set_ functions in a piped workflow. This function allows users
who expect only minor updates to check their entire dataset without the
need for set_ functions.
Invisibly returns the input data frame, but primarily called for the side-effect of running check functions on that input.
    df <- tibble::tibble(
      scientificName = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
      status = c("present", "present", "present")
    )

    # Run a test suite of checks for Darwin Core Standard conformance
    # Checks are only run on columns with names that match Darwin Core terms
    df |>
      check_dataset()
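The "run every check, then summarise" pattern described above can be sketched in base R as follows. This is only an illustration of the idea: the two checks and the `run_checks()` helper are hypothetical and are not corella's own check functions.

```r
# Illustrative checks; these are assumptions, not corella's own check functions
checks <- list(
  eventDate_is_date = function(df) {
    if (!inherits(df$eventDate, c("Date", "POSIXct"))) stop("eventDate is not a Date")
  },
  scientificName_is_character = function(df) {
    if (!is.character(df$scientificName)) stop("scientificName is not character")
  }
)

run_checks <- function(df, checks) {
  # Run each check, capturing the error message instead of stopping
  messages <- vapply(checks, function(check) {
    tryCatch({
      check(df)
      NA_character_
    }, error = function(e) conditionMessage(e))
  }, character(1))

  # One row per check: whether it passed, and the error message if it did not
  data.frame(
    check = names(checks),
    passed = unname(is.na(messages)),
    message = unname(messages),
    row.names = NULL
  )
}

df <- data.frame(
  scientificName = c("Crinia signifera", "Litoria peronii"),
  eventDate = c("2010-10-14", "2010-10-14")  # character, so the date check fails
)

results <- run_checks(df, checks)
results
```

Collecting messages rather than stopping at the first failure means every problem is reported in a single pass.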
A unique identifier is a pattern of words, letters and/or numbers that is unique to a single record within a dataset. Unique identifiers are useful because they identify individual observations, and make it possible to change, amend or delete observations over time. They also prevent accidental deletion when more than one record contains the same information (and would otherwise be considered a duplicate).
The identifier functions in corella make it easier to
generate columns with unique identifiers in a dataset. These functions can
be used within set_events(), set_occurrences(), or (equivalently)
dplyr::mutate().
    composite_id(..., sep = "-")
    sequential_id(width)
    random_id()
... |
Zero or more variable names from the tibble being
mutated (unquoted), and/or zero or more |
sep |
Character used to separate field values. Defaults to "-". |
width |
(Integer) how many characters should the resulting string be? Defaults to one plus the order of magnitude of the largest number. |
Generally speaking, it is better to use existing
information from a dataset to generate identifiers. For this reason we
recommend using composite_id() to aggregate existing fields, if no
such composite is already present within the dataset. Composite IDs are
more meaningful and stable; they are easier to check and harder to overwrite.
It is possible to call
sequential_id() or random_id() within
composite_id() to combine existing and new columns.
An amended tibble containing a column with identifiers in the
requested format.
    df <- tibble::tibble(
      eventDate = paste0(rep(c(2020:2024), 3), "-01-01"),
      basisOfRecord = "humanObservation",
      site = rep(c("A01", "A02", "A03"), each = 5)
    )

    # Add composite ID using a random ID, site name and eventDate
    df |>
      set_occurrences(
        occurrenceID = composite_id(random_id(), site, eventDate)
      )

    # Add composite ID using a sequential number, site name and eventDate
    df |>
      set_occurrences(
        occurrenceID = composite_id(sequential_id(), site, eventDate)
      )
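To make the idea of a composite identifier concrete, here is a minimal base R sketch that builds one by hand. It is not how composite_id() or sequential_id() are implemented; in particular, the width calculation is only one reading of "one plus the order of magnitude of the largest number", and the exact output of sequential_id() may differ.

```r
df <- data.frame(
  eventDate = paste0(rep(2020:2024, 3), "-01-01"),
  site = rep(c("A01", "A02", "A03"), each = 5)
)

n <- nrow(df)

# Zero-padded sequential number; width is one plus the order of magnitude of n
# (an assumption about how the default width is derived)
width <- floor(log10(n)) + 1
seq_part <- formatC(seq_len(n), width = width, flag = "0")

# Combine the sequential number with existing fields, separated by "-"
df$occurrenceID <- paste(seq_part, df$site, df$eventDate, sep = "-")

head(df$occurrenceID, 3)
```

Because the identifier embeds the site and date, a reader can often tell which record an ID refers to without looking it up.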
A tibble of ISO 3166-1 alpha-2 codes for countries, which are
the accepted standard for supplying countryCode in Darwin Core Standard.
    country_codes
A tibble containing valid country codes (249 rows x 3 columns).
Column descriptions are as follows:
ISO 3166-1 alpha-2 code, pointing to its ISO 3166-2 article.
English short name officially used by the ISO 3166 Maintenance Agency (ISO 3166/MA).
Year when alpha-2 code was first officially assigned.
set_locality() for assigning countryCode within a tibble;
countryCode_values() to return valid codes as a vector.
The Darwin Core Standard is maintained by Biodiversity Information Standards,
previously known as the Taxonomic Databases Working Group and known by the
acronym 'TDWG'. This tibble is the full list of supported terms,
current as of 2024-12-10.
Users can use occurrence_terms() and event_terms() as convenience
functions to access these terms.
    darwin_core_terms
A tibble containing valid Darwin Core Standard terms (206 rows x 6 columns).
Column descriptions are as follows:
TDWG group that a term belongs to.
Column header names that can be used in Darwin Core
Stable url to information describing the term.
Human-readable definition of the term.
Further information from TDWG.
Examples of how the field should be populated.
Function in corella that supports Darwin Core term.
Slightly modified version of a table supplied by TDWG.
occurrence_terms() and event_terms() to get terms for use in
dplyr::select()
When creating a Darwin Core archive, it is often useful to select only those
fields that conform to the standard. These functions provide a vector of
terms that can be used in combination with dplyr::select() and
dplyr::any_of() to quickly select Darwin Core terms for the relevant
data type (events, occurrences, media).
    occurrence_terms()
    event_terms()
A vector of accepted (but not mandatory) values for that use case.
basisOfRecord_values() or countryCode_values() for valid entries
within a field.
    # Return a vector of accepted terms in an Occurrence-based dataset
    occurrence_terms() |> head(10L) # first 10 terms

    # Use this vector to select matching columns from a data frame
    df <- tibble::tibble(
      name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
      measurement1 = c(24.3, 24.9, 20.1), # example measurement column
      measurement2 = c(0.92, 1.03, 1.09)  # example measurement column
    )

    df |>
      dplyr::select(dplyr::any_of(occurrence_terms()))
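Before selecting, it can help to see which columns do not yet match a Darwin Core term. The base R sketch below shows one way to do that; `terms` is a shortened, hypothetical stand-in for the vector returned by occurrence_terms().

```r
# Hypothetical, shortened stand-in for the vector returned by occurrence_terms()
terms <- c("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")

df <- data.frame(
  name = c("Crinia signifera", "Litoria peronii"),
  latitude = c(-35.27, -35.83),
  eventDate = c("2010-10-14", "2010-10-14"),
  measurement1 = c(24.3, 20.1)
)

# Columns that already use Darwin Core terms
matched <- intersect(names(df), terms)

# Columns that still need renaming or other handling
unmatched <- setdiff(names(df), terms)

unmatched
```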
In some field methods, it is common to observe more than one individual
per observation; to observe abundance using non-integer measures such as
mass or area; or to seek individuals but not find them (abundance of zero).
As these approaches use different Darwin Core terms, this function assists in
specifying abundances to a tibble using Darwin Core Standard.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for how columns with
abundance information are represented in the Darwin Core Standard.
    set_abundance(
      .df,
      individualCount = NULL,
      organismQuantity = NULL,
      organismQuantityType = NULL,
      .keep = "unused"
    )
.df |
A |
individualCount |
The number of individuals present |
organismQuantity |
A number or enumeration value for the quantity of
organisms. Used together with organismQuantityType. |
organismQuantityType |
The type of quantification system used for
organismQuantity. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Examples of organismQuantity & organismQuantityType values:
27 (organismQuantity) individuals (organismQuantityType)
12.5 (organismQuantity) % biomass (organismQuantityType)
r (organismQuantity) Braun-Blanquet Scale (organismQuantityType)
many (organismQuantity) individuals (organismQuantityType)
A tibble with the requested fields added/reformatted.
    df <- tibble::tibble(
      scientificName = c("Cacatua (Licmetis) tenuirostris",
                         "Cacatua (Licmetis) tenuirostris",
                         "Cacatua (Licmetis) tenuirostris"),
      n_obs = c(1, 3, 4)
    )

    df |>
      set_abundance(individualCount = n_obs)
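The distinction between the abundance fields can be illustrated with a couple of simple consistency checks in base R. The rules below are assumptions made for illustration (whole, non-negative counts; quantity and quantity type supplied together) and are not necessarily the checks that set_abundance() performs.

```r
df <- data.frame(
  individualCount = c(1, 3, 0),
  organismQuantity = c("12.5", "r", NA),
  organismQuantityType = c("% biomass", "Braun-Blanquet Scale", NA)
)

# individualCount should be a whole, non-negative number (zero when none were found)
count_ok <- !is.na(df$individualCount) &
  df$individualCount >= 0 &
  df$individualCount == round(df$individualCount)

# organismQuantity is only interpretable alongside organismQuantityType,
# so both should be present or both missing
pair_ok <- is.na(df$organismQuantity) == is.na(df$organismQuantityType)

data.frame(count_ok, pair_ok)
```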
Format fields that specify the collection or catalog number of a
specimen or occurrence record to a tibble using Darwin Core Standard.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
    set_collection(
      .df,
      datasetID = NULL,
      datasetName = NULL,
      catalogNumber = NULL,
      .keep = "unused"
    )
.df |
A |
datasetID |
An identifier for the set of data. May be a global unique identifier or an identifier specific to a collection or institution. |
datasetName |
The name identifying the data set from which the record was derived. |
catalogNumber |
A unique identifier for the record within the data set or collection. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Examples of datasetID values:
b15d4952-7d20-46f1-8a3e-556a512b04c5
Examples of datasetName values:
Grinnell Resurvey Mammals
Lacey Ctenomys Recaptures
Examples of catalogNumber values:
145732
145732a
2008.1334
R-4313
A tibble with the requested fields added/reformatted.
    df <- tibble::tibble(
      name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
      catalog_num = c("16789a", "16789c", "08742f"),
      dataset = c("Frog search", "Frog search", "Frog search")
    )

    # Reformat columns to Darwin Core terms
    df |>
      set_collection(
        catalogNumber = catalog_num,
        datasetName = dataset
      )
This function helps format standard location fields like
latitude and longitude point coordinates to a tibble using Darwin Core
Standard.
    set_coordinates(
      .df,
      decimalLatitude = NULL,
      decimalLongitude = NULL,
      geodeticDatum = NULL,
      coordinateUncertaintyInMeters = NULL,
      coordinatePrecision = NULL,
      .keep = "unused"
    )
.df |
A |
decimalLatitude |
The latitude in decimal degrees. |
decimalLongitude |
The longitude in decimal degrees. |
geodeticDatum |
The datum or spatial reference system that coordinates are recorded against (usually "WGS84" or "EPSG:4326"). This is often known as the Coordinate Reference System (CRS). If your coordinates are from a GPS system, your data are already using WGS84. |
coordinateUncertaintyInMeters |
(numeric) Radius of the smallest circle
that contains the whole location, given any possible measurement error.
|
coordinatePrecision |
(numeric) The precision that |
.keep |
Control which columns from |
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for how spatial columns are
represented in the Darwin Core Standard.
Example values are:
geodeticDatum should be a valid EPSG code
A tibble with the requested columns added/reformatted.
set_locality() for providing text-based spatial information.
    df <- tibble::tibble(
      scientificName = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14")
    )

    # Reformat columns to Darwin Core Standard terms
    df |>
      set_coordinates(
        decimalLongitude = longitude,
        decimalLatitude = latitude
      )
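As an example of the kind of problem an informative error might catch, the base R sketch below flags coordinates outside the valid range for decimal degrees (latitude between -90 and 90, longitude between -180 and 180). It is an illustration only and does not reproduce the checks run by set_coordinates().

```r
df <- data.frame(
  decimalLatitude = c(-35.27, -35.24, 95.00),   # last value is deliberately invalid
  decimalLongitude = c(149.33, 149.34, 149.34)
)

# Decimal degrees must fall within the bounds of the globe
lat_ok <- !is.na(df$decimalLatitude) & abs(df$decimalLatitude) <= 90
lon_ok <- !is.na(df$decimalLongitude) & abs(df$decimalLongitude) <= 180

# Show the rows that fail either test
df[!(lat_ok & lon_ok), ]
```

A common cause of failures like this is latitude and longitude columns being swapped.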
This function helps format standard location fields like longitude and
latitude point coordinates to a tibble using Darwin Core Standard.
It differs from set_coordinates() by accepting sf geometry columns of
class POINT as coordinates (rather than numeric lat/lon coordinates).
The advantage
of using an sf geometry is that the Coordinate Reference System (CRS) is
automatically formatted into the required geodeticDatum column.
    set_coordinates_sf(.df, geometry = NULL, .keep = "unused")
.df |
A |
geometry |
The latitude/longitude coordinates as |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
A tibble with the requested columns added/reformatted.
set_coordinates() for providing numeric coordinates,
set_locality() for providing text-based spatial information.
    df <- tibble::tibble(
      scientificName = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14")
    ) |>
      sf::st_as_sf(coords = c("longitude", "latitude")) |>
      sf::st_set_crs(4326)

    # Reformat columns to Darwin Core Standard terms.
    # Coordinates and CRS are automatically detected and reformatted.
    df |>
      set_coordinates_sf()
This function helps format standard date/time columns in a tibble using
Darwin Core Standard. Users should make use of the
lubridate package to
format their dates so corella can read them correctly.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for how date and time fields are
represented in the Darwin Core Standard.
    set_datetime(
      .df,
      eventDate = NULL,
      year = NULL,
      month = NULL,
      day = NULL,
      eventTime = NULL,
      .keep = "unused",
      .messages = TRUE
    )
.df |
A |
eventDate |
The date or date + time that the observation/event occurred. |
year |
The year of the observation/event. |
month |
The month of the observation/event. |
day |
The day of the observation/event. |
eventTime |
The time of the event. Use this term for Event data.
Date + time information for observations is accepted in |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
.messages |
(logical) Should informative messages be shown? Defaults to TRUE. |
Example values are:
eventDate should be class Date or POSIXct. We suggest using the
lubridate package to define your date format using functions like
ymd(), mdy(), dmy(), or if including date + time, ymd_hms(),
ymd_hm(), or ymd_h().
A tibble with the requested columns added/reformatted.
    df <- tibble::tibble(
      name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      date = c("2010-10-14", "2010-10-14", "2010-10-14"),
      time = c("10:08:12", "13:01:45", "14:02:33")
    )

    # Use the lubridate package to format date + time information
    # eventDate accepts date + time
    df |>
      set_datetime(
        eventDate = lubridate::ymd_hms(paste(date, time))
      )
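For readers who prefer not to depend on lubridate, a roughly equivalent base R sketch is shown below. It parses the same date and time strings into POSIXct and derives the year, month and day fields. The UTC time zone is an assumption made for the example, and this is not a description of what set_datetime() does internally.

```r
date <- c("2010-10-14", "2010-10-14", "2010-10-14")
time <- c("10:08:12", "13:01:45", "14:02:33")

# Parse date + time into POSIXct; the UTC time zone is an assumption
eventDate <- as.POSIXct(
  paste(date, time),
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)

# Derive the separate year, month and day fields from the parsed value
year <- as.integer(format(eventDate, "%Y"))
month <- as.integer(format(eventDate, "%m"))
day <- as.integer(format(eventDate, "%d"))

class(eventDate)
```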
Identify or format columns that contain information about an Event. An "Event" in Darwin Core Standard refers to an action that occurs at a place and time. Examples include:
A specimen collecting event
A survey or sampling event
A camera trap image capture
A marine trawl
A camera trap deployment event
A camera trap burst image event (with many images for one observation)
In practice this function is used no differently from mutate(), but gives
users some informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
    set_events(
      .df,
      eventID = NULL,
      eventType = NULL,
      parentEventID = NULL,
      .keep = "unused",
      .keep_composite = "all"
    )
.df |
A |
eventID |
A unique identifier for an individual Event. |
eventType |
The type of Event |
parentEventID |
The parent Event under which one or more Events sit. |
.keep |
Control which columns from |
.keep_composite |
Control which columns from |
Each Event requires a unique eventID and an eventType (because there can
be several types of Events in a single dataset), along with a
parentEventID which specifies the level under which the current Event sits
(e.g., a survey at an individual location has its own eventID, while the
full set of surveys carried out across several locations on that day is
the parent Event).
Examples of eventID values:
INBO:VIS:Ev:00009375
Examples of eventType values:
Sample
Observation
Survey
Site Visit
Deployment
See more examples on dwc.tdwg.org
Examples of parentEventID values:
A1 (To identify the parent event in nested samples, each with their own eventID - A1_1, A1_2)
A tibble with the requested fields added/reformatted.
    # example Event dataframe
    df <- tibble::tibble(
      site_code = c("AMA100", "AMA100", "AMH100"),
      scientificName = c("Crinia signifera", "Crinia signifera", "Crinia signifera"),
      latitude = c(-35.275, -35.274, -35.101),
      longitude = c(149.001, 149.004, 149.274)
    )

    # Add event information
    df |>
      set_events(
        eventID = composite_id(sequential_id(), site_code),
        eventType = "Survey"
      )
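The nested naming scheme from the parentEventID example (A1, with child events A1_1 and A1_2) can be generated with a short base R sketch. This is purely illustrative: the parent ID "B1" is a hypothetical addition, and corella does not require this naming scheme.

```r
df <- data.frame(
  parentEventID = c("A1", "A1", "B1"),
  site_code = c("AMA100", "AMA100", "AMH100")
)

# Number the events within each parent, then append that number to the parent ID
within_parent <- ave(seq_along(df$parentEventID), df$parentEventID, FUN = seq_along)
df$eventID <- paste(df$parentEventID, within_parent, sep = "_")

df$eventID
```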
Format fields that contain measurements or attributes of individual
organisms to a tibble using Darwin Core Standard. Fields include those
that specify sex, life stage or condition. Individuals can be identified by
an individualID if data contains resampling.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
    set_individual_traits(
      .df,
      individualID = NULL,
      lifeStage = NULL,
      sex = NULL,
      vitality = NULL,
      reproductiveCondition = NULL,
      .keep = "unused"
    )
.df |
A |
individualID |
An identifier for an individual or named group of individual organisms represented in the Occurrence. Meant to accommodate resampling of the same individual or group for monitoring purposes. May be a global unique identifier or an identifier specific to a data set. |
lifeStage |
The age class or life stage of an organism at the time of occurrence. |
sex |
The sex of the biological individual. |
vitality |
An indication of whether an organism was alive or dead at the time of collection or observation. |
reproductiveCondition |
The reproductive condition of the biological individual. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Examples of lifeStage values:
zygote
larva
adult
seedling
flowering
Examples of vitality values:
alive
dead
uncertain
Examples of reproductiveCondition values:
non-reproductive
pregnant
in bloom
fruit bearing
A tibble with the requested fields added/reformatted.
set_scientific_name() for adding scientificName and authorship information.
    df <- tibble::tibble(
      name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
      id = c(4421, 4422, 3311),
      life_stage = c("juvenile", "adult", "adult")
    )

    # Reformat columns to Darwin Core Standard
    df |>
      set_individual_traits(
        individualID = id,
        lifeStage = life_stage
      )
Format fields that contain information on permissions for use, sharing or
access to a record to a tibble using Darwin Core Standard.
In practice this function is no different from using mutate(), but gives
some informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
    set_license(
      .df,
      license = NULL,
      rightsHolder = NULL,
      accessRights = NULL,
      .keep = "unused"
    )
.df |
A |
license |
A legal document giving official permission to do something with the resource. Must be provided as a url to a valid license. |
rightsHolder |
Person or organisation owning or managing rights to resource. |
accessRights |
Access or restrictions based on privacy or security. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Examples of license values:
http://creativecommons.org/publicdomain/zero/1.0/legalcode
http://creativecommons.org/licenses/by/4.0/legalcode
CC0
CC-BY-NC 4.0 (Int)
Examples of rightsHolder values:
The Regents of the University of California
Examples of accessRights values:
not-for-profit use only (string example)
https://www.fieldmuseum.org/field-museum-natural-history-conditions-and-suggested-norms-use-collections-data-and-images (URI example)
A tibble with the requested fields added/reformatted.
set_observer() for adding observer information.
    df <- tibble::tibble(
      name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
      attributed_license = c("CC-BY-NC 4.0 (Int)", "CC-BY-NC 4.0 (Int)", "CC-BY-NC 4.0 (Int)")
    )

    # Reformat columns to Darwin Core Standard
    df |>
      set_license(
        license = attributed_license
      )
Locality information refers to a description of a place, rather than a
spatial coordinate. This function helps to format columns
with locality information to a tibble using Darwin Core Standard.
In practice this is used no differently from mutate(), but gives some
informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
    set_locality(
      .df,
      continent = NULL,
      country = NULL,
      countryCode = NULL,
      stateProvince = NULL,
      locality = NULL,
      .keep = "unused"
    )
.df |
A |
continent |
(string) Valid continent. See details. |
country |
Valid country name. See country_codes. |
countryCode |
Valid country code. See country_codes. |
stateProvince |
A sub-national region. |
locality |
A specific description of a location or place. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Values of continent should be one of "Africa", "Antarctica", "Asia",
"Europe", "North America", "Oceania" or "South America".
countryCode should be supplied according to the
ISO 3166-1 ALPHA-2
standard, as per TDWG advice.
Examples of countryCode:
AU
NZ
BR
Examples of locality:
Bariloche, 25 km NNE via Ruta Nacional 40 (=Ruta 237)
Queets Rainforest, Olympic National Park
A tibble with the requested columns added/reformatted.
set_coordinates() for numeric spatial data.
    df <- tibble::tibble(
      scientificName = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
      latitude = c(-35.27, -35.24, -35.83),
      longitude = c(149.33, 149.34, 149.34),
      eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
      countryCode = c("AU", "AU", "AU"),
      state = c("New South Wales", "New South Wales", "New South Wales"),
      locality = c("Melville Caves", "Melville Caves", "Bryans Swamp about 3km away")
    )

    # Reformat columns to Darwin Core Standard terms
    df |>
      set_locality(
        countryCode = countryCode,
        stateProvince = state,
        locality = locality
      )

    # Columns with valid Darwin Core terms as names are automatically detected
    # and checked. This will do the same as above.
    df |>
      set_locality(
        stateProvince = state
      )
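The continent and countryCode rules described above can be sketched as simple base R checks. This is an illustration only: the format test confirms that a code has the two-letter shape of an ISO 3166-1 alpha-2 code, not that the code is actually assigned, and it is not the validation that set_locality() performs.

```r
continents <- c("Africa", "Antarctica", "Asia", "Europe",
                "North America", "Oceania", "South America")

df <- data.frame(
  continent = c("Oceania", "Oceania", "Australia"),  # last value is deliberately invalid
  countryCode = c("AU", "NZ", "AUS")                 # "AUS" is an alpha-3 code
)

# Continent must be one of the accepted values
continent_ok <- df$continent %in% continents

# ISO 3166-1 alpha-2 codes are exactly two upper-case letters.
# This checks the format only, not whether the code is assigned.
code_format_ok <- grepl("^[A-Z]{2}$", df$countryCode)

data.frame(df, continent_ok, code_format_ok)
```

For a full check, the codes would be compared against countryCode_values() rather than a pattern.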
This function is a work in progress, and should be used with caution.
In raw collected data, many types of information can be captured in one
column. For example, the column name LMA_g.m2 contains the measured trait
(Leaf Mass per Area, LMA) and the unit of measurement (grams per meter
squared, g/m2), and recorded in that column are the values themselves. In
Darwin Core, these different types of information must be separated into
multiple columns so that they can be ingested correctly and aggregated with
other sources of data accurately.
This function converts information preserved in a single measurement column
into multiple columns (measurementID, measurementUnit, and
measurementType) as per Darwin Core standard.
    set_measurements(.df, cols = NULL, unit = NULL, type = NULL, .keep = "unused")
.df |
a |
cols |
vector of column names to be included as 'measurements'. Unquoted. |
unit |
vector of strings giving units for each variable |
type |
vector of strings giving a description for each variable |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Columns are nested in a
single column measurementOrFact that contains Darwin Core Standard
measurement fields. By nesting three measurement columns within the
measurementOrFact column, nested measurement columns can be converted to
long format (one row per measurement, per occurrence) while the original
data frame remains organised by one row per occurrence. Data
can be unnested into long format using tidyr::unnest().
A tibble with the requested fields added.
    library(tidyr)

    # Example data of plant species observations and measurements
    df <- tibble::tibble(
      Site = c("Adelaide River", "Adelaide River", "AgnesBanks"),
      Species = c("Corymbia latifolia", "Banksia aemula", "Acacia aneura"),
      Latitude = c(-13.04, -13.04, -33.60),
      Longitude = c(131.07, 131.07, 150.72),
      LMA_g.m2 = c(NA, 180.07, 159.01),
      LeafN_area_g.m2 = c(1.100, 0.913, 2.960)
    )

    # Reformat columns to Darwin Core Standard
    # Measurement columns are reformatted and nested in column `measurementOrFact`
    df_dwc <- df |>
      set_measurements(
        cols = c(LMA_g.m2, LeafN_area_g.m2),
        unit = c("g/m2", "g/m2"),
        type = c("leaf mass per area", "leaf nitrogen per area")
      )

    df_dwc

    # Unnest to view full long format data frame
    df_dwc |>
      tidyr::unnest(measurementOrFact)
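To show what "one row per measurement, per occurrence" means in practice, the base R sketch below reshapes two measurement columns into long format by hand. It is not the implementation of set_measurements(): it skips the nesting step, and the `measurementValue` column name is an assumption about where the recorded values end up.

```r
df <- data.frame(
  Species = c("Corymbia latifolia", "Banksia aemula", "Acacia aneura"),
  LMA_g.m2 = c(NA, 180.07, 159.01),
  LeafN_area_g.m2 = c(1.100, 0.913, 2.960)
)

cols <- c("LMA_g.m2", "LeafN_area_g.m2")
unit <- c("g/m2", "g/m2")
type <- c("leaf mass per area", "leaf nitrogen per area")

# Build one block of rows per measurement column, then stack the blocks
long <- do.call(rbind, lapply(seq_along(cols), function(i) {
  data.frame(
    Species = df$Species,
    measurementValue = df[[cols[i]]],  # assumed column name for the values
    measurementUnit = unit[i],
    measurementType = type[i]
  )
}))

long
```

Three occurrences with two measurements each give six rows, with the unit and type that were previously encoded in the column name now stored as data.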
Format fields that contain information about who made a specific observation
of an organism to a tibble using Darwin Core Standard.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
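As a point of comparison, the plain dplyr equivalent might look like the sketch below. This only illustrates the mutate() behaviour referred to above, with `.keep = "unused"` dropping the source column; it is not a description of how set_observer() is implemented, and the example data are made up.

```r
library(tibble)
library(dplyr)

df <- tibble(
  name = c("Crinia signifera", "Litoria peronii"),
  eventDate = c("2010-10-14", "2010-10-14"),
  observer = c("David Attenborough", "David Attenborough")
)

# Create the Darwin Core column and drop the column it was built from
df_renamed <- df |>
  mutate(recordedBy = observer, .keep = "unused")

names(df_renamed)
```

Unlike this bare mutate() call, the set_ function is described as adding informative errors on top.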
set_observer(.df, recordedBy = NULL, recordedByID = NULL, .keep = "unused")
.df |
A |
recordedBy |
Names of people, groups, or organizations responsible for recording the original occurrence. The primary collector or observer should be listed first. |
recordedByID |
The globally unique identifier for the person, people, groups, or organizations responsible for recording the original occurrence. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Examples of recordedBy values:
José E. Crespo
Examples of recordedByID values:
c("https://orcid.org/0000-0002-1825-0097", "https://orcid.org/0000-0002-1825-0098")
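If identifiers are supplied as ORCID URLs, as in the example above, a quick format pre-check in base R could look like the following sketch. The helper `is_orcid_url()` is hypothetical and not part of corella; it checks only the URL shape, not the ORCID checksum, and other kinds of globally unique identifiers would need their own checks.

```r
# Hypothetical helper (not a corella function): TRUE when a value looks like
# an ORCID URL, i.e. four dash-separated groups of four characters where the
# final character may be a digit or "X"
is_orcid_url <- function(x) {
  pattern <- "^https://orcid\\.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$"
  grepl(pattern, x)
}

ids <- c(
  "https://orcid.org/0000-0002-1825-0097",
  "https://orcid.org/0000-0002-1825-0098",
  "0000-0002-1825-0097" # bare identifier without the URL prefix
)

is_orcid_url(ids)
```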
A tibble with the requested fields added/reformatted.
df <- tibble::tibble(
  name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
  latitude = c(-35.27, -35.24, -35.83),
  longitude = c(149.33, 149.34, 149.34),
  eventDate = c("2010-10-14", "2010-10-14", "2010-10-14"),
  observer = c("David Attenborough", "David Attenborough", "David Attenborough")
)

# Reformat columns to Darwin Core terms
df |>
  set_observer(
    recordedBy = observer
  )
Format fields that uniquely identify each occurrence record and specify the type
of record. occurrenceID and basisOfRecord are required fields for
occurrence records, and should be added to a data set so that it
conforms to Darwin Core Standard prior to submission.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for fields in
the Darwin Core Standard.
set_occurrences(
  .df,
  occurrenceID = NULL,
  basisOfRecord = NULL,
  occurrenceStatus = NULL,
  .keep = "unused",
  .keep_composite = "all",
  .messages = TRUE
)
.df |
a |
occurrenceID |
A character string. Every occurrence should have an
|
basisOfRecord |
Record type. Only accepts
|
occurrenceStatus |
Either |
.keep |
Control which columns from |
.keep_composite |
Control which columns from |
.messages |
Logical: Should progress message be shown? Defaults to |
Examples of occurrenceID values:
000866d2-c177-4648-a200-ead4007051b9
http://arctos.database.museum/guid/MSB:Mamm:233627
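Because these identifiers are meant to uniquely identify each occurrence record, it can help to confirm that a candidate ID column has no missing or repeated values before passing it on. The following is a base R sketch with made-up data; it is independent of whatever checks set_occurrences() itself performs.

```r
# Candidate identifiers; the third value deliberately repeats the first
ids <- c(
  "000866d2-c177-4648-a200-ead4007051b9",
  "http://arctos.database.museum/guid/MSB:Mamm:233627",
  "000866d2-c177-4648-a200-ead4007051b9"
)

# TRUE only when there are no missing values and no duplicates
ids_ok <- !anyNA(ids) && anyDuplicated(ids) == 0

# List the offending values so they can be fixed at the source
duplicated_ids <- unique(ids[duplicated(ids)])

ids_ok
duplicated_ids
```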
Accepted basisOfRecord values are one of:
"humanObservation", "machineObservation", "livingSpecimen",
"preservedSpecimen", "fossilSpecimen", "materialCitation"
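A simple way to find values that fall outside this list is a set comparison in base R. The sketch below hard-codes the accepted values listed above so that it runs without corella installed; the `basis` vector is made-up input.

```r
# Accepted values, copied from the list above
accepted <- c(
  "humanObservation", "machineObservation", "livingSpecimen",
  "preservedSpecimen", "fossilSpecimen", "materialCitation"
)

# Made-up column values; matching is case-sensitive
basis <- c("humanObservation", "HumanObservation", "preservedSpecimen", "photo")

# Values that are not in the accepted vocabulary
invalid <- setdiff(basis, accepted)
invalid
```

Note that "HumanObservation" is flagged because the comparison is an exact, case-sensitive match.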
A tibble with the requested columns added/reformatted.
basisOfRecord_values() for accepted values for the basisOfRecord
field; random_id(), composite_id() or sequential_id() for formatting
ID columns; set_abundance() for occurrence-level counts.
df <- tibble::tibble(
  scientificName = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
  latitude = c(-35.27, -35.24, -35.83),
  longitude = c(149.33, 149.34, 149.34),
  eventDate = c("2010-10-14", "2010-10-14", "2010-10-14")
)

# Add occurrence information
df |>
  set_occurrences(
    occurrenceID = composite_id(random_id(), eventDate), # add composite ID
    basisOfRecord = "humanObservation"
  )
Format the field scientificName, the lowest identified taxonomic name of an
occurrence, along with the rank and authorship of the provided name to a
tibble using Darwin Core Standard.
set_scientific_name(
  .df,
  scientificName = NULL,
  scientificNameAuthorship = NULL,
  taxonRank = NULL,
  .keep = "unused"
)
.df |
A |
scientificName |
The full scientific name at the lowest-level taxonomic rank that can be determined. |
scientificNameAuthorship |
The authorship information for |
taxonRank |
The taxonomic rank of |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
In practice this function is used no differently from mutate(), but gives
users some informative errors, and serves as a useful lookup for accepted
column names in the Darwin Core Standard.
Examples of scientificName values (we specify the rank in parentheses, but
users should not include this information):
Coleoptera (order)
Vespertilionidae (family)
Manis (genus)
Ctenomys sociabilis (genus + specificEpithet)
Ambystoma tigrinum diaboli (genus + specificEpithet + infraspecificEpithet)
Examples of scientificNameAuthorship:
(Györfi, 1952)
R. A. Graham
(Martinovský) Tzvelev
Examples of taxonRank:
order
genus
subspecies
infraspecies
A tibble with the requested columns added/reformatted.
set_taxonomy() for taxonomic name information.
df <- tibble::tibble(
  name = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
  latitude = c(-35.27, -35.24, -35.83),
  longitude = c(149.33, 149.34, 149.34),
  eventDate = c("2010-10-14", "2010-10-14", "2010-10-14")
)

# Reformat columns to Darwin Core Standard terms
df |>
  set_scientific_name(
    scientificName = name
  )
Format fields that contain taxonomic name information from kingdom to
species, as well as the common/vernacular name, to a tibble using
Darwin Core Standard.
In practice this is no different from using mutate(), but gives some
informative errors, and serves as a useful lookup for accepted column names in
the Darwin Core Standard.
set_taxonomy(
  .df,
  kingdom = NULL,
  phylum = NULL,
  class = NULL,
  order = NULL,
  family = NULL,
  genus = NULL,
  specificEpithet = NULL,
  vernacularName = NULL,
  .keep = "unused"
)
.df |
A |
kingdom |
The kingdom name of identified taxon. |
phylum |
The phylum name of identified taxon. |
class |
The class name of identified taxon. |
order |
The order name of identified taxon. |
family |
The family name of identified taxon. |
genus |
The genus name of the identified taxon. |
specificEpithet |
The name of the first species or species epithet of
the |
vernacularName |
The common or vernacular name of the identified taxon. |
.keep |
Control which columns from .data are retained in the output.
Note that unlike |
Examples of specificEpithet:
If scientificName is Abies concolor, the specificEpithet is concolor.
If scientificName is Semisulcospira gottschei, the specificEpithet is gottschei.
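When a data set only holds full binomials, genus and specificEpithet columns can be derived with base R before calling set_taxonomy(). The sketch below assumes plain two-word names separated by a single space; names that carry authorship, an infraspecific epithet, or a higher rank need more careful parsing.

```r
scientific_name <- c("Abies concolor", "Semisulcospira gottschei")

# Split each name on the space between genus and epithet
parts <- strsplit(scientific_name, " ", fixed = TRUE)

# First word is the genus, second word is the specific epithet
genus <- vapply(parts, function(p) p[1], character(1))
specific_epithet <- vapply(parts, function(p) p[2], character(1))

genus
specific_epithet
```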
A tibble with the requested columns added/reformatted.
set_scientific_name() for adding scientificName and authorship information.
df <- tibble::tibble(
  scientificName = c("Crinia signifera", "Crinia signifera", "Litoria peronii"),
  fam = c("Myobatrachidae", "Myobatrachidae", "Hylidae"),
  ord = c("Anura", "Anura", "Anura"),
  latitude = c(-35.27, -35.24, -35.83),
  longitude = c(149.33, 149.34, 149.34),
  eventDate = c("2010-10-14", "2010-10-14", "2010-10-14")
)

# Reformat columns to Darwin Core terms
df |>
  set_scientific_name(
    scientificName = scientificName
  ) |>
  set_taxonomy(
    family = fam,
    order = ord
  )
Checks whether a data.frame or tibble conforms to Darwin
Core Standard and suggests how to standardise a data frame that is not
standardised to minimum Darwin Core requirements. This is intended as
users' go-to function for figuring out how to get started standardising
their data.
Output provides a summary to users about which column names match valid Darwin Core terms, the minimum required column names/terms (and which ones are missing), and a suggested workflow to add any missing terms.
suggest_workflow(.df)
.df |
A |
Invisibly returns the input data.frame/tibble, but primarily
called for the side-effect of running check functions on that input.
df <- tibble::tibble(
  scientificName = c("Callocephalon fimbriatum", "Eolophus roseicapilla"),
  latitude = c(-35.310, "-35.273"), # deliberate error for demonstration purposes
  longitude = c(149.125, 149.133),
  eventDate = c("14-01-2023", "15-01-2023"),
  status = c("present", "present")
)

# Summarise whether your data conforms to Darwin Core Standard.
# See a suggested workflow to amend or add missing information.
df |>
  suggest_workflow()
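The deliberate error in the example above is easy to miss: mixing a number and a string inside c() silently coerces the whole latitude column to character. The base R sketch below shows that coercion and one possible repair, and also converts the day-month-year dates to ISO 8601 (YYYY-MM-DD), the format Darwin Core recommends for dates. Which of these issues suggest_workflow() reports, and how, is not shown here.

```r
# Mixing numeric and character values coerces the vector to character
latitude <- c(-35.310, "-35.273")
class(latitude)

# One possible repair: convert back to numeric
latitude_num <- as.numeric(latitude)

# Dates stored as day-month-year, converted to ISO 8601 strings
event_date <- c("14-01-2023", "15-01-2023")
event_date_iso <- format(as.Date(event_date, format = "%d-%m-%Y"))
event_date_iso
```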