Metadata and Minimum Information
Metadata and their schemas
Metadata can be described as "data about data": data that describes other data, such as the content of a dataset or file, or the context in which that data was created. Specific examples include the title, keywords, and the acquisition method used with a particular analytical technique. Metadata should be supported by controlled vocabularies (ideally ontologies) and/or data formats.
Metadata becomes more specialized as the domain it describes becomes narrower. The hierarchy of domains can be mirrored by a hierarchical metadata structure, which allows several standards to be layered: generic, completely domain-independent metadata comes first, followed by increasingly domain-specific metadata.
Domain-Independent Metadata:
Metadata can be domain-independent, focusing mostly on citation details, such as the title, the keywords, the people and institutions involved, or references to other data. Domain-independent metadata standards can be complemented by more domain-specific metadata.
- Dublin Core is a general set of fifteen elements for describing networked resources. Since its first publication in 1995, this set has been adapted and extended by other standards.
- DataCite is a DOI provider that offers a schema of core metadata for research data. The standard is community-driven and aims to integrate with other standards such as Dublin Core and the ORCID Record Schema.
- OpenAIRE Guidelines for Data Archive Managers: OpenAIRE provides an infrastructure that facilitates interoperability between repositories adhering to these guidelines, which in turn enhances data exposure and visibility. OpenAIRE has adopted the DataCite schema with some minor adjustments, such as accepting persistent identifier schemes other than the DOI and changing the obligation levels of some properties.
- PROV: The W3C standard for provenance information can be used to describe the origin of scientific data.
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework for harvesting metadata and can be applied to a wide variety of metadata formats. These should always include Dublin Core metadata.
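To make the idea of a domain-independent record more concrete, the following is a minimal sketch of how a handful of Dublin Core elements could be serialized as XML using only the Python standard library. The element names and the namespace URI come from the Dublin Core element set; the function name `to_dublin_core_xml`, the wrapping `metadata` element, and all example values are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def to_dublin_core_xml(record: dict[str, list[str]]) -> str:
    """Serialize a mapping of Dublin Core element names to values as XML."""
    unknown = set(record) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in record.items():
        for value in values:  # every element is optional and repeatable
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical dataset description; all values are invented.
example = {
    "title": ["1H NMR spectrum of caffeine"],
    "creator": ["Doe, Jane"],
    "subject": ["NMR spectroscopy", "caffeine"],
    "date": ["2023-05-04"],
    "type": ["Dataset"],
}
xml_text = to_dublin_core_xml(example)
print(xml_text)
```

Because every element is optional and repeatable, the sketch accepts a list of values per element and rejects only names that are not part of the fifteen-element set.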
Domain-Specific Metadata:
Metadata can also be domain-specific, such as the acquisition method used with a particular analytical technique or the pH of a reaction, which are relevant to few domains other than chemistry.
- Core Scientific Metadata Model (CSMD) is a model for scientific studies that includes entity classes for facilities, users, investigations, instruments, datafiles, datasets, and samples. Most experimental parameters and results can be captured within these classes. Additionally, there are classes for, e.g., publications, data formats, and sample types. Besides the publication of the specification as a UML (Unified Modeling Language) class model, there is also a representation as an ontology. Future releases will focus on the integration of the PROV model.
- ISA (Investigation Study Assay) is a metadata framework focusing on biological investigations. It provides schemas for representation in data formats (ISA-Tab and JSON), can be applied to many methods, and allows ontology references to be included for the entities.
- IUPAC FAIRSpec covers spectroscopic data, including NMR spectroscopy. The project is still preliminary and under development.
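The layering of domain-independent and domain-specific metadata can be sketched as follows. This is only an illustration of the idea, not a representation of any of the standards listed above: the layer names, the `layer:key` prefix convention, the helper functions, and all values are hypothetical.

```python
# Layers ordered from domain-independent to most specific.
# All layer names, keys, and values are hypothetical.
LAYERS = [
    ("citation", {"title": "1H NMR spectrum of caffeine", "creator": "Doe, Jane"}),
    ("chemistry", {"solvent": "CDCl3"}),
    ("nmr", {"nucleus": "1H", "frequency_MHz": 400.0}),
]


def combine_layers(layers):
    """Merge layers into one flat record, prefixing keys with the layer name."""
    record = {}
    for layer_name, fields in layers:
        for key, value in fields.items():
            record[f"{layer_name}:{key}"] = value
    return record


def view(record, layer_names):
    """Return only the entries that belong to the given layers."""
    prefixes = tuple(f"{name}:" for name in layer_names)
    return {k: v for k, v in record.items() if k.startswith(prefixes)}


record = combine_layers(LAYERS)
# A generic catalogue might only understand the citation layer.
generic_only = view(record, ["citation"])
print(generic_only)
```

A generic service can read the layers it understands and ignore the rest, while a chemistry-aware repository can additionally interpret the more specific layers.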
Minimum information standards (MI)
Minimum information standards (MI) are guidelines that specify which metadata is required when reporting data. They also outline which format to use for this information and for the data itself. The set of MI depends on the type of data and is established to ensure that data are deposited following the FAIR principles. Minimum information is therefore a subset of the rich metadata that can accompany data.
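In practice, a minimum information guideline can be treated as a checklist against which a metadata record is validated. The sketch below assumes a purely hypothetical checklist for an NMR experiment; the field names, the expected types, and the function `check_minimum_information` are illustrative and are not taken from any published MI standard.

```python
NUMBER = (int, float)

# Hypothetical checklist: required field -> accepted Python type(s).
MI_CHECKLIST = {
    "sample_id": str,
    "nucleus": str,
    "solvent": str,
    "frequency_MHz": NUMBER,
    "temperature_K": NUMBER,
}


def check_minimum_information(metadata, checklist):
    """Return a list of problems; an empty list means the checklist is met."""
    problems = []
    for field, expected_type in checklist.items():
        value = metadata.get(field)
        if value is None or value == "":
            problems.append(f"missing: {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type: {field}")
    return problems


# Hypothetical record: richer than the checklist in one respect
# ("operator"), but lacking the required temperature.
example = {
    "sample_id": "S-001",
    "nucleus": "1H",
    "solvent": "CDCl3",
    "frequency_MHz": 400.0,
    "operator": "Doe, Jane",
}
problems = check_minimum_information(example, MI_CHECKLIST)
print(problems)
```

Fields beyond the checklist are deliberately accepted, reflecting that minimum information is a subset of the rich metadata a record may carry.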
Minimum Information for Chemical Investigations (MIChI)
Driven by the increasing amount of data produced by omics technologies, biology and related disciplines such as bioinformatics and biochemistry have developed a large set of minimum information guidelines for different methods. These were promoted by the Minimum Information for Biological and Biomedical Investigations (MIBBI) project.
Although the explored part of chemical space and the amount of chemical data produced are growing rapidly, there have been only a few attempts to define guidelines for minimum information in chemistry, e.g. by the Metabolomics Standards Initiative (MSI) or the Collaboratory for the Multi-scale Chemical Sciences (CMCS). NFDI4Chem is addressing this issue by working on Minimum Information for Chemical Investigations (MIChI), which includes standards for methods such as mass spectrometry, nuclear magnetic resonance, and other spectroscopic methods. International workshops are already being held to start the necessary discussion about MIChI.
Software projects such as electronic lab notebooks or repositories often define their own layer of specific minimum metadata for chemical experiments, which is either based on existing standards, e.g. for metabolomics, or determined by the data formats they import.
Existing ontologies are a good starting point for identifying the information necessary to describe a method, results, samples, or other entities. Furthermore, controlled vocabularies and ontologies define which additional metadata is allowed in order to create rich metadata, in turn improving the data's FAIRness. Examples of formats with corresponding ontologies or controlled vocabularies are mzML, CIF, NeXus, and the Allotrope Data Format (ADF).
The Chemical Analysis Metadata Platform (ChAMP) is a project that focuses on defining a framework for chemical analysis methods.
Metadata and the FAIR Principles
The FAIR Guiding Principles apply not only to data but also to the associated metadata. More information can be found in the FAIR article or on GoFair.
Metadata, like the data itself, should be assigned unique persistent identifiers (PIDs) so that it can be referenced in publications and other datasets. By organizing these PIDs hierarchically, each parameter in the metadata can be referenced individually. Machine-readable metadata should be provided in a standardized format, and the metadata entities should be well documented with regard to their semantics and their relations to each other and to the actual data. This can be achieved by defining the metadata as an ontology or schema, e.g. in XML or JSON. Schemas help in indexing metadata for search engines, repositories, or other data registries, and also help improve interoperability, the I in FAIR. Most of the other FAIR guidelines also apply to metadata.
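One way to picture hierarchically organized identifiers is to derive an identifier for every individual parameter from the identifier of the metadata record. The sketch below is only an illustration: the base identifier `doi:10.1234/example-dataset` is invented, and appending a `#/path/to/parameter` fragment is an assumed convention, not something prescribed by DOI or any other PID system.

```python
import json


def enumerate_parameter_ids(base_pid, metadata, path=""):
    """Yield (identifier, value) pairs for every leaf parameter."""
    for key, value in metadata.items():
        child_path = f"{path}/{key}"
        if isinstance(value, dict):
            # Descend into nested groups of parameters.
            yield from enumerate_parameter_ids(base_pid, value, child_path)
        else:
            yield f"{base_pid}#{child_path}", value


# Hypothetical machine-readable metadata record in JSON.
metadata = json.loads("""
{
  "title": "1H NMR spectrum of caffeine",
  "acquisition": {"nucleus": "1H", "frequency_MHz": 400.0}
}
""")

BASE_PID = "doi:10.1234/example-dataset"  # invented identifier
parameter_ids = dict(enumerate_parameter_ids(BASE_PID, metadata))
for pid, value in parameter_ids.items():
    print(pid, "->", value)
```

With such a scheme, a publication could point to a single parameter, for example the acquisition nucleus, rather than to the record as a whole.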
Sources and further information
A short introductory video to Metadata can be found here (in German):