Metadata and Minimum Information
Metadata and their schemas
Metadata can be described as "data about data": structured information that describes data, such as the content of a dataset or file, or the context in which it was generated. Typical metadata fields include the title, keywords, and the acquisition method or analytical technique. Metadata should be supported by controlled vocabularies (ideally ontologies) and/or data formats.
Metadata becomes more specialized as the domain it describes becomes narrower. The hierarchy of domains can be mirrored by a hierarchical metadata structure, ranging from a generic, completely domain-independent layer to the most method- and application-specific ones.
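The layering described above can be sketched in a few lines of Python. This is only an illustration: the three layer names and all field names and values are assumptions made for the example, not part of any standard.

```python
from collections import ChainMap

# Hypothetical metadata layers, from most generic to most specific.
generic = {"title": "Titration of acetic acid", "keywords": ["titration", "pH"]}
domain = {"discipline": "chemistry", "sample": "acetic acid, 0.1 M"}
method = {"analytical_technique": "pH measurement", "temperature_celsius": 25.0}

# ChainMap searches its maps from left to right, so the most specific
# layer is listed first and the generic layer acts as the fallback.
record = ChainMap(method, domain, generic)

# A single lookup transparently reaches into whichever layer holds the field.
print(record["analytical_technique"])  # found in the method layer
print(record["title"])                 # found in the generic layer

# Flatten all layers into one plain dictionary, e.g. for export.
flat = dict(record)
```

Keeping the layers separate makes it possible to reuse the generic layer unchanged across datasets while swapping only the method-specific part.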
Domain-Independent Metadata:
Metadata can be domain-independent, focusing mostly on citation details, such as the title, the keywords, the people and institutions involved, or references to other data. Domain-independent metadata standards can be complemented by more domain-specific metadata.
- Dublin Core is a general-purpose set of fifteen elements for describing networked resources. Since its first publication in 1995, the set has been adapted and extended by other standards.
- DataCite is a DOI provider that maintains a schema of core metadata for research data. The standard is community-driven and aims to integrate with other standards such as Dublin Core and the ORCID Record Schema.
- The OpenAIRE Guidelines for Data Archive Managers provide an infrastructure which facilitates interoperability between repositories adhering to those guidelines and enhances data exposure and visibility. OpenAIRE has adopted the DataCite schema with some minor adjustments, such as accepting persistent identifier schemes other than the DOI and changing the obligation level of some properties.
- PROV: The W3C standard for provenance information can be used to provide information on the origin of scientific data.
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework for harvesting metadata and can be applied to a wide variety of metadata formats. Whatever other formats are offered, Dublin Core metadata should always be included.
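To make two of the standards listed above more tangible, the following minimal sketch serializes a few Dublin Core elements as XML and builds an OAI-PMH harvesting request. The Dublin Core element namespace, the `ListRecords` verb, and the `oai_dc` metadata prefix come from the respective specifications. The plain `record` wrapper element, the helper function names, the repository URL, and the example values are assumptions made for illustration. A real OAI-PMH response wraps the elements in a dedicated `oai_dc` container instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the fifteen Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a mapping of Dublin Core element names to lists of values.

    The <record> wrapper is a simplification for this sketch.
    """
    root = ET.Element("record")
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


def oai_pmh_list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a (hypothetical) OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


xml_text = dublin_core_record(
    {
        "title": ["Titration of acetic acid"],
        "creator": ["Doe, Jane"],
        "subject": ["titration", "pH"],
    }
)
url = oai_pmh_list_records_url("https://repository.example.org/oai")
print(xml_text)
print(url)
```

Repeating an element, as done for `subject` here, is permitted because every Dublin Core element is optional and repeatable.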
Domain-Specific Metadata:
Metadata can also be domain-specific, i.e. tied to a specific acquisition method or analytical technique (such as a pH measurement in the context of a certain reaction) that is rarely relevant outside its own domain, in this case chemistry.
- The Core Scientific Metadata Model (CSMD) is a model for scientific studies that includes entity classes for facilities, users, investigations, instruments, datafiles, datasets, and samples. Most experimental parameters and results can be captured within these classes. There are additional classes for, e.g., publications, data formats, and sample types. Besides the publication of the specification as a UML (Unified Modeling Language) class model, there is also a representation as an ontology. Future releases will focus on the integration of the PROV model.
- Investigation Study Assay (ISA) is a metadata framework focusing on biological investigations. It defines schemas for representing data in machine-readable formats (ISA-Tab and JSON), can be applied to many methods, and allows ontology references to be included for the entities.
- IUPAC FAIRSpec is a framework under development at IUPAC that aims to cover spectroscopic data, including NMR spectroscopy.
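The nesting of investigations, studies, and assays that gives ISA its name can be sketched with a few Python dataclasses. This is a simplified illustration only: the class attributes and the example values are assumptions and do not reproduce the actual ISA-Tab or ISA-JSON schemas.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Assay:
    # Illustrative attributes, not the official ISA field names.
    measurement_type: str
    technology_type: str
    data_files: list[str] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: list[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: list[Study] = field(default_factory=list)


investigation = Investigation(
    title="Metabolite profiling of plant extracts",
    studies=[
        Study(
            title="Leaf extracts, batch 1",
            assays=[
                Assay(
                    measurement_type="metabolite profiling",
                    technology_type="NMR spectroscopy",
                    data_files=["leaf_batch1_1H.jdx"],
                )
            ],
        )
    ],
)

# asdict() converts the nested dataclasses recursively, so the whole
# hierarchy can be written out as machine-readable JSON in one step.
serialized = json.dumps(asdict(investigation), indent=2)
print(serialized)
```

A typed hierarchy like this makes it harder to attach an assay to nothing at all, which is one reason such frameworks prescribe a fixed nesting.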
Minimum information standards (MI)
Minimum information (MI) standards are guidelines that specify which metadata is required when reporting data. They also outline which format should be used for this information and for the data itself. The set of MI depends on the type of data and is established to ensure that data are deposited following the FAIR principles. Minimum information is thus a subset of the rich metadata that can accompany data.
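In practice, a minimum information guideline can be enforced with a simple completeness check before data are deposited. The sketch below assumes a hypothetical set of required fields. It does not reproduce any published MI standard.

```python
# Hypothetical minimum information for one type of measurement.
REQUIRED_FIELDS = {"title", "creator", "analytical_technique", "instrument", "sample_id"}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty, sorted by name."""
    return sorted(name for name in required if record.get(name) in (None, "", []))


record = {
    "title": "1H NMR of compound 7",
    "creator": "Doe, Jane",
    "analytical_technique": "NMR spectroscopy",
    "instrument": "",  # present but empty, so it counts as missing
    "comment": "Optional rich metadata beyond the minimum is still allowed.",
}

problems = missing_fields(record)
if problems:
    print("Record is incomplete, missing:", ", ".join(problems))
```

Fields outside the required set, such as `comment` here, are ignored by the check, reflecting that minimum information is a subset of the metadata a record may carry.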
Minimum Information for Chemical Investigations (MIChI)
Due to the increasing amount of data produced by biology and related disciplines, such as omics, bioinformatics and biochemistry, a large set of minimum information guidelines for different methods has been developed. These were promoted by the Minimum Information for Biological and Biomedical Investigations (MIBBI) project.
Although the explored part of chemical space and the amount of chemical data produced are increasing rapidly, there have been only a few attempts to define guidelines for minimum information in chemistry, e.g. the Metabolomics Standards Initiative (MSI) or the Collaboratory for the Multi-scale Chemical Sciences (CMCS). NFDI4Chem will address this issue by preparing recommendations on Minimum Information for Chemical Investigations (MIChI), which include standards for methods such as mass spectrometry, nuclear magnetic resonance and optical spectroscopic methods. International workshops are already being carried out to start the necessary discussion about MIChI.
Software projects such as electronic lab notebooks or repositories often define their own layer of specific minimum metadata for chemical experiments, which is based on existing standards (e.g. for metabolomics) or defined by the data formats they import.
Existing ontologies are a good starting point to identify the information necessary to describe a method, results, samples, or other entities. Furthermore, controlled vocabularies and ontologies define what additional metadata is allowed in order to create rich metadata, in turn improving the data's FAIRness. Examples of formats with corresponding ontologies or controlled vocabularies are mzML, CIF, NeXus, and the Allotrope Data Format (ADF).
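The role of a controlled vocabulary can be illustrated by restricting a free-text field to a fixed list of terms and attaching the term identifier to each accepted value. The vocabulary below is made up for this sketch: the `EX:` identifiers are placeholders and do not refer to entries in any real ontology.

```python
# Placeholder vocabulary; real projects would load terms from an ontology.
VOCABULARY = {
    "EX:0001": "mass spectrometry",
    "EX:0002": "nuclear magnetic resonance spectroscopy",
    "EX:0003": "infrared spectroscopy",
}

# Reverse index from normalized label to term identifier.
LABEL_TO_ID = {label: term_id for term_id, label in VOCABULARY.items()}


def annotate(value):
    """Map a free-text value to a vocabulary term or raise ValueError."""
    key = value.strip().lower()
    if key not in LABEL_TO_ID:
        raise ValueError(f"{value!r} is not a term of the controlled vocabulary")
    return {"label": key, "term_id": LABEL_TO_ID[key]}


term = annotate("  Mass Spectrometry ")
print(term)
```

Storing the identifier alongside the label keeps the metadata unambiguous even if the human-readable label is later renamed or translated.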
The Chemical Analysis Metadata Platform (ChAMP) is a project which focuses on defining a framework for chemical analysis methods.
Metadata and the FAIR Principles
The FAIR Guiding Principles apply not only to data but also to the associated metadata. More information can be found in the FAIR article or on GO FAIR.
Metadata, as well as the data itself, should be assigned unique persistent identifiers (PIDs) so that they can be referenced in publications and other datasets. By organizing these PIDs hierarchically, each parameter in the metadata can be referenced individually. Machine-readable metadata should be provided in a standardized format, while the metadata entities should be well-documented regarding semantics and the relations between the entities and the actual data. This can be achieved by defining the metadata as an ontology, or a schema in a machine-readable serialization, such as XML or JSON. Schemas help in indexing metadata for search engines, repositories, or other data registries, and also help improve interoperability (the I in FAIR). Most of the other FAIR guidelines also apply to metadata.
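One possible way to organize identifiers hierarchically is to combine the PID of a dataset with a path to the individual parameter inside its JSON metadata. The sketch below assumes such a scheme. The base identifier is a placeholder, and the `#/path` fragment syntax is an illustrative choice rather than a rule of any PID system.

```python
import json

# Metadata as it might arrive in a machine-readable JSON serialization.
metadata_json = """
{
  "title": "Titration of acetic acid",
  "measurement": {"technique": "pH measurement", "ph": 4.76},
  "sample": {"name": "acetic acid", "concentration_mol_per_l": 0.1}
}
"""


def enumerate_pids(base_pid, metadata, path=""):
    """Yield (identifier, value) pairs for every leaf of nested metadata.

    Only dictionaries are descended into; lists and scalars are leaves.
    """
    if isinstance(metadata, dict):
        for key, value in metadata.items():
            yield from enumerate_pids(base_pid, value, f"{path}/{key}")
    else:
        yield f"{base_pid}#{path}", metadata


# "hdl:example/dataset-1" is a placeholder, not a registered identifier.
pids = dict(enumerate_pids("hdl:example/dataset-1", json.loads(metadata_json)))
for identifier, value in pids.items():
    print(identifier, "->", value)
```

With identifiers of this kind, a publication could cite a single parameter, such as the measured pH, instead of the dataset as a whole.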
Sources and further information
A short introductory video to Metadata (in German) can be found here.