The Joy of FAIR
Working with FAIR data in a digital environment means working with machine-readable data, which calls for different activities and different steps to handle the data. This section gives a brief background on machine-readable data and the FAIR data principles in the context of chemistry, what you can do with machine-readable chemical data, and why it is important to prepare data to be FAIR and discoverable in domain repositories.
FAIR-enabled data are structured and described to facilitate automated processing from machine to machine and system to system, and thus can be used for AI/ML and other digital applications. “Fully AI-Ready” data structures are systematically organized and consistently formatted so that algorithms can parse and operate on them. Processable data structures that adhere to discipline-specific data standards and align with the FAIR data principles help ensure the quality and accuracy of reuse. FAIR data are accessible through programmatic interfaces and can be accessed directly from user code for automated exchange and analysis.
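As a minimal sketch of what "consistently formatted so that algorithms can parse and operate on them" can look like in practice, the example below parses a small JSON record and checks that the expected fields are present. The record layout, the field names, and the helper functions `parse_record` and `get_property` are assumptions made for illustration; they do not correspond to any particular repository schema or standard.

```python
import json

# Illustrative record only: the layout and field names are hypothetical,
# not a published chemical data standard.
RECORD_JSON = """
{
  "identifier": {
    "inchi": "InChI=1S/H2O/h1H2",
    "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
  },
  "name": "water",
  "properties": [
    {"name": "molar_mass", "value": 18.015, "unit": "g/mol"}
  ]
}
"""

# Fields this sketch assumes every record must carry.
REQUIRED_FIELDS = ("identifier", "name", "properties")


def parse_record(text):
    """Parse a JSON record and check that it has the expected structure."""
    record = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")
    for prop in record["properties"]:
        # Each property needs a numeric value and an explicit unit.
        if not isinstance(prop.get("value"), (int, float)) or "unit" not in prop:
            raise ValueError(f"property lacks a numeric value or a unit: {prop}")
    return record


def get_property(record, name):
    """Return (value, unit) for a named property of a parsed record."""
    for prop in record["properties"]:
        if prop["name"] == name:
            return prop["value"], prop["unit"]
    raise KeyError(name)
```

Because the value and its unit sit in predictable places, a call such as `get_property(parse_record(RECORD_JSON), "molar_mass")` can return both without any manual interpretation, and a record that lacks a required field is rejected before it reaches an analysis step.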
In practice, these concepts as framed by the FAIR principles manifest as concrete technical attributes for programmers and data scientists. For those not used to working from an informatics or data-structure perspective, they can seem abstract. To make sense of how the FAIR data principles apply to machine-readability, it is important to appreciate how use cases for chemical data map to the technical attributes that enable data to be reused through automated means.
There are a number of scenarios where it is useful to navigate across distributed data resources using programmatic methods: for example, a global search for specific chemicals, cross-exchange of chemical information between data repositories, validation of converted or predicted chemical representations, or integration of distributed data for compiled meta-analysis [PT23]. A machine processing workflow for reusing data might proceed through stages of discovery, retrieval, validation, curation, compilation, visualization, analysis and derivation. Each stage depends on some combination of consistently structured data, granular metadata description and reproducible protocols. Aligning these with FAIR community practices makes the workflow better suited to repeated, reliable and scalable automated data reuse.
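A few of the workflow stages named above (discovery, validation and compilation) can be sketched over in-memory data. Everything here is an illustrative assumption: `REPO_A` and `REPO_B` are hypothetical stand-ins for distributed repositories, the record fields are invented for the example, and the validation step checks only the character layout of an InChIKey, not whether the key is chemically correct.

```python
import re

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Hypothetical stand-ins for two distributed data resources.
REPO_A = [
    {"inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "name": "water"},
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "name": "ethanol"},
]
REPO_B = [
    {"inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "formula": "H2O"},
    {"inchikey": "not-a-valid-key", "formula": "C2H6O"},
]


def discover(repositories, inchikey):
    """Discovery: find every record for one chemical across all repositories."""
    return [
        record
        for repo in repositories
        for record in repo
        if record.get("inchikey") == inchikey
    ]


def validate(records):
    """Validation: keep only records whose identifier is well formed."""
    return [
        record
        for record in records
        if INCHIKEY_PATTERN.match(record.get("inchikey", ""))
    ]


def compile_records(records):
    """Compilation: merge records that share an identifier into one entry."""
    merged = {}
    for record in records:
        merged.setdefault(record["inchikey"], {}).update(record)
    return merged


compiled = compile_records(validate(REPO_A + REPO_B))
```

In this sketch the shared, consistently formatted identifier is what lets the name from one source and the formula from the other be combined automatically, while the malformed record is dropped at validation rather than silently merged.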
The FAIR Data Principles are framed from the perspective of data reuse. Process engineers, researchers, and an increasing number of automated processes need complete and unambiguous descriptions of research results in convenient forms that are easy to find, retrieve and compile. To get more FAIR data out in consumable forms, we also need to consider the other side of the equation: the critical parameters for documenting data during the lifecycle upstream of sharing, so that meaning and quality can be assessed and reassessed appropriately.