Abstract
FAIRification of personal health data is of utmost importance to improve health research and political as well as medical decision-making, which ultimately contributes to better health of the general population. Despite the many advances in information technology, several obstacles such as interoperability problems remain, and relevant research on a health topic of interest is likely to be missed due to time-consuming search and access processes. A recent example is the COVID-19 pandemic, where a better understanding of the virus’ transmission dynamics as well as preventive and therapeutic options would have improved public health and medical decision-making. Consequently, the NFDI4Health Task Force COVID-19 was established to foster the FAIRification of German COVID-19 studies.
This paper describes the various steps that have been taken to create low barrier workflows for scientists in finding and accessing German COVID-19 research. It provides an overview on the building blocks for FAIR health research within the Task Force COVID-19 and how this initial work was subsequently expanded by the German consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health) to cover a wider range of studies and research areas in epidemiological, public health and clinical research. Lessons learned from the Task Force helped to improve the respective tasks of NFDI4Health.
1 Introduction
Knowledge gain in health research is typically obtained from empirical studies, such as clinical trials, epidemiological studies or public health surveys. In times of digitalization, data that have been generated in one empirical health study frequently become relevant in other research contexts. In fact, reusable data with rich metadata is one asset of good scientific practice. Thus, it is important to provide scientists with an accessible and comprehensive overview of existing studies that are available for reuse. In an ideal world, a researcher would refer to a centralized information source, enter a set of search terms and obtain an overview of potentially relevant studies for a scientific question. Unnecessary or redundant research would be avoided, thereby saving time and monetary resources. However, despite the many advances in information technology, several obstacles remain and relevant research on the topic of interest is likely to be missed due to time-consuming search and access processes. It remains a cumbersome task to find relevant studies and information on how they were designed, how the data were collected and how the data can be accessed, if possible at all.
The lack of overarching data management pipelines is particularly threatening when there is an urgent need for scientific information and knowledge exchange. A recent example was the COVID-19 pandemic, where a better understanding of the transmission dynamics of the virus as well as preventive and therapeutic options would have improved public health and medical decision-making. Since the outbreak in 2019, countless projects emerged to investigate SARS-CoV‑2 and its consequences on population health. This huge number of studies can hardly be tracked, making it impossible to obtain a reliable overview on COVID-19 studies. This was also the case in Germany, where no comprehensive overview of completed, ongoing or planned research activities bridging epidemiological, public health and clinical research had been available at the time of the pandemic outbreak.
Despite all efforts to make empirical studies and their data better findable, accessible, interoperable, and reusable according to the FAIR principles [1], many challenges still exist: (i) A central search hub for empirical health studies including a central access point is still missing. (ii) Although several metadata models had been suggested, no consensus model exists [2, 3]. (iii) Organizing the necessary study information in a searchable form and choosing an adequate search portal are further hurdles. The extent of these problems varies along the requirements for registering studies within different research areas. Whereas clinical trials need to be registered, for example, within the International Clinical Trials Registry Platform (ICTRP) [4] or the German Register of Clinical Trials (DRKS) [5], there is no such requirement for epidemiological and public health studies. (iv) Studies of interest need to be presented in a comprehensive manner, while organizing the complex information about potentially thousands of variables. Tools to do so have been put in place by different projects and networks, such as Maelstrom [6]. (v) Finally, if data become available, it would be necessary to assess their scientific quality before reusing them [7].
In the light of the COVID-19 pandemic and the urgent need to overcome these hurdles, the NFDI4Health Task Force COVID-19 was established to simplify workflows for scientists to find and access German COVID-19 research. To ensure the sustainability of results generated by the Task Force COVID-19, they have been taken up by the German consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health). Conversely, lessons learned from the Task Force helped to optimize the respective tasks of NFDI4Health.
This paper provides an overview on the building blocks for FAIR health research within the Task Force COVID-19 and how this initial work was subsequently expanded by the NFDI4Health to cover a wider range of studies in epidemiological, public health and clinical research.
2 Building Blocks to Make COVID-19 Data FAIR and Approaches for Upscaling
2.1 Joint Metadata Model
Sharing information on studies, their design, and details of the collected data is a precondition for FAIR science. Therefore, we generated an information model that integrates information from clinical trials, epidemiological and public health COVID-19 studies and their resources such as instruments, documents, data collections, and others [2]. To respond to the COVID-19 pandemic in due time, we followed a practical and agile hands-on approach rather than conducting an in-depth analysis and planning beforehand. The information model (see [8] for the latest version MDS V3.3) is partially based on existing standards, such as the DataCite metadata schema [9] for general and bibliographic information, the MIABIS model for observational research originally designed for biobanking metadata [10], or the Maelstrom taxonomy [6] for epidemiological studies, as well as on clinical trials registries, such as ClinicalTrials.gov or DRKS [5]. Without fully replicating those underlying metadata models, we combined some of their major metadata items (132 in MDS V3.3) and corresponding value sets (260 in MDS V3.3) with 88 further items and attributes. We also included 127 additional values defined and required directly by the researchers from the domain, e.g. to find studies with certain nutritional conditions or including a specific chronic disease. The model is designed in a modular way: in the core module, it covers metadata attributes describing the general type of content (e.g. study, dataset, data dictionary, study protocol, etc.), bibliographic information, contributors, identifiers and provenance information on the studies and their resources. In the resource design module, it covers design elements of studies (e.g., study type, study region, sample size, eligibility criteria, outcomes, interventions, population, health conditions and similar) and information on data sharing aspects, such as licenses and data access (see Fig. 1).
Additional modules cover specific metadata for nutritional studies, for studies on chronic diseases and for record linkage. The metadata items are classified by their data types as codable concepts (with their corresponding value set domains), backbone elements grouping related data items and concepts, text strings, integer numbers, Boolean variables and URLs. Cardinalities were defined to distinguish between mandatory and optional data items. Where required, conditional cardinalities were defined to represent dependencies between the items. Specific data items were included in the core and resource design modules that allow for connecting the different modules. These items store information on whether a certain module is required for a given dataset.
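The cardinality rules described above can be illustrated with a minimal Python sketch. The item names (`resource_type`, `nutritional_module_required`, `dietary_assessment`, etc.) are hypothetical and do not reproduce MDS V3.3; the sketch only shows how mandatory, optional and conditional cardinalities, including an item that signals whether an additional module is required, might be checked.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Item:
    """One metadata item with a minimum and maximum cardinality."""
    name: str
    min_card: int                    # 0 = optional, 1 = mandatory
    max_card: Optional[int] = 1      # None = unbounded
    # Conditional cardinality: item becomes mandatory when this returns True.
    required_if: Optional[Callable[[dict], bool]] = None


# Hypothetical items for illustration; not the real MDS V3.3 item names.
SCHEMA = [
    Item("resource_type", 1),
    Item("title", 1),
    Item("keywords", 0, None),
    Item("nutritional_module_required", 1),
    Item("dietary_assessment", 0, None,
         required_if=lambda r: r.get("nutritional_module_required") is True),
]


def as_list(value: Any) -> list:
    """Treat a missing value as empty and a scalar as a one-element list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def validate(record: dict) -> list[str]:
    """Return one message per cardinality violation (empty list = valid)."""
    errors = []
    for item in SCHEMA:
        count = len(as_list(record.get(item.name)))
        minimum = item.min_card
        if item.required_if is not None and item.required_if(record):
            minimum = max(minimum, 1)
        if count < minimum:
            errors.append(f"{item.name}: expected at least {minimum}, found {count}")
        if item.max_card is not None and count > item.max_card:
            errors.append(f"{item.name}: expected at most {item.max_card}, found {count}")
    return errors
```

In this sketch, a record that sets `nutritional_module_required` to `True` without providing `dietary_assessment` is reported as incomplete, whereas the same record with the flag set to `False` passes.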
Fig. 1 NFDI4Health Task Force COVID-19 and NFDI4Health Metadata Schema (current version 3.3, [8])
To enhance compatibility with existing data models, a complete mapping of the schema was conducted to reference models used by clinical trials registries [4, 5] and to the Clinical Research Metadata Repository of the European Clinical Research Infrastructure Network (ECRIN) [11]. Other models and standards relevant for COVID-19 studies that helped shape the content of the metadata model include the Health Level 7 Fast Healthcare Interoperability Resources (HL7® FHIR®) [12] and the Operational Data Model by the Clinical Data Interchange Standards Consortium (CDISC ODM) [13]. For the value sets of the data items we integrated concepts from domain-specific international terminologies, ontologies and classifications, such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), NCI Thesaurus (NCIt), Unified Medical Language System (UMLS), and International Statistical Classification of Diseases (ICD). To allow for upscaling, the current information model has been designed in a generic fashion for health studies in general, not solely for COVID-19 studies (see Sect. 3). Designed this way, the metadata model is able to capture information on a wide range of potential use scenarios. For example, it can represent and compare clinical as well as observational studies instead of being confined to one of these two areas. This offers the possibility to find studies of interest and gain an understanding of how to access their data. Another feature is the ability to represent documents and other study-related resources within the framework, for example to make standard operating procedures findable and accessible. In addition, metadata that describe the collected data in detail may be hosted. In conclusion, a main advantage of the MDS over other models lies in its broad scope.
2.2 The COVID-19 Central Health Study Hub
Subsequently, a web application, the so-called COVID-19 Central Health Study Hub, was designed to make the collected information on epidemiological, public health and clinical research browsable. It provides a ‘one stop shop’ for content related to German COVID-19 studies. The initial service was built on a software platform for data sharing, FAIRDOM SEEK [14], for storing and indexing metadata and materials, as well as for assigning DOIs. SEEK is a well-established life science data and model management platform developed by the FAIRDOM initiative and built as open-source software [14]. This allowed us to reuse many features and to adapt the system only where needed, e.g. by implementing the tailored metadata schema and a corresponding interface to allow users to capture the metadata. Background services were developed to automatically transfer metadata from existing data sources like the DRKS [5], ClinicalTrials.gov and the WHO ICTRP [4] into the Health Study Hub. An architectural sketch is depicted in Fig. 2.
Fig. 2 An architectural sketch of the main components of the German Central Health Study Hub. The component labelled ‘Application Programming Interface’ is an anti-corruption layer, i.e. a façade pattern, that combines and aggregates different subsystems and provides a RESTful API to a component labelled ‘User Interface’, which is a single page application that consumes the aforementioned API and creates an appealing user interface
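The façade idea in this architecture can be sketched in a few lines of Python. All class names, field names and the in-memory backends are hypothetical stand-ins for the actual subsystems, whose interfaces are not described here; the sketch only shows how a single entry point might aggregate and deduplicate results from several sources before handing them to a user interface.

```python
from typing import Iterable, Protocol


class StudySource(Protocol):
    """Anything that can be searched for study records."""
    def search(self, term: str) -> list[dict]: ...


class InMemorySource:
    """Hypothetical stand-in for one backend subsystem."""

    def __init__(self, name: str, records: list[dict]):
        self.name = name
        self._records = records

    def search(self, term: str) -> list[dict]:
        term = term.lower()
        # Tag each hit with its origin so the facade can report provenance.
        return [dict(r, source=self.name)
                for r in self._records if term in r["title"].lower()]


class StudyHubFacade:
    """Single entry point that hides the individual subsystems."""

    def __init__(self, sources: Iterable[StudySource]):
        self._sources = list(sources)

    def search(self, term: str) -> list[dict]:
        merged: dict[str, dict] = {}
        for source in self._sources:
            for record in source.search(term):
                # Deduplicate by identifier; the first source listed wins.
                merged.setdefault(record["id"], record)
        return sorted(merged.values(), key=lambda r: r["id"])
```

A client such as a single page application would only ever call `StudyHubFacade.search` and would not need to know how many backends exist or how their records differ.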
One main challenge was the creation of content for the COVID-19 Central Health Study Hub because interfacing with existing resources worked for clinical trials but not for epidemiological and public health studies. For these, a manual data collection was required. Members of our Task Force COVID-19 team contacted study representatives in person and/or extracted information from websites or preprint publications. In addition, a user interface for the COVID-19 Central Health Study Hub has been set up to allow registered study experts to manually share and/or review the detailed meta-information of their studies.
Furthermore, Application Programming Interfaces (APIs) connect the COVID-19 Central Health Study Hub to data holding organizations and other health infrastructures to integrate novel content. The feasibility of this approach has been demonstrated by cooperation with the COVID-19 Research Infrastructure Platform for Children and Adolescents (coverCHILD, [15]) of the Network University Medicine in Germany (NUM, [16]) where 293 child-related COVID-19 studies were added to the COVID-19 Central Health Study Hub.
The first version of the COVID-19 Central Health Study Hub was launched in 2020. It was subsequently expanded into the German Central Health Study Hub (GCHSH, [17]) (see Sect. 3).
2.3 Searching Instruments and Report Forms
Precisely knowing the collected variables is a precondition for deciding on their usefulness for individual or collaborative research. Such information is provided by variable labels, value labels, and semantic annotations based on formal terminologies, such as those mentioned above, e.g. SNOMED CT, NCIt, ICD. While labels are commonly available in data dictionaries, this is not the case for semantic annotations. Given the time constraints of the Task Force COVID-19, a lightweight approach for semantic annotation, the Maelstrom taxonomy [6], was adopted to enable general data discoverability. It distinguishes 13 domains and 138 subdomains of information and was applied to COVID-19 survey instruments and electronic case report forms. As the Maelstrom taxonomy is limited in complexity, we aimed to make the annotation more precise with further suitable terminologies: following a requirements analysis, we developed a concept for annotation based on the standard “ISO/TS 21564:2019 Health Informatics – Terminology resource map quality measures (MapQual)” with several annotators and an adjudication process. We used SNOMED CT, the most comprehensive international health terminology, to test its feasibility on smaller samples of COVID-19 questionnaires [18, 19]. These questionnaires were used in care or clinical, epidemiological and public health projects and mainly retrieved clinical information, administrative data and social determinants of health, concepts that can be found in SNOMED CT. On the one hand, the items could be annotated more precisely and used to code questions in FHIR Questionnaire instances. On the other hand, SNOMED CT is non-exhaustive and does not support annotations with categories such as “other” or “not specified”.
Annotations turned out to be time- and resource-intensive due to the iterative development of annotation rules, the necessity of trained annotators, and the need to combine several SNOMED CT concepts to fully represent the semantics of a question.
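A workflow with several annotators and an adjudication step can be sketched as follows. The agreement rule (a simple two-thirds majority), the function name and the placeholder codes are assumptions made for illustration; they are not taken from ISO/TS 21564:2019 or from the annotation rules developed in the project.

```python
from collections import Counter


def adjudicate(annotations: dict[str, list[str]],
               min_agreement: float = 2 / 3) -> tuple[dict[str, str], list[str]]:
    """Split items into accepted annotations and items needing adjudication.

    `annotations` maps a questionnaire item id to the codes proposed by the
    individual annotators (one code per annotator).
    """
    accepted: dict[str, str] = {}
    needs_review: list[str] = []
    for item_id, codes in annotations.items():
        if not codes:
            needs_review.append(item_id)      # nobody annotated this item
            continue
        code, votes = Counter(codes).most_common(1)[0]
        if votes / len(codes) >= min_agreement:
            accepted[item_id] = code          # sufficient agreement
        else:
            needs_review.append(item_id)      # hand over to an adjudicator
    return accepted, needs_review


# Placeholder codes "A", "B", "C" stand in for terminology concepts.
accepted, needs_review = adjudicate({
    "q1": ["A", "A", "B"],
    "q2": ["A", "B", "C"],
})
```

Items that end up in `needs_review` are exactly those for which trained adjudicators would be needed, which is one reason why the process is resource-intensive.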
Eventually, we integrated the Modular Integrated Concept Architecture (MICA) component of the Open Source Software for Epidemiology (OBIBA, [20]) in the COVID-19 Central Health Study Hub to store all instruments semantically annotated with the Maelstrom taxonomy [6]. We thus make the instruments available for online searches using the MICA metadata catalogue and the data discovery tool. Choices of other terminologies and ontologies, such as those represented in UMLS, require further evaluation to empirically guide the semantic representation of content and to align well with the requirements of the target community.
2.4 Assessing and Managing Data Quality
During the COVID-19 pandemic, it was extremely important to assess the quality of CT images of COVID-19 patients. For this purpose, we have developed and made publicly available an image analysis service that quantitatively evaluates CT images of these patients. In particular, the visibly affected lung regions are detected automatically by a specifically trained deep learning module and the affected tissue volume is calculated. This service enables researchers to uniformly evaluate DICOM image data from different sources and conduct multi-center studies. The service was implemented in collaboration with grand-challenge.org, an open platform for comparative medical image analysis science with over 70,000 registered users and about 2000 submissions per month, where access to our service can be requested [21]. We see such quality assurance modules as valuable components of future health data research infrastructures. Multiple initiatives exist in Europe and elsewhere to address unsolved FAIRification challenges, such as common metadata models or standardized meta-tool repositories [22]. Future extensions of our work could include the provision of algorithms for quality assurance as a Docker container to be downloaded and used offline. Results of such a component could include tabular output covering all quality parameters for a larger set of images or, for example, a DICOM-structured report file for each of the individual DICOM image series.
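The volume calculation itself is simple once a segmentation exists. The following sketch assumes that the detection step has already produced a binary mask (1 = affected voxel) and that the voxel spacing is known from the image header. It does not reproduce the deep learning module or the interface of the published service; the function names are hypothetical.

```python
def affected_volume_ml(mask, spacing_mm):
    """Volume of all voxels marked 1 in a binary mask, in millilitres.

    mask:       nested lists indexed as [slice][row][column] with values 0/1
    spacing_mm: voxel size as (slice thickness, row spacing, column spacing)
    """
    dz, dy, dx = spacing_mm
    voxel_mm3 = dz * dy * dx
    n_voxels = sum(v for image in mask for row in image for v in row)
    return n_voxels * voxel_mm3 / 1000.0   # 1 ml = 1000 mm^3


def affected_fraction(lesion_mask, lung_mask):
    """Share of lung voxels that are also marked as affected."""
    lung = sum(v for image in lung_mask for row in image for v in row)
    both = sum(a * b
               for img_a, img_b in zip(lesion_mask, lung_mask)
               for row_a, row_b in zip(img_a, img_b)
               for a, b in zip(row_a, row_b))
    return both / lung if lung else 0.0
```

Because the result depends only on the mask and the voxel spacing, the same calculation can be applied uniformly to images from different scanners and sites.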
Although of high relevance during the pandemic, assessing the quality of data is not limited to images. Regardless of the specific data source, users need to obtain structured information about data quality to take adequate decisions about further data processing and subsequent statistical analyses. To support this process, data quality reports should be generated based on machine-readable descriptions of expectations and requirements for the study data [23]. Such metadata might include information on range limits, contradictions, as well as information about examiners or devices. Only a few tools are currently capable of conducting such analyses in an automated fashion [24]. We were able to assess the quality of COVID-19 study data by successfully applying the software package dataquieR (Data Quality in Epidemiological Research for the programming language R). This software package [25] is openly available on the Comprehensive R Archive Network (CRAN). It assesses core data quality dimensions such as data integrity, completeness, consistency, and accuracy in a standardized manner, as outlined in a data quality framework for observational studies [7]. The easy-to-use tool produces a wide range of tabular and graphical outputs that can be accessed through a browser.
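The principle of driving data quality checks from machine-readable expectations can be sketched in Python. This is not the dataquieR interface, which is an R package with its own metadata format; the variable names and limits below are hypothetical and cover only two simple checks, missing values and range violations.

```python
import math

# Hypothetical expectations per variable; not dataquieR's metadata format.
EXPECTATIONS = {
    "age_years": {"min": 0, "max": 120},
    "body_temp_c": {"min": 30.0, "max": 45.0},
}


def is_missing(value) -> bool:
    """None and NaN both count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_report(rows: list[dict], expectations: dict = EXPECTATIONS) -> dict:
    """Count missing and out-of-range values for each described variable."""
    report = {}
    for var, limits in expectations.items():
        values = [row.get(var) for row in rows]
        observed = [v for v in values if not is_missing(v)]
        out_of_range = [v for v in observed
                        if not limits["min"] <= v <= limits["max"]]
        report[var] = {
            "n": len(values),
            "missing": len(values) - len(observed),
            "out_of_range": len(out_of_range),
        }
    return report
```

Keeping the expectations in a separate data structure, rather than in the checking code, is what allows the same report to be generated for any study that supplies such metadata.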
2.5 Learning About FAIRification of Study Data
During the COVID-19 pandemic, face-to-face training on research data management was not possible or very limited, while at the same time there was an increased demand for training due to the urgent need for high-quality data from clinical trials and epidemiological studies. For instance, many observational studies were initiated in Germany, especially in the early days of the pandemic, and these often involved medical staff without prior Good Epidemiological Practice (GEP) and Good Clinical Practice (GCP) training. To ensure valid and reliable results from these studies, there was an urgent requirement for training specifically on the quality requirements for observational studies and GEP. For randomized controlled clinical trials (RCT), there was likewise an unmet need for online training courses.
Therefore, online training courses on the ethical and methodological background of observational studies were developed according to GEP requirements. In five modules of 30 to 45 min each, training is provided on methodological and ethical backgrounds, design and conduct of epidemiological studies, data and quality management as well as data protection and publication of results. In conjunction with the training courses for GCP-compliant clinical trials developed and offered by the Coordinating Centers for Clinical Trials (KKS) and the KKS Network, specific online courses are available for observational or epidemiological studies as well as for interventional studies. Participants can attend the various modules according to their individual time resources.
2.6 Record Linkage
During the pandemic, it became undeniably clear that Germany faces major challenges in linking personal health data. German data protection laws hinder the establishment of a unique identifier that would facilitate linking individual health data from different sources. For instance, it was not possible to link COVID-19 vaccination data with health insurance data to learn about potential adverse reactions [26]. In addition, urgently needed follow-up studies on potential long-term consequences of a COVID-19 infection are hampered by the lack of applicable concepts for record linkage. To address this major challenge, various actions were taken by the Task Force COVID-19 and subsequently continued in the NFDI4Health: First, a template for a consent form and a guidance document on the corresponding legislation have been provided by the Task Force COVID-19 to facilitate record linkage of individual health data [27]. Second, the Task Force COVID-19 actively sought opportunities to discuss the need for record linkage of health data with policy makers. It initiated a white paper on this topic by bringing together a broad range of health researchers and the community [28]. This activity was then taken up by the NFDI4Health (see Sect. 3).
3 Approaches for Upscaling
NFDI4Health built on the lessons learned and extended the activities and efforts of the Task Force COVID-19. In the following, we will briefly describe some of these advancements.
Metadata Schema
Based on the metadata model from the Task Force COVID-19 (see Sect. 2.1), a multidisciplinary group of experts within NFDI4Health further developed the metadata model and extended it to other domains in health research, aiming at a generic schema to describe metadata for any health-related study. Collaborations and joint workshops between stakeholders from different health disciplines, IT experts and data modelers led to the addition of further data items and value sets, as well as to agreed definitions that meet the target communities’ needs. Taking particular requirements from the NFDI4Health community into account, domain-specific metadata modules, e.g., for nutritional epidemiology, for chronic diseases, or for record linkage, were developed and added to the schema (see Fig. 1). Further iterative refinements of the metadata model are based on implemented feedback mechanisms, both internal and external.
In addition, the FAIRness of the metadata schema has been improved by mapping its elements and values to international semantic (SNOMED CT, ICD, NCIt, etc.) and syntactic (FHIR) standards after an analysis of different standards used in health and life sciences. FHIR is the most used information technology standard to store and exchange a wide range of data in health-related contexts. The research and public health domains are increasingly addressed as shown by the different areas of application [29], the current use in other research initiatives, and major changes of the FHIR resource ‘ResearchStudy’ in the last and upcoming releases to encompass the requirements of the different health research communities. To render the metadata model, which was initially designed in a spreadsheet file, findable and accessible, it was modeled in ART DÉCOR [30], an open source collaborative platform to create, describe, manage and deploy health data models. Additionally, we created logical models, profiles and value sets along with an implementation guide to represent the metadata schema in FHIR and published these on SIMPLIFIER [31], a widely used publishing platform and registry of FHIR projects. Both tools enable users to browse through the metadata model online, and thus allow for a user-friendly representation and clear visualization of the metadata schema. The further FAIRification process of the metadata schema is assessed on the basis of the criteria proposed by the Research Data Alliance (RDA) [32]. An important future development will be the elaboration of interfaces between different standards to facilitate the exchange of information and to not exclude potential users.
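How a study record might be expressed as a FHIR resource can be sketched as follows. The element names used (`resourceType`, `identifier`, `title`, `status`, `condition`) are core elements of the FHIR R4 `ResearchStudy` resource, and `http://snomed.info/sct` is the standard system URI for SNOMED CT. Everything else is an assumption: the input record layout, the identifier system URL and the helper name are hypothetical, and the sketch does not reproduce the NFDI4Health profiles or implementation guide.

```python
import json

SNOMED_SYSTEM = "http://snomed.info/sct"


def to_research_study(record: dict) -> dict:
    """Map a simple study record to a minimal FHIR R4 ResearchStudy dict."""
    return {
        "resourceType": "ResearchStudy",
        "identifier": [{"system": record["id_system"], "value": record["id"]}],
        "title": record["title"],
        "status": record["status"],
        # Each health condition becomes a CodeableConcept with one coding.
        "condition": [
            {"coding": [{"system": SNOMED_SYSTEM,
                         "code": c["code"],
                         "display": c["display"]}]}
            for c in record["conditions"]
        ],
    }


# Hypothetical input record; only the SNOMED CT code is a real concept id.
example = {
    "id": "study-0001",
    "id_system": "https://example.org/study-ids",
    "title": "Example COVID-19 cohort study",
    "status": "completed",
    "conditions": [{"code": "840539006", "display": "COVID-19"}],
}

resource_json = json.dumps(to_research_study(example), indent=2)
```

Serializing to JSON, as in the last line, yields a document that FHIR-aware tools can exchange, which is the practical benefit of mapping the schema to a syntactic standard.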
German Central Health Study Hub
Since its launch, the COVID-19 Central Health Study Hub has been continuously improved and expanded based on user feedback. Various measures were taken to upscale the Health Study Hub to serve the needs of the health research community (see Fig. 2). For instance, SEEK was replaced by Dataverse [33] for (meta-)data management and standard publication workflows. To provide an interconnected platform, the user front-end now combines the Dataverse software and a tool for browsing instruments (see Sect. 2.3). To date, 1926 studies, instruments and other documents have been made available on the Health Study Hub. International connectivity is pursued by linking our national services to international platforms (e.g., the European Health Information Portal [34]).
Structured Search Tools
Given the viability of the Maelstrom approach for semantic annotation, we initiated a close collaboration with the Maelstrom Research Group to annotate further major studies. This approach allows a structured search for important instruments of major health studies, such as the Study of Health in Pomerania (SHIP) [35] and the IDEFICS/I.Family cohort [36], that are now accessible on the Health Study Hub. To support multi-hierarchical ontologies such as SNOMED CT and fully display these annotations, further developments of the underlying software components OPAL and MICA will be necessary.
Quality Assessment
The R software package dataquieR was further elaborated, mainly with regard to the scope of metadata used to control automated data quality assessments. For this purpose, we systematically assessed the potential of attributes in existing standards and data models to hold information relevant for controlling data quality analyses. We targeted CDISC Define-XML [37], HL7® FHIR® [11], REDCap [38, 39], the OMOP Common Data Model (OMOP, OHDSI) [40] and OpenClinica [41]. This assessment illustrated that these standards are rather basic in their capability to hold attributes of relevance to data quality assessments. The high interest of our target community in dataquieR is reflected in more than 15,000 downloads since 2020.
Training
NFDI4Health worked on general training courses for research data management and data science such as Data Train and hands-on virtual and on-site training sessions, e.g., for data stewards. For more details we refer to the “Handbook: Training concepts in research data management and data science with focus on health research” which has been compiled by NFDI4Health [42].
Record Linkage
Under the leadership of NFDI4Health, several health researchers and the user community, including cancer registries, the Medical Informatics Initiative and the Network University Medicine, jointly worked on a white paper [28] to (i) demonstrate the current status of record linkage of personal health data for research purposes in Germany by ten use cases from different health research areas, (ii) obtain a comprehensive overview of approaches for linking different data sources and (iii) provide recommendations to improve possibilities for record linkage in health research in Germany. Recommendations include the introduction of a unique identifier and building a decentral infrastructure with central components such as an approval agency and a third-party trust center (see Fig. 3 for a schematic illustration), which would overcome the challenges identified in the record linkage use cases. The white paper particularly targets decision makers to stress the urgent need for better regulation of record linkage of personal health data and for the necessary adaptation of German legislation. A first step in this direction is taken by the planned German Law on Health Data Use (Gesundheitsdatennutzungsgesetz [43]).
Fig. 3 Simplified example of a possible request and data flow for record linkage of health data from two decentral data holding organizations (DHO) including the central components: central access point (CAP), approval agency (AA) and third-party trust center (TTPC) (ID: encrypted unique identifier). The CAP acts as single point of contact for researchers and forwards the record linkage request to a use and access committee of an AA. Once the request is approved, the CAP forwards it to a TTPC and the respective DHOs, who perform the record linkage to provide analysis datasets to the researcher in an analysis center
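Linking records from two data holding organizations via an encrypted identifier can be sketched in Python. The choice of a keyed hash (HMAC-SHA256) with a secret held by the trust center is an assumption made for illustration; the white paper does not prescribe a specific technique, and the function and field names are hypothetical.

```python
import hashlib
import hmac


def pseudonym(identifier: str, secret_key: bytes) -> str:
    """Derive an encrypted ID from a unique identifier with a keyed hash."""
    normalized = identifier.strip().upper()
    return hmac.new(secret_key, normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def link(records_a: list[dict], records_b: list[dict],
         secret_key: bytes) -> list[dict]:
    """Join two record sets on the pseudonym; the raw identifier is dropped."""
    index_b = {pseudonym(r["id"], secret_key): r for r in records_b}
    linked = []
    for rec in records_a:
        pid = pseudonym(rec["id"], secret_key)
        if pid in index_b:
            merged = {k: v for k, v in rec.items() if k != "id"}
            merged.update({k: v for k, v in index_b[pid].items() if k != "id"})
            merged["pid"] = pid
            linked.append(merged)
    return linked
```

In this sketch the analysis dataset contains only the pseudonym `pid` and the study variables, so a researcher receiving it would not see the original identifier.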
4 Discussion and Conclusion
The COVID-19 pandemic has been sudden and unforeseen, posing enormous challenges to individuals, society, medical care and the scientific communities. Therefore, the Task Force COVID-19 was established in a very short time. It contributed to FAIR epidemiological, public health and clinical research with several activities from the earliest days of the pandemic. It created a core of data models, tools, infrastructure, and services that have been taken up and upscaled by the NFDI4Health to become widely applicable within our target communities. In addition, the Task Force COVID-19 developed models in close collaboration with other German research groups to predict the evolution of the pandemic situation. These models have been used to describe the current situation and demonstrate the potential impact of political actions and human behavior.
Our developments are expected to serve as an important source of information in the post-pandemic era. The COVID-19 Central Health Study Hub was set up in a timely manner to manage the large number of COVID-19-related studies and instruments used therein. The study hub may serve as a directory to easily and quickly retrieve the necessary study information to promote more harmonized research activities. The metadata model stimulates discussions on how to elaborate and store information on studies in the targeted fields of research, thereby creating an opportunity for a standard to describe research. The model is currently being discussed with other life science communities, such as genomics (The German Human Genome-Phenome Archive, GHGA), or social sciences (KonsortSWD), but also with different initiatives that build study repositories (e.g. the European Clinical Research Infrastructure Network, ECRIN) to facilitate the joint use of information. Mapping metadata elements to a common and widely accepted standard, namely FHIR, may prove to be a key asset for interoperability with other platforms and for the long-term sustainability of the German Central Health Study Hub. While different standards have been considered for the European Health Data Space (EHDS), e.g., in studies requested by committees of the European Parliament, none has specifically been chosen yet. However, FHIR might play an important role in the EHDS, especially for the exchange of electronic health records (EHR). As EHR are also increasingly used for the recruitment of study participants or as a secondary data source, building an ecosystem based on a common standard would help to bridge clinical care and research. In addition, to accommodate a wide range of potential users, we need to take other models such as OHDSI’s OMOP into account.
A better understanding of the legal basis for data use and linkage may help scientists to carry out such activities successfully. The consent form could serve as a template for further studies. The results of the project’s activities regarding data quality assessments may substantially foster reproducible science by increasing the scope, comparability and transparency of these assessments. NFDI4Health has also started activities to raise awareness of the potential of FAIR research in its community through numerous workshops, symposia and talks at community-specific conferences.
Despite all efforts, a number of hurdles had to be overcome. One of these is content creation. Extracting metadata from literature or web pages is a laborious process that does not scale. Qualified staff are necessary to ensure high-quality entries in a central search hub, especially if semantic standards are to be adhered to in a consistent manner. The willingness to upload material cannot be taken for granted and strongly depends on efficient processes and sufficient support. It may be difficult to get started, as the benefits will only become apparent in the medium term, when the amount of information is sufficient to serve the community on a regular basis. Challenges also arise, for example, when material is only available in German, but English is desired for international cooperation. Also, options for licensing and DOI allocation are hardly known in the target communities, so awareness of them needs to be raised through training. Substantial difficulties may arise in agreeing on common standards, as the target communities do not necessarily share the same scientific language. Such consensus procedures have proven to be lengthy and may not always lead to a consensus. While the manual curation of content may be unavoidable if metadata need to be set up for the first time for a study or related resources, we will emphasize cooperation with services that have curated vast amounts of medical metadata at the study or at the item level. Examples are the Portal for Medical Data Models (MDM) [44], the Maelstrom Research Catalogue [6], or the euCanSHare catalogue [45]. This can be realized through interfaces between the data models used to store metadata, as has previously been showcased by the integration of heterogeneous COVID-19 resources in a knowledge graph [46]. In particular, for the MDM an interface to CDISC ODM would be necessary.
An incentive for the participating partners could be to make curated metadata mutually available, as each portal has unique metadata features, with varying metadata elements, that are relevant for some target community.
Based on the lessons learned, Table 1 summarizes recommendations to support future network activities on FAIR research.
In addition to the above recommendations, there is a need for cultural and political change. Scientific achievements should not only be measured based on scientific results published in peer-reviewed journals but also, for example, by shared and successfully used datasets, services provided to the community, transfer of results to the public or politics, and training of early career researchers. In particular, the sharing of data and tools should contribute to the scientific reputation of scientists. Awareness will be strengthened by the German Law on Health Data Use [43] and the European Health Data Space [47].
In conclusion, the recommendations summarized in Table 1, together with raising awareness of the FAIR principles and striving for cultural change, may help to foster further collaboration among the various stakeholders and actors, so that data sharing eventually becomes a matter of course under one National Research Data Infrastructure in Germany.
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analyzed in the study.
Abbreviations
- AA: Approval Agency
- API: Application Programming Interfaces
- CAP: Central Access Point
- CDISC ODM: Operational Data Model by the Clinical Data Interchange Standards Consortium
- coverCHILD: COVID-19 Research Infrastructure Platform for Children and Adolescents
- CRAN: Comprehensive R Archive Network
- dataquieR: Data Quality in Epidemiological Research
- DFG: German Research Foundation
- DHO: Data Holding Organizations
- DICOM: Digital Imaging and Communications in Medicine
- DOI: Digital Object Identifier
- DRKS: German Register of Clinical Trials
- ECRIN: European Clinical Research Infrastructure Network
- EHDS: European Health Data Space
- EHR: Electronic Health Records
- FAIR: Findable, Accessible, Interoperable, Reusable
- FHIR: Fast Healthcare Interoperability Resources
- GCHSH: German Central Health Study Hub
- GCP: Good Clinical Practice
- GEP: Good Epidemiological Practice
- GHGA: German Human Genome-Phenome Archive
- ICD: International Statistical Classification of Diseases
- ICTRP: International Clinical Trials Registry Platform
- KKS: Coordinating Centers for Clinical Trials
- KonsortSWD: Konsortium für die Sozial‑, Verhaltens‑, Bildungs- und Wirtschaftswissenschaften
- MapQual: Terminology resource map quality measures
- MDM: Medical Data Models
- MIABIS: Minimum Information About Biobank data Sharing initiative
- MICA: Modular Integrated Concept Architecture
- NCIt: NCI Thesaurus
- NFDI4Health: Nationale Forschungsdateninfrastruktur für personenbezogene Gesundheitsdaten
- NUM: Network University Medicine
- OMOP, OHDSI: Observational Medical Outcomes Partnership, Observational Health Data Sciences and Informatics
- RCT: Controlled Clinical Trials
- RDM: Research Data Management
- SHIP: Study of Health in Pomerania
- SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms
- TTPC: Third-Party Trust Center
- UMLS: Unified Medical Language System
- WHO: World Health Organization
References
Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J‑W, da Silva SLB, Bourne PE (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3(1):1–9
Schmidt CO, Darms J, Shutsko A, Lobe M, Nagrani R, Seifert B, Lindstadt B, Golebiewski M, Koleva S, Bender T, Bauer CR, Sax U, Hu X, Lieser M, Junker V, Klopfenstein S, Zeleke A, Waltemath D, Pigeot I, Fluck J, NFDI4Health Task Force COVID-19 (2021) Facilitating Study and Item Level Browsing for Clinical and Epidemiological COVID-19 Studies. Stud Health Technol Inform 281:794–798. https://doi.org/10.3233/SHTI210284
Sass J, Bartschke A, Lehne M, Essenwanger A, Rinaldi E, Rudolph S, Heitmann KU, Vehreschild JJ, von Kalle C, Thun S (2020) The German Corona Consensus Dataset (GECCO): a standardized dataset for COVID-19 research in university medicine and beyond. BMC Med Inform Decis Mak 20(1):341. https://doi.org/10.1186/s12911-020-01374-w
World Health Organization (WHO) (2024) International Clinical Trials Registry Platform (ICTRP)/ICTRP search portal. https://www.who.int/clinical-trials-registry-platform/the-ictrp-search-portal. Accessed 2024-01-31.
Deutsches Register Klinischer Studien (DRKS) (2024). German Clinical Trials Register. https://www.bfarm.de/EN/BfArM/Tasks/German-Clinical-Trials-Register/_node.html. Accessed 2024-01-31.
Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I (2018) Fostering population-based cohort data discovery: The Maelstrom Research cataloguing toolkit. PLoS ONE 13(7):e0200926. https://doi.org/10.1371/journal.pone.0200926
Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, Huebner M, Schmidt B, Sauerbrei W, Richter A (2021) Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol 21(1):63. https://doi.org/10.1186/s12874-021-01252-7
Abaza A, Shutsko A, Golebiewski M, Klopfenstein S, Schmidt CO, Vorisek C, Brünings-Kuppe C, Clemens V, Darms J, Hanß S, Intemann T, Jannasch F, Kasbohm E, Lindstädt B, Löbe M, Orban E, Perrar I, Peters M, Sax U, Schulze M, Schupp C, Schwarz F, Schwedhelm C, Strathmann S, Waltemath D, Wünsche H, Zeleke AA, Atinkut A (2023) Metadata schema of the NFDI4Health and the NFDI4Health Task Force COVID-19 (V3_3). Fachrepositorium Lebenswissenschaften. https://doi.org/10.4126/FRL01-006453422. Accessed 2023-1-31.
DataCite Metadata Working Group (2019). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.3. DataCite e. V. . https://schema.datacite.org/meta/kernel-4.3/. Accessed 2024-01-31.
Merino-Martinez R, Norlin L, van Enckevort D, Anton G, Schuffenhauer S, Silander K, Mook L, Holub P, Bild R, Swertz M, Litton JE (2016) Toward Global Biobank Integration by Implementation of the Minimum Information About BIobank Data Sharing (MIABIS 2.0 Core). Biopreserv Biobank 14(4):298–306. https://doi.org/10.1089/bio.2015.0070
European Clinical Research Infrastructure Network (ECRIN) (n. d.). Clinical Research Metadata Repository. https://ecrin.org/clinical-research-metadata-repository. Accessed 2024-02-14.
HL7 International (2023). HL7® FHIR®, Release 5. https://hl7.org/FHIR/consent.html. Accessed 2024-01-31.
Matsuzaki K, Kitayama M, Yamamoto K, Aida R, Imai T, Ishida M, Katafuchi R, Kawamura T, Yokoo T, Narita I, Suzuki Y (2023) A Pragmatic Method to Integrate Data From Preexisting Cohort Studies Using the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model: Case Study. JMIR Med Inform 11:e46725. https://doi.org/10.2196/46725
Wolstencroft K, Owen S, Krebs O, Nguyen Q, Stanford NJ, Golebiewski M, Weidemann A, Bittkowski M, An L, Shockley D (2015) SEEK: a systems biology data and model management platform. BMC Syst Biol 9(1):1–12
coverCHILD (2024). COVID-19 Forschungsplattform für Kinder und Jugendliche. https://coverchild.de/. Accessed 2024-01-31.
Network of University Medicine in Germany (2024). Über uns. https://www.netzwerk-universitaetsmedizin.de/. Accessed 2024-01-31.
NFDI4Health (2023). German Central Health Study Hub. https://csh.nfdi4health.de/. Accessed 2023-12-01.
Vorisek CN, Essenwanger EA, Klopfenstein SAI, Sass J, Henke J, Schmidt CO, Thun S, NFDI4Health Task Force COVID-19 (2022) Implementing SNOMED CT in Open Software Solutions to Enhance the Findability of COVID-19 Questionnaires. Stud Health Technol Inform 294:649–653. https://doi.org/10.3233/SHTI220549
Vorisek CN, Klopfenstein SAI, Sass J, Lehne M, Schmidt CO, Thun S (2021) Evaluating Suitability of SNOMED CT in Structured Searches for COVID-19 Studies. Stud Health Technol Inform 281:88–92. https://doi.org/10.3233/SHTI210126
Doiron D, Marcon Y, Fortier I, Burton P, Ferretti V (2017) Software Application Profile: Opal and Mica: open-source software solutions for epidemiological data management, harmonization and dissemination. Int J Epidemiol 46(5):1372–1378
Lassen-Schmidt B, Köhn A, Link F, Thiemann MT, Hoinkiss D, Hirsch J, Hahn HK (2023) Demonstrator of SATORI Lung Analysis with integrated image quality analysis. Grand-Challenge.org. https://grand-challenge.org/reader-studies/satori-nfdi4health-test/. Accessed 2024-02-08.
Kondylakis H, Kalokyri V, Sfakianakis S, Marias K, Tsiknakis M, Jimenez-Pastor A, Camacho-Ramos E, Blanquer I, Segrelles JD, López-Huguet S, Barelle C, Kogut-Czarkowska M, Tsakou G, Siopis N, Sakellariou Z, Bizopoulos P, Drossou V, Lalas A, Votis K, Mallol P, Marti-Bonmati L, Alberich LC, Seymour K, Boucher S, Ciarrocchi E, Fromont L, Rambla J, Harms A, Gutierrez A, Starmans MPA, Prior F, Gelpi JL, Lekadir K (2023) Data infrastructures for AI in medical imaging: a report on the experiences of five EU projects. Eur Radiol Exp 7(1):20. https://doi.org/10.1186/s41747-023-00336-x
Richter A, Schössow J, Werner A, Schauer B, Radke D, Henke J, Struckmann S, Schmidt CO (2019) Data quality monitoring in clinical and observational epidemiologic studies: the role of metadata and process information. MIBE. https://doi.org/10.3205/mibe000202
(2022) Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Applied Sciences. https://doi.org/10.3390/app12094238
Richter A, Schmidt CO, Krüger M, Struckmann S (2021) dataquieR: assessment of data quality in epidemiological research. JOSS 6(61):3093. https://doi.org/10.21105/joss.03093
DGEpi & GMDS (2020). Stellungnahme der Deutschen Gesellschaft für Epidemiologie (DGEpi) und der Deutschen Gesellschaft für Medizinische Informatik Biometrie und Epidemiologie(GMDS) zum Beschlussentwurf der STIKO für die Empfehlung der COVID-19-Impfung und für die dazugehörige wissenschaftliche Begründung. https://www.dgepi.de/assets/Stellungnahme_STIKO-2020-12-10_final.pdf. Accessed 2024-02-14.
Intemann T, Lettieri V, Kipker D‑K, Kuntz A, Ahrens W, Pigeot I, Buchner B, Sax U (2023) Informed Consent zum Record Linkage: Best Practice und Mustertexte. https://doi.org/10.4126/FRL01-006399943
Intemann T, Kaulke K, Kipker DK, Lettieri V, Stallmann C, Schmidt CO, Geidel L, Bialke M, Hampf C, Stahl D, Lablans M, Rohde F, Franke M, Kraywinkel K, Kieschke J, Bartholomäus S, Näher AF, Tremper G, Lambarki M, March S, Prasser F, Haber AC, Drepper J, Schlünder I, Kirsten T, Pigeot I, Sax U, Buchner B, Ahrens W, Semler SC (2023) White Paper: Verbesserung des Record Linkage für die Gesundheitsforschung in Deutschland. https://doi.org/10.4126/FRL01-006461895
Vorisek CN, Lehne M, Klopfenstein SAI, Mayer PJ, Bartschke A, Haese T, Thun S (2022) Fast Healthcare Interoperability Resources (FHIR) for Interoperability in Health Research: Systematic Review. JMIR Med Inform 10(7):e35724. https://doi.org/10.2196/35724
NFDI4Health (2024). NFDI4Health Metadata Schema. https://nfdi4health.art-decor.pub/. Accessed 2024-03-26.
Klopfenstein SAI, Vorisek CN, Saß J, Hölter TA, Thun S, NFDI4Health Metadata Schema Group (2024). Implementation Guide of the NFDI4Health Metadata Schema (MDS) Version 3.3. https://simplifier.net/guide/nfdi4health—metadata-schema—implementationguide?version=current. Accessed 2024-03-26.
FAIR Data Maturity Model Working Group (2020) FAIR Data Maturity Model: specification and guidelines. Zenodo https://doi.org/10.15497/rda00050
Crosas M (2011) The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data. D‑lib Mag 17(1):2
European Health Data Space (EHDS) (2024). European Health Information Portal. https://www.healthinformationportal.eu/. Accessed 2024-01-31.
Volzke H, Schossow J, Schmidt CO, Jurgens C, Richter A, Werner A, Werner N, Radke D, Teumer A, Ittermann T, Schauer B, Henck V, Friedrich N, Hannemann A, Winter T, Nauck M, Dorr M, Bahls M, Felix SB, Stubbe B, Ewert R, Frost F, Lerch MM, Grabe HJ, Bulow R, Otto M, Hosten N, Rathmann W, Schminke U, Grossjohann R, Tost F, Homuth G, Volker U, Weiss S, Holtfreter S, Broker BM, Zimmermann K, Kaderali L, Winnefeld M, Kristof B, Berger K, Samietz S, Schwahn C, Holtfreter B, Biffar R, Kindler S, Wittfeld K, Hoffmann W, Kocher T (2022) Cohort Profile Update: The Study of Health in Pomerania (SHIP). Int J Epidemiol 51(6):e372–e83. https://doi.org/10.1093/ije/dyac034
Ahrens W, Siani A, Adan R, De Henauw S, Eiben G, Gwozdz W, Hebestreit A, Hunsberger M, Kaprio J, Krogh V (2017) Cohort Profile: The transition from childhood to adolescence in European children – how I.Family extends the IDEFICS cohort. Int J Epidemiol 46(5):1394–1395j
cdisc (n. d.). Define-XML. https://www.cdisc.org/standards/data-exchange/define-xml. Accessed 2024-02-08.
Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG (2009) A metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform 42(2):377–381
Harris PA, Taylor R, Minor BL, Elliott V, Fernandez M, O’Neal L, McLeod L, Delacqua G, Delacqua F, Kirby J (2019) The REDCap consortium: building an international community of software platform partners. J Biomed Inform 95:103208
Observational Health Data Sciences and Informatics (OHDSI) (n. d.). Standardized Data: The OMOP Common Data Model. https://www.ohdsi.org/data-standardization/. Accessed 2024-02-08.
OpenClinica (n. d.). Driving the future of clinical trials. https://www.openclinica.com/. Accessed 2024-02-08.
Dierkes J, Fürst J, Hörner T, Klammt S, Lindstädt B, Pigeot I, Restel K, Schmidt CO, Waltemath D, Zeleke A (2023) Training concepts in research data management and data science with the focus on health research. Zenodo https://doi.org/10.4126/FRL01-006441348
Bundesministerium für Gesundheit (2024) Gesundheitsdatennutzungsgesetz (GDNG). https://www.bundesgesundheitsministerium.de/ministerium/gesetze-und-verordnungen/guv-20-lp/gesundheitsdatennutzungsgesetz. Accessed 2024-02-08.
Ganzinger M, Blumenstock M, Niklas C, Dugas M (2023) Portal of Medical Data Models: Application in Federated Data Capture. Stud Health Technol Inform 302:137–138. https://doi.org/10.3233/SHTI230084
Reinikainen J, Palosaari T, Canosa-Valls AJ et al (2024) Cohort Profile: The Cardiovascular Research Data Catalogue. Int J Epidemiol. https://doi.org/10.1093/ije/dyad175
Gütebier L, Bleimehl T, Henkel R, Munro J, Müller S, Morgner A, Laenge J, Pachauer A, Erdl A, Weimar J, Langendorf WK, Vialard V, Liebig T, Preusse M, Waltemath D, Jarasch A (2022) CovidGraph: a graph to fight COVID-19. Bioinformatics 38(20):4843–4845. https://doi.org/10.1093/bioinformatics/btac592
Bundesministerium für Gesundheit (2024) Europäischer Gesundheitsdatenraum (EHDS). https://www.bundesgesundheitsministerium.de/themen/internationale-gesundheitspolitik/europa/europaeische-gesundheitspolitik/ehds. Accessed 2024-02-19.
Acknowledgements
This work was done as part of the NFDI4Health Task Force COVID-19 (www.nfdi4health.de/task-force-covid-19-2) and NFDI4Health. Support with improving the metadata schema and designing Fig. 1 was kindly provided by Haitham Abaza (HITS). We gratefully acknowledge the financial support of the German Research Foundation (DFG). Project number NFDI4Health Task Force COVID-19: 451265285. DFG reference numbers: PI 345/17‑1 | SCHM 2744/9‑1 | FL 398/2‑1 | GR 4686/3‑1 | HA 5611/9‑1 | KL 3409/1‑1 | LO 342/17‑1 | MU 3099/5‑1 |SA 1009/8‑1 | TH 2503/1‑1; Project number NFDI4Health: 442326535. Additional support by the Klaus Tschira Foundation for HITS is also gratefully acknowledged.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
Iris Pigeot and Carsten Oliver Schmidt were responsible for the overall conception and design and wrote the first draft of the manuscript. All authors made substantial contributions to the work, revised it critically for important intellectual content, approved the version to be published, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. Thanks to all additional NFDI4Health researchers who contributed to this work.
Ethics declarations
Conflict of interest
I. Pigeot, W. Ahrens, J. Darms, J. Fluck, M. Golebiewski, H.K. Hahn, X. Hu, T. Intemann, E. Kasbohm, T. Kirsten, S. Klammt, S.A.I. Klopfenstein, B. Lassen-Schmidt, M. Peters, U. Sax, D. Waltemath and C.O. Schmidt declare that they have no competing interests.
Ethical standards
For this article no studies with human participants or animals were performed by any of the authors. All studies mentioned were in accordance with the ethical standards indicated in each case.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pigeot, I., Ahrens, W., Darms, J. et al. Making Epidemiological and Clinical Studies FAIR Using the Example of COVID-19. Datenbank Spektrum 24, 117–128 (2024). https://doi.org/10.1007/s13222-024-00477-2
DOI: https://doi.org/10.1007/s13222-024-00477-2