1 Introduction

Knowledge in health research is typically gained from empirical studies, such as clinical trials, epidemiological studies or public health surveys. In times of digitalization, data that have been generated in one empirical health study frequently become relevant in other research contexts. In fact, reusable data with rich metadata are an asset of good scientific practice. Thus, it is important to provide scientists with an accessible and comprehensive overview of existing studies that are available for reuse. In an ideal world, a researcher would refer to a centralized information source, enter a set of search terms and obtain an overview of potentially relevant studies for a scientific question. Unnecessary or redundant research would be avoided, saving time and monetary resources. However, despite the many advances in information technology, several obstacles remain, and relevant research on a topic of interest is likely to be missed due to time-consuming search and access processes. It remains a cumbersome task to find relevant studies and information on how they were designed, how the data were collected and how the data can be accessed, if access is possible at all.

The lack of overarching data management pipelines is particularly threatening when there is an urgent need for scientific information and knowledge exchange. A recent example is the COVID-19 pandemic, where a better understanding of the transmission dynamics of the virus as well as of preventive and therapeutic options would have improved public health and medical decision-making. Since the outbreak in 2019, countless projects have emerged to investigate SARS-CoV‑2 and its consequences for population health. This huge number of studies can hardly be tracked, making it impossible to obtain a reliable overview of COVID-19 studies. This was also the case in Germany, where no comprehensive overview of completed, ongoing or planned research activities bridging epidemiological, public health and clinical research was available at the time of the pandemic outbreak.

Despite all efforts to make empirical studies and their data better findable, accessible, interoperable, and reusable according to the FAIR principles [1], many challenges remain: (i) A central search hub for empirical health studies, including a central access point, is still missing. (ii) Although several metadata models have been suggested, no consensus model exists [2, 3]. (iii) Organizing the necessary study information in a searchable form and choosing an adequate search portal are further hurdles. The extent of these problems varies with the requirements for registering studies in different research areas: whereas clinical trials need to be registered, for example, with the International Clinical Trials Registry Platform (ICTRP) [4] or the German Register of Clinical Trials (DRKS) [5], there is no such requirement for epidemiological and public health studies. (iv) Studies of interest need to be presented in a comprehensive manner, while organizing the complex information about potentially thousands of variables. Tools to do so have been put in place by different projects and networks, such as Maelstrom [6]. (v) Finally, if data become available, their scientific quality needs to be assessed before reuse [7].

In the light of the COVID-19 pandemic and the urgent need to overcome these hurdles, the NFDI4Health Task Force COVID-19 was established to simplify workflows for scientists to find and access German COVID-19 research. To ensure the sustainability of results generated by the Task Force COVID-19, they have been taken up by the German consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health). Conversely, lessons learned from the Task Force helped to optimize the respective tasks of NFDI4Health.

This paper provides an overview of the building blocks for FAIR health research developed within the Task Force COVID-19 and of how this initial work was subsequently expanded by NFDI4Health to cover a wider range of studies in epidemiological, public health and clinical research.

2 Building Blocks to Make COVID-19 Data FAIR and Approaches for Upscaling

2.1 Joint Metadata Model

Sharing information on studies, their design, and details of the collected data is a precondition for FAIR science. Therefore, we generated an information model that integrates information from clinical trials as well as epidemiological and public health COVID-19 studies and their resources, such as instruments, documents, data collections, and others [2]. To respond to the COVID-19 pandemic in due time, we followed a practical and agile hands-on approach rather than conducting an in-depth analysis and planning phase beforehand. The information model (see [8] for the latest version, MDS V3.3) is partially based on existing standards, such as the DataCite metadata schema [9] for general and bibliographic information, the MIABIS model for observational research originally designed for biobanking metadata [10], and the Maelstrom taxonomy [6] for epidemiological studies, as well as on models used by clinical trial registries, such as ClinicalTrials.gov or DRKS [5]. Without fully replicating those underlying metadata models, we combined some of their major metadata items (132 in MDS V3.3) and corresponding value sets (260 in MDS V3.3) with 88 further items and attributes. We also included 127 additional values defined and required directly by researchers from the domain, e.g. to find studies with certain nutritional conditions or those including a specific chronic disease. The model is designed in a modular way: the core module covers metadata attributes describing the general type of content (e.g. study, dataset, data dictionary, study protocol), bibliographic information, contributors, identifiers and provenance information on the studies and their resources. The resource design module covers design elements of studies (e.g. study type, study region, sample size, eligibility criteria, outcomes, interventions, population, health conditions and similar) and information on data sharing aspects, such as licenses and data access (see Fig. 1).
Additional modules cover specific metadata for nutritional studies, for studies on chronic diseases and for record linkage. The metadata items are classified by their data types as codable concepts (with their corresponding value set domains), backbone elements grouping related data items and concepts, text strings, integer numbers, Boolean variables and URLs. Cardinalities were defined to distinguish between mandatory and optional data items; where required, conditional cardinalities were defined to represent dependencies between the items. Specific data items that allow for connecting the different modules were included in the core and resource design modules. These items store information on whether a certain module is required for a given dataset.
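To make the modular structure concrete, the following sketch models a single metadata item with its data type, module assignment and cardinality. The item names, modules and value sets are hypothetical simplifications for illustration, not the actual MDS V3.3 definitions:

```python
from dataclasses import dataclass, field

@dataclass
class MetadataItem:
    """One item of a (hypothetical, simplified) modular metadata schema."""
    name: str
    data_type: str            # e.g. "codableConcept", "string", "integer", "boolean", "url"
    module: str               # e.g. "core", "resourceDesign", "nutrition", "recordLinkage"
    cardinality: str          # e.g. "1..1" (mandatory), "0..*" (optional, repeatable)
    value_set: list = field(default_factory=list)

    def is_mandatory(self) -> bool:
        # A minimum cardinality of at least 1 marks the item as mandatory.
        return int(self.cardinality.split("..")[0]) >= 1

# Illustrative items, loosely modeled on the description in the text
study_type = MetadataItem(
    name="resource.design.studyType",
    data_type="codableConcept",
    module="resourceDesign",
    cardinality="1..1",
    value_set=["interventional", "observational"],
)
acronym = MetadataItem(
    name="resource.acronym", data_type="string",
    module="core", cardinality="0..1",
)
```

A validator built on such definitions could then reject records that omit mandatory items or use values outside the item's value set.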

Fig. 1
figure 1

NFDI4Health Taskforce COVID-19 and NFDI4Health Metadata Schema (Current version 3.3, [8])

To enhance compatibility with existing data models, a complete mapping of the schema was conducted to reference models used by clinical trial registries [4, 5] and to the Clinical Research Metadata Repository of the European Clinical Research Infrastructure Network (ECRIN) [11]. Other models and standards relevant for COVID-19 studies that helped shape the content of the metadata model include the Health Level 7 Fast Healthcare Interoperability Resources (HL7® FHIR®) [12] and the Operational Data Model of the Clinical Data Interchange Standards Consortium (CDISC ODM) [13]. For the value sets of the data items, we integrated concepts from domain-specific international terminologies, ontologies and classifications, such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), the NCI Thesaurus (NCIt), the Unified Medical Language System (UMLS), and the International Statistical Classification of Diseases (ICD). To allow for upscaling, the current information model has been designed in a generic fashion for health studies in general, not solely for COVID-19 studies (see Sect. 3). Designed this way, the metadata model can capture information for a wide range of use scenarios: for example, it can represent and compare clinical as well as observational studies rather than being confined to one of these two areas. This makes it possible to find studies of interest and to understand how their data can be accessed. Another feature is the ability to represent documents and other study-related resources within the framework, for example to make standard operating procedures findable and accessible. The model can also host metadata that describe the collected data in detail. In conclusion, a main advantage of the MDS over other models lies in its broad scope.

2.2 The COVID-19 Central Health Study Hub

Subsequently, a web application, the COVID-19 Central Health Study Hub, was designed to make the collected information on epidemiological, public health and clinical research browsable. It provides a ‘one stop shop’ for content related to German COVID-19 studies. The initial service was built on FAIRDOM SEEK [14], a software platform for data sharing, to store and index metadata and materials as well as to assign DOIs. SEEK is a well-established life science data and model management platform developed by the FAIRDOM initiative and released as open-source software [14]. This allowed us to reuse many features and to adapt the system only where needed, e.g. by implementing the tailored metadata schema and a corresponding interface that allows users to capture the metadata. Background services were developed to automatically transfer metadata from existing data sources such as the DRKS [5], ClinicalTrials.gov and the WHO ICTRP [4] into the Health Study Hub. An architectural sketch is depicted in Fig. 2.

Fig. 2
figure 2

An architectural sketch of the main components of the German Central Health Study Hub. The component labelled ‘Application Programming Interface’ is an anti-corruption layer, i.e. a façade pattern, that combines and aggregates different subsystems and provides a RESTful API to a component labelled ‘User Interface’, which is a single page application that consumes the aforementioned API and creates an appealing user interface
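The anti-corruption layer described in the caption can be sketched as a façade that translates the data models of heterogeneous subsystems into one stable internal representation for the user interface. The subsystem classes, method names and fields below are purely illustrative stand-ins, not the actual implementation:

```python
class MetadataStoreClient:
    """Stand-in for the (meta-)data management subsystem."""
    def get_study(self, study_id):
        # Subsystem-specific field names that must not leak to the UI
        return {"ms_id": study_id, "ms_title": "Example COVID-19 cohort"}

class InstrumentCatalogClient:
    """Stand-in for the instrument-browsing subsystem."""
    def get_instruments(self, study_id):
        return [{"label": "Baseline questionnaire"}]

class StudyHubFacade:
    """Anti-corruption layer: aggregates subsystems behind one stable API
    so the single-page application never depends on subsystem data models."""
    def __init__(self, store, catalog):
        self.store = store
        self.catalog = catalog

    def study_overview(self, study_id):
        study = self.store.get_study(study_id)
        instruments = self.catalog.get_instruments(study_id)
        # Translate subsystem-specific fields into the hub's own schema
        return {
            "id": study["ms_id"],
            "title": study["ms_title"],
            "instruments": [i["label"] for i in instruments],
        }

facade = StudyHubFacade(MetadataStoreClient(), InstrumentCatalogClient())
overview = facade.study_overview("DEMO-0001")
```

Because only the façade knows the subsystem schemas, a subsystem can be swapped (as later happened when SEEK was replaced) without changing the user interface.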

One main challenge was content creation for the COVID-19 Central Health Study Hub, because interfacing with existing resources worked for clinical trials but not for epidemiological and public health studies. For the latter, manual data collection was required. Members of our Task Force COVID-19 team contacted study representatives in person and/or extracted information from websites or preprint publications. In addition, a user interface for the COVID-19 Central Health Study Hub was set up to allow registered study experts to manually share and/or review the detailed meta-information of their studies.

Furthermore, Application Programming Interfaces (APIs) connect the COVID-19 Central Health Study Hub to data holding organizations and other health infrastructures to integrate novel content. The feasibility of this approach has been demonstrated by cooperation with the COVID-19 Research Infrastructure Platform for Children and Adolescents (coverCHILD, [15]) of the Network University Medicine in Germany (NUM, [16]) where 293 child-related COVID-19 studies were added to the COVID-19 Central Health Study Hub.

The first version of the COVID-19 Central Health Study Hub was launched in 2020. It was subsequently expanded into the German Central Health Study Hub (GCHSH, [17]) (see Sect. 3).

2.3 Searching Instruments and Report Forms

Precise knowledge of the collected variables is a precondition for deciding on their usefulness for individual or collaborative research. Such information is provided by variable labels, value labels, and semantic annotations based on formal terminologies, such as those mentioned above (e.g. SNOMED CT, NCIt, ICD). While labels are commonly available in data dictionaries, this is not the case for semantic annotations. Given the time constraints of the Task Force COVID-19, a lightweight approach for semantic annotation, the Maelstrom taxonomy [6], was adopted to enable general data discoverability. It distinguishes 13 domains and 138 subdomains of information and was applied to COVID-19 survey instruments and electronic case report forms. As the Maelstrom taxonomy is limited in granularity, we aimed to make the annotation more precise with further suitable terminologies: following a requirements analysis, we developed an annotation concept based on the standard “ISO/TS 21564:2019 Health Informatics—Terminology resource map quality measures (MapQual)” involving several annotators and an adjudication process. We used SNOMED CT, the most comprehensive international health terminology, to test the feasibility of this approach on smaller samples of COVID-19 questionnaires [18, 19]. These questionnaires were used in care, clinical, epidemiological and public health projects and mainly retrieved clinical information, administrative data and social determinants of health, i.e. concepts that can be found in SNOMED CT. On the one hand, the items could be annotated more precisely and used to code questions in FHIR Questionnaire instances. On the other hand, SNOMED CT is non-exhaustive and does not support annotations with categories such as “other” or “not specified”. Annotation turned out to be time- and resource-intensive due to the iterative development of annotation rules, the need for trained annotators, and the use of several SNOMED CT concepts to fully represent the semantics of a question.
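The lightweight annotation workflow can be illustrated with a small sketch in which questionnaire items are tagged with Maelstrom-style domain/subdomain labels and later retrieved by domain. The domain labels and questionnaire items shown here are illustrative examples, not the official taxonomy entries:

```python
# Illustrative stand-in for a controlled list of taxonomy domains
KNOWN_DOMAINS = ["Diseases", "Sociodemographic characteristics", "Health behaviors"]

annotations = []

def annotate(item_text: str, domain: str, subdomain: str) -> dict:
    """Attach a taxonomy-style tag to one questionnaire item."""
    if domain not in KNOWN_DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    record = {"item": item_text, "domain": domain, "subdomain": subdomain}
    annotations.append(record)
    return record

annotate("Have you ever been diagnosed with COVID-19?",
         domain="Diseases", subdomain="Infectious diseases")
annotate("What is your current occupation?",
         domain="Sociodemographic characteristics", subdomain="Occupation")

# Discoverability: retrieve all items tagged with a given domain
disease_items = [a["item"] for a in annotations if a["domain"] == "Diseases"]
```

Even this coarse tagging already enables cross-study searches ("all items on infectious diseases"), which is what the Maelstrom approach provides at scale.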

Eventually, we integrated the Modular Integrated Concept Architecture (MICA) component of the Open Source Software for Epidemiology (OBIBA, [20]) into the COVID-19 Central Health Study Hub to store all instruments semantically annotated with the Maelstrom taxonomy [6]. We thus make the instruments available for online searches using the MICA metadata catalogue and the data discovery tool. The choice of other terminologies and ontologies, such as those represented in the UMLS, requires further evaluation to empirically guide the semantic representation of content and to align well with the requirements of the target community.

2.4 Assessing and Managing Data Quality

During the COVID-19 pandemic, it was extremely important to assess the quality of CT images of COVID-19 patients. For this purpose, we developed and made publicly available an image analysis service that quantitatively evaluates CT images of these patients. In particular, the visibly affected lung regions are detected automatically by a specifically trained deep learning module, and the affected tissue volume is calculated. This service enables researchers to uniformly evaluate DICOM image data from different sources and to conduct multi-center studies. The implementation was realized in collaboration with grand-challenge.org, an open platform for comparative medical image analysis with over 70,000 registered users and about 2000 submissions per month, where access to our service can be requested [21]. We see such quality assurance modules as valuable components of future health data research infrastructures. Multiple initiatives exist in Europe and elsewhere to address unsolved FAIRification challenges, such as common metadata models or standardized meta-tool repositories [22]. Future extensions of our work could include providing the quality assurance algorithms as a Docker container to be downloaded and used offline. Results of such a component could include tabular output covering all quality parameters for a larger set of images or, e.g., a DICOM structured report file for each individual DICOM image series.
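The volumetric quantification step can be illustrated with a toy computation: given a binary segmentation mask of affected voxels (as produced by the trained model) and the voxel spacing from the DICOM header, the affected volume is the number of affected voxels times the volume of one voxel. The sketch below shows only this final arithmetic step with made-up numbers, not the actual deep learning pipeline:

```python
def affected_volume_ml(mask, spacing_mm):
    """Compute affected tissue volume from a binary segmentation mask.

    mask       : nested lists of 0/1 per voxel (slice x row x column)
    spacing_mm : (slice thickness, row spacing, column spacing) in mm
    Returns the volume in millilitres (1 ml = 1000 mm^3).
    """
    voxel_volume_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_affected = sum(voxel for slc in mask for row in slc for voxel in row)
    return n_affected * voxel_volume_mm3 / 1000.0

# Toy example: 2 slices of 2x2 voxels, 5 mm slice thickness, 1 mm x 1 mm pixels
mask = [
    [[1, 0],
     [1, 1]],
    [[0, 0],
     [1, 0]],
]
volume = affected_volume_ml(mask, spacing_mm=(5.0, 1.0, 1.0))  # 4 voxels of 5 mm^3 each
```

In a real service, the same quantity would additionally be reported relative to total lung volume to yield the affected fraction per patient.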

Although of high relevance during the pandemic, assessing data quality is not limited to images. Regardless of the specific data source, users need structured information about data quality to make informed decisions about further data processing and subsequent statistical analyses. To support this process, data quality reports should be generated based on machine-readable descriptions of expectations and requirements for the study data [23]. Such metadata might include range limits and contradiction rules as well as information about examiners or devices. Only a few tools are currently capable of conducting such analyses in an automated fashion [24]. We were able to assess the quality of COVID-19 study data by successfully applying the software package dataquieR (Data Quality in Epidemiological Research for the programming language R). This package [25] is openly available on the Comprehensive R Archive Network (CRAN). It assesses core data quality dimensions such as integrity, completeness, consistency, and accuracy in a standardized manner, as outlined in a data quality framework for observational studies [7]. The easy-to-use tool produces a wide range of tabular and graphical outputs that can be accessed through a browser.
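The principle of driving quality checks from machine-readable metadata, which dataquieR implements far more comprehensively, can be sketched as follows. The rule format, variable names and thresholds are hypothetical simplifications for illustration, not the dataquieR metadata format:

```python
# Machine-readable expectations per variable (hypothetical rule format)
RULES = {
    "age": {"min": 0, "max": 110},
    "sbp": {"min": 60, "max": 260},   # systolic blood pressure, mmHg
}
CONTRADICTIONS = [
    # (description, predicate flagging a contradictory record)
    ("pregnant male", lambda r: r.get("sex") == "male" and r.get("pregnant") is True),
]

def quality_report(records):
    """Count missing values, range violations and contradictions."""
    report = {"missing": 0, "range_violations": 0, "contradictions": 0}
    for r in records:
        for var, rule in RULES.items():
            value = r.get(var)
            if value is None:
                report["missing"] += 1
            elif not (rule["min"] <= value <= rule["max"]):
                report["range_violations"] += 1
        for _desc, pred in CONTRADICTIONS:
            if pred(r):
                report["contradictions"] += 1
    return report

records = [
    {"age": 54, "sbp": 135, "sex": "female", "pregnant": False},
    {"age": 210, "sbp": None, "sex": "male", "pregnant": True},  # three issues
]
report = quality_report(records)
```

Because the expectations live in data rather than code, the same checking routine can be reused across studies by swapping the metadata, which is the core idea behind metadata-driven quality reports.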

2.5 Learning About FAIRification of Study Data

During the COVID-19 pandemic, face-to-face training on research data management was impossible or severely limited, while at the same time demand for training increased due to the urgent need for high-quality data from clinical trials and epidemiological studies. For instance, many observational studies were initiated in Germany, especially in the early days of the pandemic, often involving medical staff without prior training in Good Epidemiological Practice (GEP) and Good Clinical Practice (GCP). To ensure valid and reliable results from these studies, training specifically on the quality requirements for observational studies and GEP was urgently required. For randomized controlled clinical trials (RCTs), too, there was an unmet need for online training courses.

Therefore, online trainings on ethical and methodological background for observational studies were developed according to GEP requirements. In five modules of 30 to 45 min each, training is provided on methodological and ethical backgrounds, design and conduct of epidemiological studies, data and quality management as well as data protection and publication of results. In conjunction with the training courses for GCP-compliant clinical trials developed and offered by the Coordinating Centers for Clinical Trials (KKS) and the KKS Network, specific online courses are available for observational or epidemiological studies as well as for interventional studies. Participants can attend the various modules according to their individual time resources.

2.6 Record Linkage

During the pandemic, it became undeniably clear that Germany faces major challenges in linking personal health data. German data protection laws hinder the establishment of a unique identifier that would facilitate linking individual health data from different sources. For instance, it was not possible to link COVID-19 vaccination data with health insurance data to learn about potential adverse reactions [26]. In addition, urgently needed follow-up studies on potential long-term consequences of a COVID-19 infection are hampered by the lack of applicable concepts for record linkage. To address this major challenge, various actions were taken by the Task Force COVID-19 and subsequently continued within NFDI4Health: first, the Task Force COVID-19 provided a template for a consent form and a guidance document on the corresponding legislation to facilitate record linkage of individual health data [27]. Second, the Task Force COVID-19 actively sought opportunities to discuss the need for record linkage of health data with policy makers. It initiated a white paper on this topic by bringing together a broad range of health researchers and the community [28]. This activity was then taken up by NFDI4Health (see Sect. 3).

3 Approaches for Upscaling

NFDI4Health built on the lessons learned and extended the activities and efforts of the Task Force COVID-19. In the following, we will briefly describe some of these advancements.

Metadata Schema

Based on the metadata model of the Task Force COVID-19 (see Sect. 2.1), a multidisciplinary group of experts within NFDI4Health further developed the metadata model and extended it to other domains of health research, aiming at a generic schema to describe metadata for any health-related study. Collaborations and joint workshops between stakeholders from different health disciplines, IT experts and data modelers led to the addition of further data items and value sets, as well as to agreed definitions that meet the target communities' needs. Taking particular requirements of the NFDI4Health community into account, domain-specific metadata modules, e.g. for nutritional epidemiology, chronic diseases, or record linkage, were developed and added to the schema (see Fig. 1). Further iterative refinements of the metadata model are based on established feedback mechanisms, both internal and external.

In addition, the FAIRness of the metadata schema has been improved by mapping its elements and values to international semantic (SNOMED CT, ICD, NCIt, etc.) and syntactic (FHIR) standards after an analysis of different standards used in the health and life sciences. FHIR is the most widely used information technology standard for storing and exchanging a wide range of data in health-related contexts. The research and public health domains are increasingly addressed, as shown by the different areas of application [29], the current use in other research initiatives, and major changes to the FHIR resource ‘ResearchStudy’ in the last and upcoming releases to encompass the requirements of the different health research communities. To render the metadata model, which was initially maintained in a spreadsheet file, findable and accessible, it was modeled in ART DÉCOR [30], an open-source collaborative platform to create, describe, manage and deploy health data models. Additionally, we created logical models, profiles and value sets along with an implementation guide to represent the metadata schema in FHIR and published these on SIMPLIFIER [31], a widely used publishing platform and registry for FHIR projects. Both tools enable users to browse the metadata model online and thus provide a user-friendly representation and clear visualization of the metadata schema. The further FAIRification of the metadata schema is assessed on the basis of the criteria proposed by the Research Data Alliance (RDA) [32]. An important future development will be the elaboration of interfaces between different standards to facilitate the exchange of information and to avoid excluding potential users.
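To illustrate what such a FHIR representation looks like, the snippet below constructs a minimal R4-style `ResearchStudy` instance as a JSON document. It is a hand-written sketch with illustrative identifiers and example system URLs (only the SNOMED CT system URI is the official one), not an excerpt from the published NFDI4Health profiles:

```python
import json

# Minimal R4-style ResearchStudy resource (illustrative values only)
research_study = {
    "resourceType": "ResearchStudy",
    "identifier": [{
        "system": "https://example.org/study-hub-ids",   # hypothetical identifier system
        "value": "DEMO-0001",
    }],
    "title": "Example COVID-19 observational study",
    "status": "completed",
    "category": [{
        "coding": [{"system": "https://example.org/study-type",  # hypothetical value set
                    "code": "observational"}]
    }],
    "condition": [{
        # SNOMED CT concept for COVID-19, using the official system URI
        "coding": [{"system": "http://snomed.info/sct",
                    "code": "840539006",
                    "display": "COVID-19"}]
    }],
}

serialized = json.dumps(research_study, indent=2)  # what a FHIR server would exchange
```

Binding the `condition` element to a terminology such as SNOMED CT is what makes resources from different registries comparable in a query.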

German Central Health Study Hub

Since its launch, the COVID-19 Central Health Study Hub has been continuously improved and expanded based on user feedback. Various measures were taken to upscale the Health Study Hub to serve the needs of the health research community (see Fig. 2). For instance, SEEK was replaced by Dataverse [33] for (meta-)data management and standard publication workflows. To provide an interconnected platform, the user front-end now combines the Dataverse software and a tool for browsing instruments (see Sect. 2.3). To date, 1926 studies, instruments and other documents have been made available on the Health Study Hub. International connectivity is pursued by linking our national services to international platforms (e.g., the European Health Information Portal [34]).

Structured Search Tools

Given the viability of the Maelstrom approach for semantic annotation, we initiated a close collaboration with the Maelstrom Research Group to annotate further major studies. This approach allows a structured search for important instruments of major health studies, such as the Study of Health in Pomerania (SHIP) [35] and the IDEFICS/I.Family cohort [36], which are now accessible on the Health Study Hub. To support multi-hierarchical ontologies such as SNOMED CT and to fully display these annotations, further developments of the underlying software components OPAL and MICA will be necessary.

Quality Assessment

The R software package dataquieR was further elaborated, mainly with regard to the scope of metadata used to control automated data quality assessments. For this purpose, the potential of attributes in existing standards and data models to hold information relevant for controlling data quality analyses was systematically assessed. We targeted CDISC Define-XML [37], HL7® FHIR® [11], REDCap [38, 39], the OMOP Common Data Model (OMOP—OHDSI) [40] and OpenClinica [41]. This assessment showed that these standards are rather limited in their capability to hold attributes of relevance to data quality assessments. The high interest of our target community in dataquieR is reflected in more than 15,000 downloads since 2020.

Training

NFDI4Health worked on general training courses for research data management and data science such as Data Train and hands-on virtual and on-site training sessions, e.g., for data stewards. For more details we refer to the “Handbook: Training concepts in research data management and data science with focus on health research” which has been compiled by NFDI4Health [42].

Record Linkage

Under the leadership of NFDI4Health, several health researchers and the user community, including cancer registries, the Medical Informatics Initiative and the Network University Medicine, jointly worked on a white paper [28] to (i) demonstrate the current status of record linkage of personal health data for research purposes in Germany through ten use cases from different health research areas, (ii) provide a comprehensive overview of approaches for linking different data sources and (iii) give recommendations to improve the possibilities for record linkage in health research in Germany. Recommendations include the introduction of a unique identifier and the establishment of a decentralized infrastructure with central components, e.g. an approval agency and a third-party trust center (see Fig. 3 for a schematic illustration), which would overcome the challenges identified in the record linkage use cases. The white paper particularly targets decision makers to stress the urgent need for better regulation of record linkage of personal health data and for the necessary adaptation of German legislation. A first step in this direction is taken by the planned German Law on Health Data Use (Gesundheitsdatennutzungsgesetz [43]).

Fig. 3
figure 3

Simplified example of a possible request and data flow for record linkage of health data from two decentralized data holding organizations (DHO), including the central components: central access point (CAP), approval agency (AA) and third-party trust center (TTPC) (ID: encrypted unique identifier). The CAP acts as a single point of contact for researchers and forwards the record linkage request to a use and access committee of an AA. Once approved, the CAP forwards the request to a TTPC and the respective DHOs, which perform the record linkage to provide analysis datasets to the researcher in an analysis center.
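The role of the encrypted unique identifier (ID) in this flow can be illustrated with a keyed hash: each DHO derives the same pseudonym from a person's identifying attributes without revealing them, so records from both organizations can be linked on the pseudonym. This is a simplified sketch assuming error-free attributes; production record linkage additionally has to handle typos and variant spellings, e.g. via phonetic encodings or Bloom filters:

```python
import hmac
import hashlib

def encrypted_id(first_name: str, last_name: str, birth_date: str, key: bytes) -> str:
    """Derive a pseudonym from identifying attributes with a keyed hash (HMAC).
    Only parties holding the secret key (e.g. the trust center) can reproduce it."""
    normalized = f"{first_name.strip().lower()}|{last_name.strip().lower()}|{birth_date}"
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

TTPC_KEY = b"secret-key-held-by-the-trust-center"  # illustrative only

# Two data holding organizations compute the ID independently ...
id_dho1 = encrypted_id("Erika", "Mustermann", "1980-05-01", TTPC_KEY)
id_dho2 = encrypted_id(" erika ", "MUSTERMANN", "1980-05-01", TTPC_KEY)

# ... and their records can be linked because the pseudonyms match
linkable = id_dho1 == id_dho2
```

Keeping the key exclusively at the trust center prevents the DHOs, and any outside party, from reversing or precomputing the pseudonyms.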

4 Discussion and Conclusion

The COVID-19 pandemic was sudden and unforeseen, posing enormous challenges to individuals, society, medical care and the scientific communities. The Task Force COVID-19 was therefore established within a very short time. It contributed to FAIR epidemiological, public health and clinical research with several activities from the earliest days of the pandemic. It created a core of data models, tools, infrastructure, and services that have been taken up and upscaled by NFDI4Health to become widely applicable within our target communities. In addition, the Task Force COVID-19 developed models in close collaboration with other German research groups to predict the evolution of the pandemic. These models have been used to describe the current situation and to demonstrate the potential impact of political actions and human behavior.

Our developments are expected to serve as an important source of information in the post-pandemic era. The COVID-19 Central Health Study Hub was set up in a timely manner to manage the large number of COVID-19-related studies and the instruments used therein. The study hub may serve as a directory to easily and quickly retrieve the necessary study information and to promote more harmonized research activities. The metadata model stimulates discussions on how to elaborate and store information on studies in the targeted fields of research, thereby creating an opportunity for a standard to describe research. The model is currently being discussed with other life science communities, for example genomics (the German Human Genome-Phenome Archive, GHGA) and the social sciences (KonsortSWD), but also with different initiatives that build study repositories (e.g. the European Clinical Research Infrastructure Network, ECRIN) to facilitate the joint use of information. Mapping metadata elements to a common and widely accepted standard, namely FHIR, may prove to be a key asset for interoperability with other platforms and for the long-term sustainability of the German Central Health Study Hub. While different standards have been considered for the European Health Data Space (EHDS), e.g. in studies requested by committees of the European Parliament, none has been specifically chosen yet. However, FHIR might play an important role in the EHDS, especially for the exchange of electronic health records (EHR). As EHR are increasingly used for the recruitment of study participants or as a secondary data source, building an ecosystem based on a common standard would allow bridging clinical care and research. In addition, to accommodate a wide range of potential users, we need to take other models, such as OHDSI's OMOP, into account.

A better understanding of the legal basis for data use and linkage may help scientists to successfully carry out such activities. The consent form could serve as a template for further studies. The results of the project's activities regarding data quality assessments may substantially foster reproducible science by increasing the scope, comparability and transparency of these assessments. NFDI4Health has also started activities to raise awareness of the potential of FAIR research in its community through numerous workshops, symposia and talks at community-specific conferences.

Despite all efforts, a number of hurdles had to be overcome. One of these is content creation. Extracting metadata from literature or web pages is a laborious process that does not scale. Qualified staff is necessary to ensure high-quality entries in a central search hub, especially if semantic standards are to be adhered to in a consistent manner. The willingness to upload material cannot be taken for granted and strongly depends on efficient processes and sufficient support. It may be difficult to get started, as the benefits only become apparent in the medium term, when the amount of information is sufficient to serve the community on a regular basis. Challenges also arise, for example, when material is only available in German but English is desired for international cooperation. Likewise, options for licensing and DOI allocation are hardly known in the target communities and need to be communicated through training, as their benefits will also only be realized in the medium term. Substantial difficulties may arise from agreeing on common standards, as the target communities do not necessarily share the same scientific language. Such consensus procedures have proven to be lengthy and may not always lead to a consensus. While the manual curation of content may be unavoidable when metadata for a study or related resources are set up for the first time, we will emphasize cooperation with services that have curated vast amounts of medical metadata at the study or item level. Examples are the Portal for Medical Data Models (MDM) [44], the Maelstrom Research Catalogue [6], and the euCanSHare catalogue [45]. This can be realized through interfaces between the data models used to store metadata, as previously showcased by the integration of heterogeneous COVID-19 resources in a knowledge graph [46]. In particular, for the MDM an interface to CDISC ODM would be necessary. An incentive for the participating partners could be to make curated metadata mutually available, as each portal offers unique metadata features of relevance for particular target communities.

Based on the lessons learned, Table 1 summarizes recommendations to support future network activities on FAIR research.

Table 1 Recommendations to foster FAIR research

In addition to the above recommendations, there is a need for cultural and political change. Scientific achievements should not only be measured based on scientific results published in peer-reviewed journals but also, for example, by shared and successfully used datasets, services provided to the community, the transfer of results to the public or to politics, and the training of early career researchers. In particular, the sharing of data and tools should contribute to the scientific reputation of scientists. Awareness will be strengthened by the German Law on Health Data Use [43] and the European Health Data Space [47].

In conclusion, the recommendations summarized in Table 1, raising awareness of the FAIR principles and striving for a cultural change may help to foster further collaborations among the various stakeholders and actors, so that data sharing eventually becomes a natural concomitant of research under one National Research Data Infrastructure in Germany.