Keywords

1 Introduction

In recent years, there has been an increasing interest in evidence-based computer-aided approaches to address global challenges (GC). It has led to a rapid formation and methodological development of the domain of computational global systems science (GSS) which strives to provide “scientific evidence to support policy-making, public action, and civic society” [10]. Accurate digital twinning of the GC inevitably results in computationally expensive coupled simulations comprised of multiple models for diverse social and physical phenomena, often multiscale by design. These simulations assemble not only different models, but also various sources of massive static and streaming data sets. In effect, accurate numerical treatment of the GC requires combining up-to-date theoretical developments in GSS with high performance computing (HPC) and high performance data analytics (HPDA) technologies.

The HiDALGO project – HPC and Big Data Technologies for Global Challenges – aims to bridge the gap between traditional HPC and data-centric computation with a view to provide efficient technological solutions for accurate policy-making in the domain of GC. The project’s aim is not merely to develop the mechanisms for collecting and processing data by scalable methods on analytics and simulation levels. With the strong focus on highly accurate elaborated models, HiDALGO builds up coupled simulations and integrates real-world data in static simulations. The latter yield scientific insights into comprehensive processes and their corresponding phenomena, which occur in the contemporary world and directly affect the quality of human life. This research is driven by three representative case studies: human migration from conflict zones, air pollution levels in the cities, and the spread of malicious information in social media (such as Twitter and telecommunications networks).

In practice, when it comes to coupling of models and data sources for GC simulations in HPC environments, one faces a number of technical challenges, not common for traditional engineering and science applications for HPC. In particular, HPC environments typically operate on static data, which is already available on efficient parallel distributed file systems such as Lustre, BeeGFS, or OrangeFS [5]. Moreover, HPC data centers usually follow a set of security guidelines to isolate users and data which restrict access to incoming data from external sources. However, with the intent to reflect current and upcoming situations for policy-making, GC simulations extensively use recent data from external sources including streaming data coming from physical sensors, social networks, and mobile operators. Consequently, HPC system’s operation must be revised in order to allow highly complex simulation, but at the same time, provide the necessary flexibility to incorporate influx data.

Another common challenge stems from the necessity to couple simulations and exchange data across data centers in GC simulation scenarios. This demand is introduced by various reasons including endeavor to reuse time-consuming simulation results in multiple use cases, inability to share massive amounts of data, issues related to licensing the data or simulation software available at different data centers.

1.1 Related Work

Multiscale modeling has a long history of research. Weinan et al. [27] build a preliminary theoretical foundation for the multiscale modeling approaches and highlights key problems in this area. Hoekstra et al. [17] present an epistemological review on the introduction and significance of multiscale modeling and simulation in the interdisciplinary research. They further identify gaps in the current state-of-the-art of multiscale computing and recommend ways to fill these gaps. Zhang et al. [28] presents a study on the complex dynamic bio-system of brain cancer, using a multiscale agent-based modeling (ABM) approach.

Multiscale model coupling is well covered in the literature. Groen et al. [15] present a multiscale survey with a classification of applications in different research domains including: astrophysics, environment, engineering, materials science, energy and biology. They also review a set of multiscale models coupling tools with respect to the domains, approaches, and platforms. Crooks and Castle [9] focused on developing geospatial simulations using the ABM paradigm. They reviewed methods of multiscale coupling of geographical information systems (GIS) with ABM and compared a selection of toolkits which allow such integration. Tao et al. [25] present a multiscale modeling framework (MMF) that deals with the complex issues of weather simulation and discuss several coupling strategies for the sub-systems. Lastly, Gomes et al. [12] review methods and techniques of ‘Co-Simulation’, a technique that allows different subsystems to be coupled and simulated in a distributed manner.

A full review of individual frequently used coupling tools is beyond the scope of this paper, and comprehensively done by Groen et al. [14]. However, by means of example we do highlight a few key tools here. Borgdorff et al. [4] present the Multiscale Coupling Library and Environment (MUSCLE2), which facilitates the coupling and execution of submodels in cyclic multiscale applications. The FabSim [13] toolkit aims to simplify a range of multiscale computational tasks for a diverse range of application domains.

Modern surveys of the tools for scientific workflows definition and management [1, 20, 22] cover a broad spectrum of state-of-the-art solution – from mature software, evolved for several decades (like ASKALON, Galaxy, Kepler, Pegasus, or Taverna), to relatively new active developments (like Apache Airavate). Most of the tools stem from the former Grid solutions, where only DIRAC [6] and Airavate [21] support execution on Grid, HPC, and Cloud resources simultaneously. However, DIRAC suffers from poor flexibility, while Airavate does not follow any industrial standard in defining workflows [6]. At the same time, Cloudify [8] is a mature extensible tool for Clouds following industrial strength OASIS TOSCA standard [3]. Croupier orchestrator developed recently extends out-of-the-box Cloudify functionality to the HPC and HPDA environments [6].

1.2 Contributions of the Paper

Despite a broad coverage of different topics, existing literature on the multiscale model coupling barely touches the aspects related to the aforementioned technical challenges of coupling in HPC environments – involving external data sources into the static simulations and coupling across data centers. By introducing a generalized GC simulation workflow, this paper demonstrates a commonality of these technical challenges for various GC and reflects on the approaches to overcome them implemented in the HiDALGO project.

In Sect. 2, we briefly introduce HiDALGO case studies. These representative global challenges comprise a baseline for development of the generalized workflow for GC simulation scenarios, which is defined in Sect. 3. Section 4 is concerned with the implementation of the workflow. This is a core section of the paper, where we present the high level HiDALGO architecture, which implements the generalized workflow, as well as outline HiDALGO approaches to address coupling in HPC environments. Finally, in Sect. 5, we offer a conclusion, including a research and development outlook.

2 Representative Global Challenges

2.1 Human Migration

There are more than 70.8 million people forcibly displaced worldwide. Among them 25.9 million are refugees, half of which fled from Syria, Afghanistan and South Sudan [26]. These fleeing individuals are the unfortunate victims of internal armed conflicts and civil wars, who make decisions to migrate at the times of distress. Their decisions are often based on economic and political push-pull factors in sending and receiving countries. Researchers have mostly investigated why human migration occurs and what effects it has on economies using migration theories and econometric models. However, there is not an appropriate method or model to predict forced displacement counts or movements. In addition, existing models are largely based on regressing existing forced migration data, limiting their predictive power with incomplete or short datasets. Thus, we have created a simulation development approach (SDA) that allows us to forecast movements of forcibly displaced people in conflicts.

We run our simulations using the parallel Python-based Flee codeFootnote 1. It relies on an ABM approach with a location network graph, where each refugee in the simulation is represented by a single agent and location network graph reflects geographic information system (GIS) component of the model. In countries with very few roads and in mountainous areas, such as South Sudan, we explore key walking routes, increase the level of detail, and incorporate a broader range of relevant phenomena, such as weather conditions, communications, and food security, through model coupling (cf. Fig. 1).

Fig. 1.
figure 1

South Sudan location network graph semi-automatically extracted from Bing and OpenStreetMaps with the HiDALGO GIS toolkit [24]: red vertices stand for conflict zones (refugee depart here), dark green vertices correspond to camps (refugee tend to finish their journeys here), yellow vertices stand for regular locations (refugee may stay or move elsewhere), and light green vertices stand for forced redirection points (representing government-orchestrated movements). This location network includes walking routes allowing to increase resolution of the baseline agent-based model and to couple it with the models for weather conditions, communications, and food security. (Color figure online)

Our new approach is important due to three main reasons. First, it helps to forecast forced displacement movements when a conflict erupts, guiding decisions on where to provide food and infrastructure. Second, our approach provides forced displacement population estimates in regions where existing data is incomplete, to help prioritize resources to the most important areas. Finally, we can investigate how border closures and other policy decisions are likely to affect the movements and destinations of forced migration, to provide policy decision-makers with evidence that could support more effective policy and reduce unintended consequences. We have already successfully simulated forced displacement movements in Burundi, Central African Republic, Mali and South Sudan using this prototype, and compared our forecasts with the United Nations High Commissioner for Refugees (UNHCR) refugee camp registrations [24].

2.2 Urban Air Pollution (UAP)

NO\(_{2}\) is one of the most severe pollutant in urban areas, of which main producer is the vehicular traffic. The level of air contamination can be significantly reduced by proper control on the urban traffic and elaborate architectural development of the cities. UAP case study aims to understand, model, and simulate the spread of vehicular air pollution in the cities.

In order to improve accuracy of the results, we couple detailed computational fluid dynamics (CFD) simulation of the multicomponent air flow in the cities with agent-based traffic simulations, real time traffic and air quality sensor data, and meteorological prediction data. In our experiments with CFD simulation, we use a wide range of software packages including OpenFOAM, NEK5000, Fenics-HPC, and ANSYS Fluent, while traffic simulation part is done with SUMO package [2]. The geometry for meshing of the city is extracted automatically from OpenStreetMaps. For better accuracy, in our CFD simulations, meshes reach the resolution up to 1 m. In this way, we obtained accurate simulation results for the cities of Györ (Hungary, cf. Fig. 2), Graz (Austria), Stuttgart (Germany), Poznan (Poland), and Milwaukee (WI, USA) [18]. These results are further used for model reduction, which allows to lessen computational demands.

Fig. 2.
figure 2

Simulation for the spread of NOx in the city of Györ (Hungary) with the HiDALGO toolchain. NOx concentrations are illustrated via grey-scale point clouds, wind velocity by black arrows, and air path lines by blue lines [18]. (Color figure online)

2.3 Social Network Analysis (SNA)

SNA case study aims to understand, model, and simulate the spread of messages in various social networks. To achieve this goal, several scientific problems have to be addressed, such as the structural and algorithmic properties of the underlying graphs, the stochastic characteristics of information dissemination, and the interplay between these aspects. Furthermore, several programming related points have to be taken care of as well, e.g., the deployment and efficient simulation of the processes described above.

To understand the structure of social networks, we analyzed different types of graphs, which are accessible through SNAP datasets [23] or can be obtained by crawling real world social networks such as Twitter. Apart from well known characteristics, such as degree distribution, clustering coefficient, or distances, one major goal is to understand the size and the structure of the communities in these networks. For this, we apply different clustering algorithms to obtain local partitions of the direct neighborhoods of the nodes. Then, the resulting clusters are merged to obtain an overlapping partitioning of the graph. These results are used to derive synthetic graph models, which are able to properly describe social networks.

To model and simulate the spread of messages, we will analyze different data sources, including tweets from Twitter, and couple them with other real world data such as duration and geographic properties of telephone calls. In our simulation framework, we will integrate the different models from these data sources, and combine them with the synthetic models of the social networks.

3 Generalized Workflow

The GC discussed above were used to construct a generalized workflow for simulation scenarios (cf. Fig. 3), which establishes a common ground and reflects key functionalities for all presented case studies. Here, we briefly outline this workflow. For further details on derivation of the generalized workflow from the individual GC workflows, we refer the interested readers to [11].

The Data Source phase denotes the starting point into the workflow including local files as well as external APIs. From these sources data are extracted in the phase of Data Extraction. The extracted data can optionally be complemented in a Synthetic Data Generation phase which focuses on enriching existing data as well as generating new data. To give an example, in this phase a 3D city model might be constructed based on geospatial information.

The Data Processing & Feature Extraction phase prepares the data for the Model Generation phase. It transforms the data into standard formats and then seeks to extract relevant information by applying methods from, for instance, signal enhancement, representation learning, or dimensionality reduction.

Fig. 3.
figure 3

The generalized workflow in the HiDALGO project.

Visualizations support the validation of data and processing steps at several points of the workflow. The following Model Generation phase generates the actual model for the simulation. The generation process is based on the pre-processed data and might optionally incorporate external models, for example, dispersion models for weather forecasts.

The Simulation phase employs the model to generate results which are then examined in an Analysis phase. Validation is performed by comparing the simulation results to a ground truth, if available. In addition, Visualizations of the results support the validation process. The Analysis phase may directly as well as indirectly influence both the Data Processing & Feature Extraction as well as the Model Generation phase, for example, a model is updated due to inappropriate predictions. In rare cases the analysis might also impact the data sources, i.e. data sources might be altered (e.g. add a new one), if the results of the analysis are not satisfying.

4 Implementation

4.1 High-Level Architecture

Figure 4 illustrates the high-level HiDALGO architecture. In order to establish computational infrastructure for the implementation of the generalized workflow, HiDALGO involves resources of the three leading European HPC centers – High Performance Supercomputing Center Stuttgart (HLRS), Poznan Supercomputing and Networking Center (PSNC), and European Centre for Medium-Range Weather Forecasts (ECMWF). HLRS and PSNC provide HPC and HPDA clusters and clouds for general purpose applications. Their resources serve to run simulation and data analytics components for various needs of the GC case studies. ECMWF grants access to its cloud facilities to enable distributed, highly efficient computing. While not giving direct access to its HPC systems, which are closed to external access as it is used to produce time-critical twice-daily global weather forecasts, ECMWF provides the migration and UAP case studies with the forecast model’s output, as well as climate, atmospheric, and hydrological data via the cloud, which necessitates coupling across data centers.

Fig. 4.
figure 4

High-Level HiDALGO architecture

Besides crawled data from Twitter social network for SNA case study, the streaming data for all three case studies comes from MoonStar Communications GmbH (MOON), an international telecommunication company, as well as from ARH Inc. (ARH) and Hungarian Public Road Nonprofit Pte Ltd. Co. (MK), enterprises collecting data from the sensors and traffic lights for the city of Györ.

The overall workflow implementation is orchestrated and monitored by Croupier orchestrator which extends out-of-the-box Cloudify orchestration functionality to the HPC environments [6]. This toolkit provides a unified way to describe complex workflows with YAML files based on OASIS TOSCA standard [3] and to launch them via appropriate scheduling of coupled simulations, data analytics, and data management operations on HPC, HPDA, and cloud resources of multiple data centers at the same time.

Weak and strong coupling mechanisms within the same data center are implemented by means of the tools for coupled simulations (like FabSim) mentioned in Sect. 1.1. Comprehensive Knowledge Archive Network (CKAN) maintains data management operations [7]. In particular, CKAN serves as a single entry point for external static and streaming data, connecting these data sources with HPC and HPDA resources. Apache Kafka enables smooth integration of the streaming data with CKAN and further with the infrastructure of HPC centers [19]. Finally, HiDALGO introduces a custom REST API implementation for coupling with weather and climate data and simulations across data centers. The next subsections explain further details about current state and future plans regarding implementation of coupling mechanisms in HiDALGO.

4.2 Orchestrator and Monitor

The coupling requires that not only the software pieces are synchronized, but also the resource allocation is done in line with the coupling activities. When a ABM simulation needs to be coupled with a CFD simulation, the coupling mechanism must be taken into account in order to facilitate an optimal performance. If we require cyclic coupling, but we run the simulations in different HPC centres, the performance in the messaging part will become a bottleneck, slowing down the whole application.

HiDALGO proposes to use an orchestrator called Croupier [6], which extends Cloudify [8] supporting the usage of HPC resources (by means of plugins able to connect to popular HPC workload managers such as Slurm and Torque). It offers flexibility in combining Cloud and HPC resources. This functionality is used in implementation of the mechanisms for coupling across data centres as described in Sect. 4.5.

The Croupier orchestrator uses the OASIS TOSCA standard [3] to specify the workflows. As described in [6], some extensions were proposed in order to support HPC jobs in the workflow definition. Moreover, it is possible to define dependencies between tasks in the relationships section, by means of the job_depends_on type. This feature allows us to indicate that several tasks represent a coupled simulation and, therefore, the resource allocation should be automated to optimize overall performance. The next step is to adapt that section in the workflow, expressing the type of dependency for specifying the cyclic coupling mechanism: job_mpi_coupled_with, job_data_coupled_with, etc.

It is important to highlight that the coupling has also an impact on the monitoring aspect, since data about the coupling activity itself should be collected, in order to identify potential bottlenecks.

4.3 Coupling Mechanisms for Locally Simulated Models

We distinguish two types of coupling mechanisms:

  1. (i)

    Acyclic coupling refers to the mechanism where the results from the execution of sub-codes are in turn used as input for the execution of the subsequent sub-code. In acyclic coupling sub-codes are not mutually dependent during execution.

  2. (ii)

    Cyclic coupling deals with the simulations where the sub-codes are mutually dependent on each other. Cyclic coupling is further divided into: (a) sequential cyclic coupling: where (two or more) sub-codes execute in alternating fashion, such that output of one is used as input by the other and vice versa; and (b) concurrent cyclic coupling: where two (or more) sub-codes are executed and exchange their outputs to each other concurrently.

In migration case study, we rely on acyclic coupling for the integration between the conflict model and the migration model, as well as between the migration model and any validation activities that are performed (e.g., by comparing to telecommunications data). We also have several cases of cyclic concurrent coupling, particularly in the integration between a more coarse-grained national model, and the more refined local model (which is in turn coupled to weather data from the Copernicus Project).

UAP case study also draws on acyclic coupling for the integration between CFD model of the multicomponent air flow, traffic model, and meteorological forcasts, as well as for incorporation of streaming data from traffic and air quality sensors into the static simulations.

Likewise, SNA case study exploits acyclic coupling mechanism to couple simulations with streaming data from Twitter and static telecommunications data.

4.4 Coupling with External Data Sources

The Data Management System (DMS) is an indispensable part of the generalized workflow implementation. It is assumed as heavily utilized module of the system by many applications on various stages of scenario execution for reading input data and saving outputs from computing. Furthermore, since parallel computing comes into play where simultaneous operations are executed on the same storage resource, the framework must support distributed and parallel operations as well. It drives us to two main problems which should be solved in the proposed DMS framework: efficient data management and reliable computation.

CKAN is a software of choice for maintaining various data management operations in the HiDALGO system [7]. The CKAN is capable to extend its current functionality by applying a number of plugins. This feature is widely explored by HiDALGO developers in order to offer extra functionality which address specific user demands.

In order to ensure the consistency in the data harvesting and processing method and an adequate level of security in HPC environments, data must be delivered first to the DMS. DMS is connected to the application orchestrator available in the portal and defines appropriate links to the data under workflow definition. In HiDALGO, it is assumed that data can be delivered to the DMS in the three different ways:

  • files – data prepared by user as standalone files and uploaded to the CKAN from local workstation;

  • links to external data source – user provides a link to the data located in public area, in the next step data are transferred to the CKAN along with standard procedure;

  • profiled harvester – in this case the functionality of the CKAN is extended by the plugin which enables access to specific external source using dedicated API and/or providing complementary functionality (e.g., data conversion).

There is only one exception when data do not need to be delivered to the DMS before processing in the HiDALGO project, data are sourced from trusted entity. In this case, only link to external data is created in the CKAN system. One of the example of the trusted entity is ECMWF’s data store.

Whenever we need to deal with constantly incoming data (e.g., tweets, traffic data), streaming services must be included as a part of the processing workflow. These types of systems ensure low-latency, high-throughput platform with unified interface, which enables connecting to various external sources. For this purpose, in the HiDALGO project, Apache Kafka [19] system was selected. This is a distributed open-source solution which facilitates building real-time data pipelines for streaming-oriented use cases.

In the HiDALGO project, we are considering streaming data of two different types: (a) camera based traffic data, and (b) air quality monitor based pollution data. The streaming data will be gathered from the multiple sites around the City of Györ to ARH Data Center Server’s GDS middleware software, which will be located at the University of Györ (SZE). SZE will be responsible for coupling these two data streams. For accessing this streaming data from GDS, SZE will be interfacing through GDS’s standardized interface: MultiInterface API.

4.5 Coupling with Weather and Climate Data Across HPC Centres

The purpose of coupling with ECMWF’s weather forecasts, climate reanalysis, global hydrological, and Copernicus Atmosphere Monitoring Service (CAMS) data within HiDALGO is to enable the migration and UAP case studies to explore improving their simulation models through the inclusion of such data. The overall strategy for the coupling of weather and climate data consists of two stages: (a) static coupling of climate reanalysis data, (b) dynamic coupling via a RESTful API.

To enable coupling with climate reanalysis data each relevant case study has access to the Copernicus Climate Data Store (CDS). Both case studies are using ERA5 [16], ECMWF’s latest global atmospheric reanalysis which extends back to 1979. ERA5 is produced daily with a 5 day lag time. Final post-processed data has been manually inserted into CKAN by ECMWF and raw CDS data has been inserted using a custom tool developed by PSNC.

ECMWF will develop a custom Weather and Climate Data API (WCDA) to enable the delivery of weather forecast and climate data to the relevant GC applications running on another HPC centers. The WCDA will be implemented as a RESTful API, using ECMWF’s Polytope software, and follow industry norms in terms of GET/POST/DELETE operations. Due to the sizes of data involved, requests to the WCDA may take a considerable time to fulfill (ranging from minutes to days). As a result, download requests to the WCDA will be asynchronous, returning a pollable URL where the client may check the progress and finally retrieve their data.

Due to its RESTful API design the WCDA will enable GC applications to directly submit requests and retrieve the required weather and climate data so long as they can access the WCDA end-points; for the lifetime of the project it is planned to open dedicated ports at HLRS and PSNC to enable this. If this approach is not possible, a fall-back solution is to have a trusted service running at PSNC on a node with internet access which will retrieve data from the WCDA and insert it into CKAN temporarily; the orchestrator will then manage moving this data to the relevant compute node where the GC application is running.

Fig. 5.
figure 5

Preliminary technical design for Polytope, hosting the Weather and Climate REST API

Figure 5 shows a preliminary technical design for Polytope, which will host the Weather and Climate REST API. ECMWF has a flexible object storage solution which is well optimized for weather & climate data, called FDB (Fields Database). The FDB is an internally provided service, used as part of ECMWF’s weather forecasting software stack. It operates as a domain-specific object store, designed to store, index and serve meteorological fields produced by ECMWF’s forecast model. The FDB serves as a ‘hot-object’ cache inside ECMWF’s high-performance computing facility (HPCF) for the Meteorological Archival and Retrieval System (MARS). MARS makes many decades of meteorological observations and forecasts available to end users. Around 80% of MARS requests are served from the FDB directly, typically for very recently produced data. A subset of this data is later re-aggregated and archived into the permanent archive for long-term availability. Usage of the FDB will allow the WCDA to meet the requirements of data sizes (MiBs to TiBs) and will perform well even with heterogeneous data sizes.

Different collections of data (e.g. forecast data, climate data) will be represented as separate end-points to the WCDA. HTTP requests to each end-point will include authorization metadata in the header and data-specific metadata in the message content. Data will be indexed using multiple scientifically meaningful keys. Endpoints will also be provided to query collection contents and their data schemas.

5 Conclusions and Future Work

Underpinned by the three representative global challenges, the paper derives the generalized workflow targeting these case studies, but also applicable to a wider range of GC. Implementation of the workflow is associated with a number of technical challenges related to coupling. In particular, the paper underlines two aspects of coupling in HPC environments which has not received much attention in the scholarly literature: i) involving external data sources into the static simulations and ii) coupling across data centers. It outlines approaches to circumvent these technical challenges proposed in the HiDALGO project. These approaches rely on extending widely used workflow and data management tools such as Cloudify, CKAN, and Apache Kafka, as well as implementing REST-based mechanisms for coupling across data centers. Moreover, our solution for integrating streaming data in HPC environments can contribute to expanding the use of HPC systems on the domains of Internet of Things and Industry 4.0.

Future research and developments seen from an applied perspective may focus on the following directions: i) further development of the mechanisms for moving large datasets in HPC environments, ii) improvement of the mechanisms for acyclic coupling across data centers and, in particular, enhancement of WCDA, iii) implementation of the strong coupling mechanisms (via messages passing) in the representative case studies, as well as iv) performance evaluation for the proposed solutions.