
1 Introduction

Business reporting is a strategic but demanding activity, defined as “the public reporting of operating and financial data by a business enterprise” [1], or as the regular provision of information to support decision-makers within organizations. Reporting is a fundamental part of business intelligence and knowledge management activities, and it is strongly required by audit organizations. Reporting can be carried out in an ad hoc way by means of specific and complex software, or through the typical extract, transform, and load (ETL) procedures operating in coordination with a data warehouse.

Reports can be distributed in printed form, via email, or accessed via a corporate intranet. In sectors such as banking, reports are required on a regular basis by both national central banks and the European Central Bank, and all required reports must comply with specific templates provided by the organizations themselves.

In particular, reports for auditing are often very specific, and their structure is usually imposed by the supervising organizations (e.g., the European Central Bank, or the rating agency Moody's). The data included in such reports are, in most cases, not useful for decision-making activities, due to the “control” nature of these reports. As a consequence, companies are forced to develop complex systems to compute data that are not useful for their business activities. In this context, there is a compelling need for a new approach able to support the fast and efficient generation of reports.

In this scenario we propose to adopt a data warehouse as the storage system for the data, but we introduce a new approach that designs the multidimensional models on the basis of the structure of the report itself in a (semi-)automatic way, in order to reduce the time needed to produce the report. A data warehouse essentially combines information from several heterogeneous sources into one comprehensive database. By combining all of this information in one place, a company can analyze its data in a more holistic way, ensuring that all the available information has been considered. Data warehousing also enables data mining, i.e., the task of searching for patterns in the data that could disclose hidden knowledge.

At the basis of a data warehouse lies the concept of the multidimensional (MD) conceptual view of data. The main characteristic of this view is the fact/dimension dichotomy, which represents the data in an n-dimensional space. This representation facilitates data interpretation and analysis in terms of facts (the subjects of analysis and the related measures) and dimensions, which represent the different perspectives from which a certain object can be analyzed.

Even though the benefits of data warehousing are well recognized by enterprises, the warehousing process is known to be time consuming, complex, and error prone. Today, the steady reduction of the time-to-market of products forces enterprises to dramatically cut down the time devoted to the design and development of the MD models that support the evaluation of the key performance indicators of services and products.

There are different ways to design a data warehouse, and many tools are available to help different systems “upload” their data into a data warehouse for analysis purposes. However, all techniques are based on first extracting data from the individual sources, then removing redundancies, and finally organizing the data into a format that can be queried.

As a use case for presenting our approach we propose securitization, known in the literature as the financial practice of pooling various types of contractual debt, such as residential mortgages, commercial mortgages, auto loans, or credit card debt obligations (or other non-debt assets which generate receivables), and selling the related cash flows to third-party investors as securities [2]. Mortgage-backed securities, the case study presented in this paper, are a perfect example of securitization. By combining mortgages into one large pool, the issuer can divide the pool into smaller parts based on each individual mortgage’s inherent risk of default, and then sell those smaller pieces to investors.

In this scenario we propose the MMBR (Multidimensional Model By Report) approach, which is able to automatically create the structure of a multidimensional model (MD in the following) and fill it on the basis of a knowledge base enriched with mapping information that depends on the specific application context. The preprocessing phase of the report (often a rather complex Excel file) is based on a table identification algorithm, which extracts the information needed to define the MD structure of the data warehouse. The approach has been tested in the context of financial data, with the aim of automatically creating the reports required by the Italian National Bank and by the European Central Bank.

The term “by report” refers to the capability of our solution to create a multidimensional model starting from a given report (typically expressed as a Microsoft Excel file) that must be filled with real data. MMBR is also able to generate the relational data structure related to the created MD model, and it is also in charge of filling both fact and dimensional tables, supported by domain ontologies and by mapping information to the operational sources.

In the literature there are many methodologies for creating MD models starting from requirements, but this is the first attempt to define an approach for creating an MD model starting directly from the structure of the final reports only.

The remainder of the paper is organized as follows: Sect. 2 introduces the state of the art. Section 3 presents the proposed approach, while Sect. 4 describes the knowledge base, which is a key element of the MMBR methodology, and the report graph. In Sect. 5, the table identification algorithm is presented, while Sect. 6 describes the creation of the MD model. A real example taken from the financial domain is then reported in Sect. 7. Conclusions and final remarks are reported in Sect. 8.

2 Related Work

In the literature, several approaches for creating conceptual MD schemas from heterogeneous data sources have been presented. According to [3], these approaches can be classified into three broad groups:

  • Supply-driven: starting from a detailed analysis of the data sources, these techniques try to determine the MD concepts. With this approach, it is possible to waste resources by specifying unnecessary information structures and by failing to really involve data warehouse users. See for instance [4,5,6].

  • Demand-driven: these approaches focus on determining the MD requirements from an end-user point of view (as typically done for other information systems), and on mapping them to the data sources in a subsequent step (see for example [7, 8]).

  • Hybrid approaches: some authors (see for example [9,10,11]) propose to combine the two previous approaches in order to harmonize, in the design of the data warehouse, the data source information with the end-user requirements.

All the methodologies available in the literature, however, have the goal of creating an MD model as general as possible, in order to allow the generation of any report. This assumption requires considerable effort both in the warehouse conceptualization phase and in the design and development of the ETL procedures. In several industrial contexts, however, there is the need to produce only a limited number of reports and, sometimes, with a very strict and well-defined structure, due to auditing rules or to specific business requirements. In the finance domain, for example, banks are required by central authorities and rating agencies to produce very specific reports related to the securitization activities they perform.

In the field of the Semantic Web, Bontcheva and colleagues [12] present an approach for the automatic generation of reports from domain ontologies encoded in Semantic Web standards such as OWL. The novel aspects of their so-called “MIAKT generator” lie in the use of the ontology, mainly the property hierarchy, to make it easier to connect a generator to a new domain ontology.

Another interesting approach is presented in [13], where Nebot and colleagues consider a Semantic Data Warehouse as a repository of ontologies and other semantically annotated data resources. They represent the topics of analysis, measures, and dimensions in the requirements, and propose an ontology-driven framework to design multidimensional analysis models for Semantic Data Warehouses. This framework provides means for building an integrated ontology, called the Multidimensional Integrated Ontology (MIO), which includes the classes, relationships, and instances representing the analysis developed over dimensions and measures, together with knowledge derived from external ontology sources and domain ontologies.

Romero and colleagues [14] introduce a user-centered approach to support the end-user requirements elicitation and the data warehouse multidimensional design tasks. The authors explain how user feedback is needed to filter and shape the results obtained from analyzing the sources, and eventually produce the desired conceptual schema. In this scenario, they define the AMDO (Automating Multidimensional Design from Ontologies) method, aimed at discovering the multidimensional knowledge contained in the data sources regardless of the user’s requirements.

The implemented process derives the multidimensional schema from a conceptual formalization of the domain, defining a fully automatic supply-driven approach working at the conceptual level. Unlike the idea implemented in this work, which takes the report as its starting point, they consider the queries first. This follows from the categorization they introduced, which divides the different contributions into so-called demand-driven, supply-driven, and hybrid frameworks. The first focuses on determining the end-user multidimensional requirements to produce a multidimensional schema; the second starts from a detailed analysis of the data sources to determine the multidimensional concepts in a re-engineering process; the latter refers to the approaches that combine the two previous frameworks.

Another interesting work aimed at supporting multidimensional schema design is presented in [15], in which the authors propose an extension of their previous work [16]. They follow a hybrid methodology where the data sources and the end-user requirements are reconciled at an early stage of the design process, by deriving only the entities that are of interest for the analysis. The requirements are converted from natural language text into a logical format. The concepts in each requirement are matched to the source ontology and tagged. Then, the multidimensional elements, such as facts and dimensions, are automatically derived using reasoning.

On the other hand, Benslimane and colleagues [17] define a contextual ontology as an explicit specification of a conceptualization, while Barkat [18] proposes a complete and comprehensive methodology to design multi-contextual semantic data warehouses. This contribution aims to provide a context meta-model (language) that unifies the definitions provided in the database literature. This language is conceived as an extension of OWL, the standard proposed by the W3C [19] to define ontologies. It is defined by the authors in order to provide a contextual definition of the used concepts, by externalizing the context from the ontology side.

Pardillo and colleagues [20] present an interesting approach that describes several shortcomings of current data warehouse design approaches, showing the benefits of using ontologies to overcome them. This work is a starting point for discussing the convenience of using ontologies in data warehouse design. In particular, the authors present a set of situations in which ontologies may help data warehouse designers with respect to some critical aspects. Examples are the requirements analysis phase, where the meaning of new concepts and techniques should be clarified for stakeholders, and the phase of reconciling requirements and data sources.

As also considered in our approach, it is important to underline that domain-specific ontological knowledge allows enriching a multidimensional model with aspects that have not been taken into account during the requirements analysis or data-source alignment phases, as well as with other aspects, such as the application of statistical functions to aggregate data. Table 1 summarizes the main concepts discussed in the above-mentioned contributions.

Table 1. Related work summary.

3 Description of the Approach and Outline of the Architecture

The main phases of the MMBR approach are shown in Fig. 1: (1) Table Processing (TP), (2) Row and Column Header Identification and Extraction (RCHIE), (3) Ontology Annotation (OA), (4) Management of Non-Identified labels (MNL), (5) creation of the MD model, (6) ETL Schema Generation (ETL), and, finally, (7) Report Generation (RG). The input of the TP phase is the template file that has to be filled with the data extracted from an Operational Data Base (ODB). In the TP phase the template is preprocessed: icons and other figures are removed, all terms in the schema are converted to lower case, and comment and description fields are removed.

Fig. 1. Overall representation of the approach.

The RCHIE phase is based on the table identification algorithm aimed at identifying and extracting the row and column headers in the template. The details of the table identification algorithm are presented in Sect. 5.

The list of terms recognized in the reports by the table identification algorithm is then annotated on the basis of a knowledge base (see Sect. 4). This phase produces two lists: the first one is the list of identified terms annotated w.r.t. the knowledge base; the second one is the list of terms that could not be annotated. There are several possible reasons for the failure of the annotation activity. The most frequent one is that a given term may not be included in the knowledge base because it is not relevant to the domain (e.g., “Total”). It is also possible that a term is not annotated because it is a composition of different terms (such as “MortgageLoan” or “DelinquentLoan”). Moreover, some terms may be written in a language other than English (e.g., “garantito”, which means guaranteed in Italian). In all these cases, non-annotated terms are manually checked and, if relevant, added to the ontology by defining the corresponding rdfs:label property.

The annotated list of terms is the input for the creation of the dimensional fact model (see Sect. 6). This logical model is translated into a relational star schema. In this phase the relational database is filled with data coming from the ODB, on the basis of the mapping rules included in the knowledge base; this activity is fully described in Sect. 4. Once the data warehouse is filled, the report generation phase populates the report template by translating the annotations of the report graph into SQL queries executed over the data warehouse. The query results are then inserted in the report template to generate the final output.
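As an illustration, the annotation lookup can be sketched as a SPARQL query matching an extracted term against the rdfs:label values of the knowledge base; the case-insensitive comparison strategy shown here is our assumption, not the published implementation:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Retrieve the KB concept(s) whose label matches the term
    # "guaranteed" extracted from the report (case-insensitive).
    SELECT ?concept WHERE {
      ?concept rdfs:label ?label .
      FILTER (lcase(str(?label)) = "guaranteed")
    }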

Fig. 2. Representation of the overall architecture.

The architecture supporting the MMBR approach is represented in Fig. 2. The Annotation Editor is in charge of the first three phases of the MMBR approach: it removes non-relevant strings and images from the input file (e.g., logos, comments), identifies the terms that are annotated w.r.t. the KB, and creates the Report Graph. The Schema Builder is the software component aimed at creating the logical relational description of the MD model. The ETL Generator is in charge of extracting, on the basis of the Report Graph and the KB, the information necessary to extract, transform, and load the data from the ODB into the data warehouse. The Knowledge Base Manager is in charge of managing and evolving the knowledge base; any popular tool, such as Protégé, may be used for the KB creation. Finally, the Report Generator fills the report template by retrieving the data from the DW according to the queries built on the basis of the annotations included in the Report Graph.

4 The MMBR Knowledge Base and the Report Graph

At the core of the proposed approach lies the creation of the knowledge base KB, which includes:

  • the set of MD concepts and relations (fact, dimensions, measures, attributes);

  • the list of terms adopted in the specific application domain (e.g., e-commerce, bank securitization, ...);

  • the Operational DataBase (ODB) schema.

In order to create a knowledge base that could be easily shared in the financial domain, we started from an already existing ontology and created new concepts only when needed.

The ontology we used as a starting point for creating the new MD concepts in the KB is a simplified version of the RDF Data Cube vocabulary, a W3C recommendation for modeling multidimensional data. The top-level representation of the defined KB is shown in Fig. 3.

Fig. 3. Top level representation of the knowledge base.

The MD concepts are organized as follows. A fact (the event that is the target of a report, e.g., a sale in the e-commerce domain, or a loan in the bank domain) is described by a set of measures and can be analyzed by considering its dimensions and descriptive attributes. In the Data Cube vocabulary, dimensions, measures, and descriptive attributes are described by instances of the component property concepts. Dimensions, measures, and descriptive attributes are terms of the application domain, and they are defined by the human (domain) expert through the knowledge base. In fact, the KB annotation specifies whether a KB component refers to a fact, a measure, or a dimension. Such elements are then compared with each label extracted from the Excel file in order to define the facts, measures, and dimensions of the corresponding model. In order to build a KB related to the e-commerce domain it is possible, for example, to use concepts described in the GoodRelations vocabulary. In this scenario, instances of qb:DimensionProperty are gr:ProductOrService and gr:Brand, while instances of qb:MeasureProperty are gr:UnitPriceSpecification and gr:amountOfThisGood. If no vocabulary is available, a new, ad hoc vocabulary has to be defined first (as also reported in Sect. 7).

The concept qb:ComponentProperty can have one or more rdfs:label properties associated with it, which represent the references to the instances of the target concept. For example, the dimension gr:Brand may be labeled as “NameOfProduct” or “BrandName”. During the annotation phase, these labels are used to associate the terms of the report with the application domain concepts.
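A minimal Turtle sketch of such a KB fragment for the e-commerce example is given below; the label values follow the text, while treating the GoodRelations terms directly as Data Cube component properties is our simplifying assumption:

    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix gr:   <http://purl.org/goodrelations/v1#> .

    # gr:Brand acts as a dimension of the MD model; the labels are
    # used to match the terms appearing in the report headers.
    gr:Brand a qb:DimensionProperty ;
        rdfs:label "NameOfProduct" , "BrandName" .

    # gr:UnitPriceSpecification acts as a measure.
    gr:UnitPriceSpecification a qb:MeasureProperty ;
        rdfs:label "UnitPrice" .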

In order to populate the MD model it is necessary to know how the qb:ComponentProperty instances are described in the operational database (ODB). This mapping is described in the KB itself, by means of the c2t:mappingRule concept, which associates a c2t:mappingFormula with a given instance of qb:ComponentProperty. The c2t:mappingFormula contains a reference to some tables of the ODB and a query predicate over their tuples.

For example, in a bank scenario we can assume that the TLoan table of the ODB contains all the information related to loans. A loan with a fixed rate (i.e., a loan whose interest rate remains the same throughout the term of the loan) can be represented in the ODB by the predicate InterestRate = 1, while a floating rate can be described by the predicate InterestRate > 1. The c2t:mappingFormula includes the reference to the TLoan table and the predicate regarding the InterestRate.
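A possible Turtle encoding of this mapping is sketched below; the c2t: namespace URI and the hasMappingFormula/onTable/predicate property names are illustrative assumptions, since the actual vocabulary is internal:

    @prefix c2t:  <http://example.org/c2t#> .
    @prefix loan: <http://example.org/ontoLoan#> .

    # Fixed-rate loans are identified in the ODB by InterestRate = 1.
    c2t:fixedRateRule a c2t:mappingRule ;
        c2t:refersTo loan:FixedRate ;
        c2t:hasMappingFormula [
            a c2t:mappingFormula ;
            c2t:onTable "TLoan" ;
            c2t:predicate "InterestRate = 1"
        ] .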

The concept c2t:context in Fig. 3 comes into play when the reports required by different audit authorities need different mapping formulas for the same qb:ComponentProperty. For example, a given audit authority may classify a company as “small” if it has fewer than 10 employees, while for another authority a company is small if it has fewer than 15 employees. In this case we will have two different c2t:mappingFormula instances.
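The two context-dependent formulas of the “small company” example could then be encoded as follows (again with illustrative names):

    @prefix c2t: <http://example.org/c2t#> .

    # Authority A: a company is "small" if it has fewer than 10 employees.
    c2t:smallCompanyFormulaA a c2t:mappingFormula ;
        c2t:hasContext c2t:contextAuthorityA ;
        c2t:predicate "EmployeeNumber < 10" .

    # Authority B: a company is "small" if it has fewer than 15 employees.
    c2t:smallCompanyFormulaB a c2t:mappingFormula ;
        c2t:hasContext c2t:contextAuthorityB ;
        c2t:predicate "EmployeeNumber < 15" .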

4.1 Report Graph

Starting from this schema, the ontology has been defined by using the Protégé editor [21]. Protégé is a free, open-source ontology editor and knowledge management system with a user-friendly graphical interface. It also includes classifiers to validate that models are consistent and to infer new information from the analysis of an ontology [21, 22].

The report graph is an RDF representation of the report template, which includes the annotation of each value cell in terms of KB elements.

The top level of the ontology represents the description of the structure of the reports. It includes a set of qb:Observation elements (i.e., one for each cell of the report), each of them characterized by two properties (c2t:hasPosX and c2t:hasPosY) representing its coordinates in the report. An observation element may contain a measure (c2t:hasValue) if it holds values from the ODB aggregated by means of an aggregation operator (e.g., sum, avg, ...), as in a traditional data warehouse. Moreover, an observation element includes the dimensions used for the analysis. Both dimensions and measures are fully described in the KB. An example of a report graph annotation is given below.

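A plausible Turtle reconstruction of the annotation, consistent with the description that follows, is sketched here; the namespace URIs and the aggregation/dimension property names are our assumptions:

    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix c2t:  <http://example.org/c2t#> .
    @prefix loan: <http://example.org/ontoLoan#> .

    # Observation describing the value cell in column "c", row "9".
    c2t:cell_c9 a qb:Observation ;
        c2t:hasPosX "c" ;
        c2t:hasPosY "9" ;
        c2t:hasValue loan:OutstandingPrincipal ;    # measure to aggregate
        c2t:hasAggregationOperator "SUM" ;          # hypothetical property
        c2t:hasDimensionValue loan:Performing , loan:Mortgage .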

In the example, the cell whose coordinates are column “c” and row “9” will contain the sum of the Outstanding Principal extracted from the loans that are, at the same time, Performing and with a mortgage guarantee.

5 Table Identification

Reports required by audit organizations are usually structured documents represented as tables. Each table can be divided into different areas or sections, according to its structure. Thus, being able to correctly identify the inner structure of the table is important in order to find the concepts relevant to the generation of the MD models. As discussed in the introduction, a multidimensional model represents the data in an n-dimensional space; under this perspective, each report can be considered as one of the possible hyperplanes slicing the n-dimensional cube of data. To represent this hyperplane in a two-dimensional table it is necessary to reduce the number of dimensions. In Fig. 4 the MD model is composed of three dimensions (time, nation, and type of sold goods) that are “flattened” into a two-dimensional space by associating the values of the type of sold goods (Food and non-Food) with the nation dimension. According to this assumption, row and column headers may contain dimensions, values of dimensions, and measures of the MD model.

Fig. 4. Example of report.

In our approach, during the RCHIE phase a table is assumed to be composed of three types of cells: textual, data, and schema cells. Figure 5 shows the general schema. The cell identifiers are represented by the pair \(\langle X, Y\rangle\), as reported in the table shown in Fig. 5.

Fig. 5. Example of table used by the table identification algorithm.

The table may contain several types of cells, defined in the following way:

  • textual-cell: this kind of cell is not used for table annotation; these cells are shown in grey in Fig. 5, and they may contain simple text.

  • data-cell: it contains data that are computed on the basis of the MD model. These cells are represented in white in the figure.

  • schema-cell: it specifies properties over a set of data-cells. It is shown in dark grey in the figure. This cell defines the header \(h = \langle x, y\rangle\) of a set of data-cells, by specifying some semantic aspects (i.e., the measure or a value on a dimension).

Rows and columns are identified in order to extract the labels corresponding to measures, dimensions, instances of the dimensions, etc., as well as non-relevant information, such as the TOTAL value shown in Fig. 6. These labels represent the input of the annotation phase, which produces the annotated list of terms as output.

Fig. 6. Example of table.

In the literature, different table identification algorithms aimed at handling table structures have been proposed [23]; in our work the focus is on identifying and removing multi-spanning cells. An example is reported in Fig. 6, where one of the reports related to securitization is shown. The Stub Header cell details information w.r.t. the measures Loan and Outstanding Principal for different types of companies, such as Corporate, SME, and “Impresa” (which refers to retail companies in Italian jargon). Measures, names, and instances of dimensions are placed in the Box Header and/or Stub areas as headers, and they are used to index the elements located in the Body area of the table. The Stub Header may also contain a header naming or describing the dimensions located in the Stub. The outcome of the table identification algorithm is shown in Fig. 7, where all data-cells are semantically associated with their row and column headers.

Finally, the RCHIE phase extracts the list of the unique terms appearing in the column and row headers. These terms are then annotated by means of the knowledge base, by comparing the labels related to the application domain concepts with the terms extracted from the report table. In this task, the domain experts are required to take action only for those text strings that do not have a corresponding ontology term (an example is given by the string “performing” and the Italian term “in bonis”).

6 The Dimensional Fact Model

Generally speaking, the Dimensional Fact Model (DFM) [24] is a graphical conceptual model for data warehouses aimed at:

  • effectively supporting the conceptual design,

  • defining an environment over which user queries can be intuitively expressed,

  • allowing the interaction between the designer and the final user in order to refine specific requests,

  • producing useful and unambiguous documentation.

Fig. 7. Example of flattened table.

The conceptual representation derived from the DFM is defined by a set of fact schemas. The basic elements modeled by such schemas are the so-called facts, measures, and dimensions. A fact is relevant to the decision-making process: it models a set of events coming from the analysis context, and it needs to evolve over time. A measure represents a numeric property of a fact, and it describes a quantitative aspect useful for further analysis. Finally, a dimension is a property of a fact with a finite domain, and it describes one of the coordinates of analysis.

According to the literature [4] a dimensional scheme consists of a set of fact schemes. The components of fact schemes are facts, measures, dimensions and hierarchies. A fact is a focus of interest for the decision-making process; typically, it models an event occurring in the enterprise world (e.g., sales and shipments). Measures are continuously valued (typically numerical) attributes which describe the fact from different points of view; for instance, each sale is measured by its revenue. Dimensions are discrete attributes which determine the minimum granularity adopted to represent facts; typical dimensions for the sale fact are product, store and date. Hierarchies are made up of discrete dimension attributes linked by one-to-one relationships, and determine how facts may be aggregated and selected significantly for the decision-making process. The dimension in which a hierarchy is rooted defines its finest aggregation granularity; the other dimension attributes define progressively coarser granularities. Hierarchies may also include non-dimension attributes. A non-dimension attribute contains additional information about a dimension attribute of the hierarchy, and is connected by a one-to-one relationship (e.g., the address); unlike dimension attributes, it cannot be used for aggregation. At a conceptual level, distinguishing between measures and dimensions is important since it allows the logical design to be more specifically aimed at the efficiency required by data warehousing applications.

6.1 Queries Representation

In general, querying an information system means linking different concepts through user-defined paths in order to retrieve some data of interest; in particular, for relational databases this is done by formulating a set of joins to connect relation schemes. On the other hand, a substantial number of queries on DWs are aimed at extracting summary data to fill structured reports to be analysed for decisional or statistical purposes. Thus, a typical DW query can be represented by the set of fact instances, at any aggregation level, whose measure values have to be retrieved.

The sets of fact instances can be denoted by writing fact instance expressions. The simple language proposed in the literature [24] is aimed at defining, with reference to a dimensional scheme, the queries forming the expected workload for the DW, to be used for logical design; thus, it focuses on which data must be retrieved and at which level they must be consolidated.

A fact instance expression has a general form combining a pattern clause and a selection clause.

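An indicative sketch of such an expression, assuming a sale fact (the exact syntax is defined in [24]), might be:

    sale[product.type, store.city; date.year = '2020']

where product.type, store.city plays the role of the pattern clause and date.year = '2020' that of the selection clause.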

The pattern clause describes a pattern. The selection clause contains a set of Boolean predicates which may either select a subset of the aggregated fact instances or affect the way fact instances are aggregated. If an attribute involved either in a pattern clause or in a selection clause is not a dimension, it should be referenced by prefixing its dimension name.

6.2 MD Creation and Population in the MMBR Approach

The Dimensional Fact Model (DFM) [4] approach has been used in our solution to describe the MD model. The list of annotated terms and the KB are the only two elements necessary to design and populate the MD model. Each annotated term of the list is enriched with its type or subclass in order to understand whether it is a measure, a dimension, or an instance of a dimension. This is realized by means of a set of SPARQL queries over the KB (an example of such a query is shown in Sect. 7), generated by the Schema Builder component. With this information it is possible to create the DFM and the corresponding logical relational schema by means of the original methodology proposed in [4]. The relational schema is then populated according to the mapping information defined in the knowledge base.

All dimensional tables are populated with the instances defined in the KB, while the fact table is filled in a two-step procedure. In the first step, all instances of the facts (e.g., sales or loans) are selected from the ODB, taking into account only the measures available in the annotated list. The second step is in charge of connecting the fact table with the dimensional tables: an UPDATE query is executed to associate each instance of the fact table with the instances of the dimension tables. In this case too the KB plays a strategic role, since it allows extracting the mapping formulas at the basis of the SPARQL queries (see Sect. 7).

7 Case Study

After a brief introduction to the ontology defined in this work, two examples of financial reports that have to be filled are presented, together with examples of the mapping rules and of the defined SPARQL queries. The implemented software prototype supporting the MMBR approach and a brief discussion about the methodology are finally presented.

7.1 The Considered Reports

The scenario motivating the definition of a report-driven approach for the design of multidimensional models is related to the financial domain. In particular, the reporting activity of securitization was analyzed.

When applying the MMBR approach in this context, the first activity to be faced is the generation of the domain KB and vocabulary. The literature proposes two different vocabularies that partially describe the loan domain: FIBO and Schema.org. FIBO, the Financial Industry Business Ontology, contains the loan term definitions without any further specification. Schema.org does not contain a fully exhaustive specification of the securitization domain, but it includes the LoanOrCredit concept only. The KB defined in this work to describe the securitization domain is the OntoLoan ontology. During the KB definition, domain experts are in charge of defining the main terms and concepts. The OntoLoan ontology is not freely available, since it is covered by the company’s intellectual property. However, the top level of OntoLoan is shown in Fig. 8, while Fig. 9 shows an example of a securitization report. Note that all private data related to the bank owning the report have been removed for privacy reasons, while the values for the different kinds of loans are reported.

The term Performing Loans refers to those loans that have no overdue interest payments, or whose unpaid installments are still below the maximum number of days of delay allowed (which changes according to the securitization contract terms). Delinquent Loans are loans close to default, i.e., loans with unpaid installments whose delay is close to the limit on the number of days overdue. Defaulted Loans, finally, are loans with significant delays in payments.

Each kind of loan is further divided according to other features, leading to the definition of Mortgage Loan, Guaranteed Loan, i.e., loans insured not by mortgages but by other guarantees (e.g., pledges), and, finally, Unguaranteed Loan, i.e., loans that are not insured.

The first phase of the MMBR approach removes from the report the text fields that do not carry relevant information. An example of removed text is the string “A. PORTFOLIO OUTSTANDING BALANCE”. The annotation tool removes the cell spanning from the table of Fig. 9, creating the table structure shown in Fig. 10. The data-cell in position \(\langle 3,3\rangle\) represents the aggregation of the values of the Outstanding Principal of loans that are both performing and able to pay off the loan even in case of default of the borrower. The value in the cell in position \(\langle 3,4\rangle\) represents the aggregation of the Outstanding Principal of loans that are both performing and guaranteed. The mapping rule MR1 is described below.

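A plausible Turtle reconstruction of MR1, consistent with the explanation that follows (the property names mirror the illustrative ones used in Sect. 4), is:

    @prefix c2t:  <http://example.org/c2t#> .
    @prefix loan: <http://example.org/ontoLoan#> .

    # In context1, defaulted loans are those with rating34 = 10
    # and rating5 between 1 and 7.
    c2t:MR1 a c2t:mappingRule ;
        c2t:hasContext c2t:context1 ;
        c2t:refersTo loan:Defaulted ;
        c2t:hasMappingFormula [
            c2t:onTable "TLoan" ;
            c2t:predicate "rating34 = 10 AND rating5 BETWEEN 1 AND 7"
        ] .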

The rule MR1 indicates that, for the context context1, the defaulted value (an instance of the performance category) is associated with the loans having a rating34 equal to 10 and a rating5 between 1 and 7.

7.2 SPARQL Queries Definition

With this first activity, the following list of terms related to the domain is extracted: loan:Performing, loan:Mortgage, loan:Guaranteed, loan:Unguaranteed, loan:Delinquent, loan:Defaulted, loan:DelinquentInstalments, loan:OutstandingPrincipal, loan:AccruedInterest, loan:PrincipalInstalment, loan:InterestInstalment. For each element of the list, MMBR retrieves from the KB the name of the dimension or measure related to it, by means of SPARQL queries.

An example of such a query is given below.

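A hedged sketch of the query, which looks up the dimension (or measure) a term belongs to (the namespaces are our assumptions):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX qb:   <http://purl.org/linked-data/cube#>
    PREFIX loan: <http://example.org/ontoLoan#>

    # Find the category the term belongs to and its MD role
    # (dimension or measure).
    SELECT ?category ?mdType WHERE {
      loan:Guaranteed a ?category .
      ?category rdfs:subClassOf ?mdType .
      FILTER (?mdType IN (qb:DimensionProperty, qb:MeasureProperty))
    }
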
Fig. 8. The top level representation of OntoLoan.

Fig. 9. An example of report template.

Fig. 10. An example of flattened report.

The example query is able to recognize, as shown in Fig. 8, that the loan:Guaranteed element is a member of an entity named Guarantee_Category, which is a subclass of qb:DimensionProperty. Figure 8 also shows the properties involved in the query. After the identification of the measures and the dimensions, the DFM is designed as shown in Fig. 11, following the literature (see [25]) for the schema definition.

The DFM is then translated into a relational schema, whose instance is created in a relational DBMS as described in Sect. 6, and shown in Fig. 12.

Fig. 11. Dimensional fact model schema example.

Fig. 12. Data warehouse schema.

In order to update the fact table, the mapping formula can be retrieved from the KB by means of a SPARQL query. For example, to update the guaranteed loans, we first retrieve the corresponding mapping formula from the KB.

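A sketch of such a retrieval query, under the same naming assumptions as above:

    PREFIX c2t:  <http://example.org/c2t#>
    PREFIX loan: <http://example.org/ontoLoan#>

    # Retrieve the ODB predicate that identifies guaranteed loans
    # in the context at hand.
    SELECT ?predicate WHERE {
      ?rule a c2t:mappingRule ;
            c2t:refersTo loan:Guaranteed ;
            c2t:hasContext c2t:context1 ;
            c2t:hasMappingFormula ?formula .
      ?formula c2t:predicate ?predicate .
    }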

The result is the following predicate:

[predicate over the loans table retrieved from the KB]

The corresponding update query, expressed in IBM DB2 SQL, is sketched below.

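A minimal sketch of the update, assuming a Loans fact table, a guarantee_category dimension attribute, and a hypothetical predicate in place of the one actually retrieved from the KB:

    -- Associate the fact-table rows satisfying the retrieved mapping
    -- predicate with the 'Guaranteed' member of the guarantee dimension.
    UPDATE Loans
    SET guarantee_category = 'Guaranteed'
    WHERE GuaranteeType = 2;   -- hypothetical predicate from the KB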

The last phase is related to the generation of the report. Let us assume we want to generate a report where the cell corresponding to the coordinates F and 22 contains the sum of the interest installments of all defaulted and unguaranteed loans. In the report graph, the cell F22 is annotated as shown below.

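A plausible Turtle reconstruction of this annotation, with the same illustrative property names used above:

    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix c2t:  <http://example.org/c2t#> .
    @prefix loan: <http://example.org/ontoLoan#> .

    # Observation describing the value cell in column "F", row "22".
    c2t:cell_F22 a qb:Observation ;
        c2t:hasPosX "F" ;
        c2t:hasPosY "22" ;
        c2t:hasValue loan:InterestInstalment ;
        c2t:hasAggregationOperator "SUM" ;
        c2t:hasDimensionValue loan:Defaulted , loan:Unguaranteed .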

The ReportGenerator module creates the aggregate SQL query able to fill the cell. First, it retrieves the correct aggregation operator to be used (here, a sum); then, by querying the RDF fragment of the report graph, it discovers that the values to be aggregated belong to the attribute InterestInstalment. To create the FROM clause of the query, the ReportGenerator interrogates the ontology, finding that both the ontoLoan:Defaulted and ontoLoan:Unguaranteed concepts are instances of ontoLoan:DimensionalProperty and that both attributes are included in the Loans table; thus, the FROM clause includes the Loans table only. The WHERE clause is composed as a conjunction of the predicates current_performance_category = Defaulted and guarantee_category = Unguaranteed. The final SQL query computing the value of the cell F22 is shown below.

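Following the description above, the query has this shape (string quoting is assumed):

    SELECT SUM(InterestInstalment)
    FROM Loans
    WHERE current_performance_category = 'Defaulted'
      AND guarantee_category = 'Unguaranteed';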

The generated report is shown in Fig. 13.

Fig. 13. Report generated.

7.3 Discussion

The MMBR methodology and the related techniques support the creation of a multidimensional model able to produce a given (set of) report(s). The term “by report” refers to the capability of our solution to create a multidimensional model starting from a given report (typically expressed as a Microsoft Excel file) that must be filled with real data. MMBR is also able to generate the relational data structure related to the created MD model, and it is also in charge of filling both fact and dimensional tables thanks to the use of domain ontologies enriched with mapping information to the operational sources. According to the literature, this is the first attempt to define a methodological approach for creating an MD model starting directly from the reports only. The methodology starts with the acquisition of the Excel file; then, thanks to a table identification algorithm, it extracts from the report the row and column headers representing domain concepts, and the extracted terms are annotated by using a domain ontology enriched with MD concepts. The ontological terms are finally used to design the MD model.

One of the most important elements of our methodology is the use of a domain ontology, applied in order to annotate the terms available in the report. Such ontological terms are used to identify the facts, dimensions, and instances of dimensions that allow the creation and population of the MD model, by generating the Dimensional Fact Model and the corresponding relational schema.

A prototype supporting the proposed methodology has been developed for the experimental session. The report graph is created by using Protégé, which is adopted to support the RCHIE phase in a semi-automated way. The phases of the methodology involved in the creation and population of the data warehouse are supported by a custom tool we named “CreDaW” (Create a Data Warehouse).

The tool, as described in Sect. 6, creates the DW schema by querying the report graph. The relational tables implementing the DW schema are populated by querying the KB, as reported in Sect. 7. The prototype has been developed and tested on a personal computer with a 3.4 GHz Intel i7-6700 CPU, 16 GB of RAM, and a 1 TB HDD, and it is able to create and populate a DW in two different relational database management systems, MySQL and IBM DB2.

The data warehouse population phase requires around 13 s for loading more than 400,000 loans.

The last phase of the methodology (“Report Generation”) is totally automated by the ReGe tool. The tool reads the report graph and, by using the Apache POI library (https://poi.apache.org/), the report template. For each observation in the report graph, an SQL query is created and executed; the result fills the corresponding cell of the report template. A report generation requires less than 2 s.

8 Conclusions and Future Work

This work presents the “Multidimensional Model By Report” (MMBR) approach, which supports the creation of multidimensional models able to produce a given (set of) report(s). The term “by report” refers to the ability to create a multidimensional (MD) model starting from a given report (typically expressed as a Microsoft Excel file) that has to be filled with data extracted from a set of heterogeneous sources.

Important contributions concern the automatic generation of the relational data structure correlated to the MD models generated by the approach, and the ability to fill both fact and dimensional tables on the basis of domain ontologies enriched with mapping information related to the data sources.

There are several possible future directions of research. The first one is the definition of an approach for the automatic computation of aggregates of data according to the topological position of the cells that contain them, taking row and column headers into account.

Another interesting research activity will study how to enrich the table identification algorithm, with the aim of handling a larger number of report types than the current algorithm, thereby improving the efficiency of the presented approach.