
1 Introduction

As datasets have grown vastly larger in recent years, the risk that a database might contain “invalid” data—i.e., data that are erroneous, redundant or obsolete—has also escalated. This trend highlights the need for effective procedures for data cleaning, broadly defined as the detection and removal of errors and redundant or obsolete information from datasets of various structures in order to improve their quality.

In general, data cleaning entails three main processes, which have been discussed extensively in prior research (see, e.g. [1,2,3,4]): (i) identification of invalid records; (ii) validation, in which errors are corrected and invalid information is eliminated; and (iii) imputation, i.e., the replacement of invalid records.

In a very large database, it can be highly time-consuming and even computationally infeasible to execute each of these processes on the complete data structure. Herein, we develop a methodology for improving the efficiency of data cleaning in a hierarchical database (HDB)—the oldest and simplest type of database, in which the data are structured as a (tiered) hierarchical tree-type graph. In an HDB, data cleaning ultimately entails transforming the original database into a reduced graph model with a smaller set of tiers and nodes, in which only relevant information is retained.

The basic idea of our approach is to quantify the extent to which invalid data are likely to have caused information quantity loss in specific components of the database, thereby enabling the decision maker to prioritize these components for cleaning and to ignore components in which information loss is less severe. Though diverse definitions exist for information quantity in an HDB (see [5, 6] for in-depth discussions), we suggest that entropy—and specifically, Shannon’s information entropy [7, 8]—can serve as a useful tool for measuring the quantity of information in a database [9, 10] and for identifying where the most informative records are located. In general, entropy is a measure of uncertainty, chaos, and absence of knowledge. Thus, entropy taken with the opposite sign is equivalent to the information amount, or the knowledge available [7, 8].

To our knowledge, entropy-based approaches to information measurement and data cleaning in databases have received comparatively little attention in the literature. Yet, entropy is used in many domains as a means of evaluating system complexity (e.g., [11,12,13,14,15,16]). Several studies have explored the fact that a high level of entropy (or chaos) in a complex system impedes perfect system performance: the greater the entropy, the more uncertain is the state of the entire complex system and, consequently, a larger amount of information is required to monitor and manage the system. Adopting this perspective, herein we consider entropy as a complexity measure for estimating the amount of information contained in specific components of an HDB.

Prior studies [12, 15, 16] used entropy-based approaches in considering simple single-tier production systems. The present paper extends these approaches, in particular the approach suggested by Karp and Ronen [16], to the case of multi-layer hierarchical data structures. However, in contrast to the prior works, we investigate entropy in databases rather than in production systems, and hence the contents and measurement of the entropy differ.

Though a key objective in developing our approach is to provide an efficient means of cleaning the data in an HDB (towards improving data quality), our methodology enables us to achieve a broader objective: producing a more streamlined representation of an HDB. Specifically, as elaborated in what follows, our approach can be used to identify which data tiers in an HDB contain essential amounts of information. In effect, the decision maker might choose to eliminate subsequent nonessential tiers, thereby reducing the size of the HDB without losing important information.

The remainder of the paper is structured as follows. In the next section we provide a basic overview of the HDB architecture. Section 3 provides basic definitions used in our entropy-based approach, and Sect. 4 introduces the approach itself. In Sect. 5, we briefly describe the application of our approach to two real-life databases. Section 6 concludes.

2 Overview of the HDB Architecture

An HDB architecture is a tree-like graph structure in which the data are stored as records, where a record is defined as a collection of fields [2,3,4]. Each node in the graph represents a cluster of records. The nodes are organized in tiers that are sequentially linked to one another through graph arcs. Specifically, the “top” tier—tier zero, the root node—contains all initial records in the database. The root is linked to a first tier of nodes, representing clusters of these records. Each of these nodes is linked to a second tier of sub-clusters, and so on. We study an HDB by traversing its graph from the top down, starting from the root node.
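
For concreteness, such a tree of clusters can be represented in memory as follows. This is a minimal Python sketch under our own naming (HDBNode, tier); it is not tied to any particular database implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class HDBNode:
    """One node (cluster) of the hierarchical database: its records plus its sub-clusters."""
    records: List[Dict]                                      # data records grouped in this cluster
    children: List["HDBNode"] = field(default_factory=list)  # clusters of the next tier

def tier(root: HDBNode, s: int) -> List[HDBNode]:
    """Return tier L_s: all nodes at distance s from the root (s = 0 gives the root itself)."""
    level = [root]
    for _ in range(s):
        level = [child for node in level for child in node.children]
    return level
```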

HDB models are often used in situations in which the primary focus of information gathering is on a concrete system hierarchy, such as a list of components of complex equipment or a company organization chart. Examples of such databases are shown in Figs. 1 and 2, which, respectively, illustrate the hierarchical structures of a pharmaceutical materials database and an automobile component database. Specifically, Fig. 1 is an example of an HDB representing the distribution of pharmaceutical materials managed by a healthcare provider. The root node contains all records of all pharmaceutical materials that the provider (Distribution Center) holds. In the first tier, these records are clustered according to their distribution across the large geographical regions that the provider services. The second tier represents sub-regions, and the third tier represents the records clustered according to the individual clinics corresponding to each sub-region.

Fig. 1. A schematic structure of a database of pharmaceutical inventory held by a distribution center (adapted from Levner et al. [14])

Fig. 2. A structure of a database of car components produced by an automobile manufacturer (adjusted from Levner and Ptuskin [13])

The HDB illustrated in Fig. 2 represents the products produced by a car manufacturer. In this HDB, the root node contains all information about all car components produced by the manufacturer. The first tier contains clusters of components corresponding to large, ready-to-install subsystems, e.g., body assembly, seat assembly, engine module, etc. The second tier represents sub-components of the larger components represented by the first-tier nodes, and so on.

A third example is an HDB containing patient data for a healthcare provider. In this case, the root node contains all the available patient records (see [17, 18]). The nodes (clusters) of the first tier hold the records of patient subgroups (e.g., patients in specific age groups). For each of these subgroups, the second-tier nodes divide the records by gender. The third tier contains, for each of the latter clusters, sub-clusters of records grouped according to the number of times the patients visited doctors, and so on.

3 Basic Definitions

Our approach involves using previously-gathered statistical data regarding invalid records in the dataset to estimate the entropy—i.e., to quantify the information—associated with specific nodes in the HDB. In developing our approach, we conceptualize the invalid data records present in an HDB as a set of random events. Specifically, an invalid data record in a database is referred to as an adverse event.

We assume that, at some previous point in time and for a prespecified period of time (e.g., one year), an expert observed and made note of adverse events in the various components of the HDB, as well as potential causes for these events—referred to as risk drivers. For example, a risk driver for an erroneous data record in an HDB may be an operator’s error or malicious activity by a hacker. A typical risk driver for redundant data is integration of data from multiple sources.

The expert-produced list of adverse events and their corresponding risk drivers is referred to as the risk protocol [13]. Similarly to [13], for each node i in the HDB, we assume that the information stored for that node in the risk protocol—i.e., the adverse events that occurred in that node over the prespecified time period and their risk drivers—is recorded in a table, TBLi. Each row in the table corresponds to an individual adverse event occurring in node i. Given the tables TBLi for all the nodes of each HDB tier, the statistical data needed to apply our approach can be derived.
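
For illustration only, a row of TBLi can be held as a small record whose fields mirror the risk protocol; the identifiers, driver indices and notes below are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AdverseEvent:
    """One row of the risk-protocol table TBL_i for node i."""
    record_id: str     # identifier of the invalid data record
    risk_driver: int   # index f (1..N) of the risk driver the expert assigned to the event
    note: str = ""     # optional free-text comment from the risk protocol

# A hypothetical TBL_i with two observed adverse events:
TBL_i: List[AdverseEvent] = [
    AdverseEvent("rec-0173", risk_driver=2, note="operator typing error"),
    AdverseEvent("rec-0412", risk_driver=5, note="duplicate from merged sources"),
]
```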

Define a tier Ls (also denoted as tier s) as the set of nodes that are at the same distance s from the root node n0 in the graph of the HDB (see Figs. 1 and 2). Define a subtree Ts as a union of the tiers \( L_{0}, L_{1}, \ldots, L_{s} \).

We assume that, for each node (cluster) of the HDB, the list of risk drivers \( F = (f_{1}, f_{2}, \ldots, f_{N}) \) corresponding to the adverse events in that node is given. For simplicity but without loss of generality, we assume that N, the number of all possible risk drivers corresponding to the node, is the same for all tiers, and that any adverse event is caused by a single risk driver. We call a cluster invalid if some invalid records are found in it.

Introduce the following notation:

  • \( n_{s} \) – the total number of clusters in tier s;

  • \( w(s) \) – the number of invalid clusters in tier s registered in the risk protocol;

  • \( m(i, s) \) – the number of records in cluster i of tier s;

  • \( v_{f}(i, s) \) – an estimate of the percentage (frequency) of invalid records caused by risk driver f in cluster i of tier s. We assume that this value is either given by experts or calculated using one of the standard statistical methods of survey sampling (see, e.g., [19]); a simple frequency estimate is sketched below.
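
Where such estimates are derived from the risk protocol itself rather than supplied by experts, a naive computation is the observed fraction of affected records, as in the sketch below (the helper name estimate_frequencies is ours, and the example numbers are hypothetical).

```python
from collections import Counter
from typing import List

def estimate_frequencies(risk_drivers: List[int], m_is: int, N: int) -> List[float]:
    """Estimate v_f(i, s), f = 1..N, as the observed fraction of the m(i, s) records
    in cluster i of tier s that are invalid because of risk driver f."""
    counts = Counter(risk_drivers)              # risk-driver index of each row of TBL_i
    return [counts.get(f, 0) / m_is for f in range(1, N + 1)]

# Example: 3 adverse events among m(i, s) = 200 records, N = 5 possible risk drivers
print(estimate_frequencies([2, 5, 2], m_is=200, N=5))  # [0.0, 0.01, 0.0, 0.0, 0.005]
```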

Consider the following event Af(i, s):

$$ A_{f}(i,s) = \{\text{the risk driver } f \text{ is a source of an adverse event in cluster } i \text{ of tier } s\}, \quad f = 1, \ldots, N,\; s = 1, 2, \ldots $$

Denote the probability of the event Af(i, s) by pf(i, s). Assume that the frequency estimate vf(i, s) can serve as an estimate of the probability of the event Af(i, s), that is:

$$ p_{f}(i,s) = v_{f}(i,s). $$
(1)

We assume, for simplicity, that the considered adverse events are independent.

Now, we turn to determining the probabilities of adverse events in the entire tier.

Let pf(s) denote the probability of adverse events caused by risk factor f in tier s. We define pf(s) as a weighted average of the probabilities pf(i, s) as follows:

$$ p_{f}(s) = \left[ p_{f}(1,s) \cdot m(1,s) + \ldots + p_{f}(n_{s},s) \cdot m(n_{s},s) \right] / \left[ m(1,s) + \ldots + m(n_{s},s) \right], \quad f = 1, \ldots, N. $$
(2)

Notice that, for any f, the summation in pf(s) is taken over all the (invalid) clusters of tier s. This value is used for computing the entropy in the next section.
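
A direct transcription of (2) is given below as a minimal sketch; it assumes the per-cluster estimates are arranged as a nested list p, where row p[i] holds the values p_f(i, s) for f = 1..N, and m[i] holds m(i, s). The function name tier_probabilities is ours.

```python
from typing import List

def tier_probabilities(p: List[List[float]], m: List[int]) -> List[float]:
    """Eq. (2): combine per-cluster estimates into tier-level probabilities p_f(s),
    weighting each cluster i by its record count m(i, s).
    Row p[i] holds p_f(i, s) for f = 1..N; m[i] holds m(i, s)."""
    total_records = sum(m)
    N = len(p[0])
    return [
        sum(p[i][f] * m[i] for i in range(len(m))) / total_records
        for f in range(N)
    ]
```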

4 Entropy as a Measure of Information Quantity

We now proceed to show how previously-gathered statistical information about invalid data in an HDB can be used to quantify the information loss that is likely to be attributable to such adverse events. The larger the HDB, the larger this loss; moreover, the larger a cluster in the HDB, the larger the expected number of errors and redundancies and, hence, the greater the information loss.

We start by formally defining information entropy, as follows [9, 10]. Given a set of events \( E = \{ e_{1}, \ldots, e_{n} \} \) with a priori probabilities of event occurrence \( P = \{ p_{1}, \ldots, p_{n} \} \), \( p_{i} > 0 \), such that \( p_{1} + \ldots + p_{n} = 1 \), the entropy function H is defined by

$$ H = -K \sum_{i = 1, \ldots, n} p_{i} \cdot \log p_{i}, $$
(3)

where K is a constant corresponding to a choice of measurement units.
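
As a quick check on definition (3), the following sketch (the function name entropy is ours) uses the natural logarithm; choosing a different logarithm base only rescales the constant K.

```python
import math
from typing import List

def entropy(probs: List[float], K: float = 1.0) -> float:
    """Eq. (3): H = -K * sum_i p_i * log(p_i), for probabilities p_i > 0."""
    return -K * sum(p * math.log(p) for p in probs)

# The more concentrated the probabilities, the lower the entropy:
print(entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.386 (the maximum for four events)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.168
```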

The main idea of the suggested entropy-based approach is that the information entropy estimates the amount of information contained in a flow of adverse events registered in the risk protocol. Thus, the entropy characterizes the uncertainty, or the absence of our knowledge about the available information in the HDB. Recall that the lower the entropy, the more information (knowledge) is available in the HDB.

Using the above relations (1)–(3), we can compute the entropy H(s) of each tier s by applying the probabilities pf(s) defined by (2), as follows:

$$ H(s) = -K \sum_{f = 1, \ldots, N} p_{f}(s) \cdot \log p_{f}(s). $$
(4)

Although each tier contains its own set of associated clusters, the entropy H(s) of each tier s in fact estimates the entropy of the entire original HDB. This is because, by construction, the clusters of any tier s, s = 1, 2, …, jointly contain the whole set of records of the original database, which are “packed” into the corresponding clusters.
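
Pulling relations (2)–(4) together, the tier entropy can be computed directly from the per-cluster estimates; the sketch below (tier_entropy is our name) is self-contained and skips zero-probability terms, which contribute nothing to the sum.

```python
import math
from typing import List

def tier_entropy(p: List[List[float]], m: List[int], K: float) -> float:
    """Eq. (4): Shannon entropy of the tier-level probabilities p_f(s) of Eq. (2),
    scaled by the coefficient K. Row p[i] holds p_f(i, s) for f = 1..N; m[i] holds m(i, s)."""
    total = sum(m)
    p_s = [sum(p[i][f] * m[i] for i in range(len(m))) / total for f in range(len(p[0]))]
    return -K * sum(q * math.log(q) for q in p_s if q > 0)
```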

Entropy Dynamics.

Consider an arbitrary HDB and let the tier number s be s = 1, 2, … Consider a fragment of the HDB containing the tiers from 1 to t only, where t is some integer. Let us call such a structure t-truncated.

We believe that the normalizing coefficient K = Ks should take into account the amount of time spent on cleaning. Let the time for correcting/removing an adverse event caused by factor f in cluster i of tier s be Tf(i, s). Let the time for correcting/removing all the invalid clusters (nodes) in tier s be Ts, and let the time for the retrieval of all the records of tier s (including the corrected records) in the HDB be \( T_{s}^{ret} \), where \( T_{s} < T_{s}^{ret} \).

Assume that

$$ K_{s} = \, (T_{s}^{ret} - T_{s} )/T_{s}^{ret} . $$
(5)

The share of the cleaning time with respect to the entire retrieval time for tier s, 1 ≤ s ≤ t, is: \( T_{s} /T_{s}^{ret} \). The cleaning time share in the next tier, s + 1, is \( T_{s + 1} /T_{s + 1}^{ret} \).
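
To make the role of the coefficient concrete, the sketch below evaluates (5) on hypothetical timings (the numbers are purely illustrative): when cleaning takes a growing share of the retrieval time from one tier to the next, Ks shrinks accordingly.

```python
def K_s(T_clean: float, T_retrieve: float) -> float:
    """Eq. (5): K_s = (T_s^ret - T_s) / T_s^ret."""
    return (T_retrieve - T_clean) / T_retrieve

# Hypothetical timings in hours:
print(K_s(T_clean=2.0, T_retrieve=10.0))   # tier s:     0.8
print(K_s(T_clean=4.0, T_retrieve=10.0))   # tier s + 1: 0.6
```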

The key result is as follows. The entropy H(s), defined by (4) and (5) as a function of the tier number s, decreases with increasing s.

To prove this fact, we first observe that the ratio Ks in (5) decreases with increasing s. Indeed, as we noted in the previous section, the number of invalid records in any cluster is estimated by the standard statistical methods of survey sampling. For simplicity but without loss of generality, we may assume that the average cleaning time is approximately the same for any record in any cluster in any tier. However, the number of distinct clusters grows quickly with s; therefore, \( T_{s}^{ret} - T_{s} \) falls as s grows. Further, the percentage of invalid records is relatively small, so \( T_{s}^{ret} \) decreases only slightly as s grows. Therefore, Ks in (5) decreases when s increases.

The term \( -\sum_{f} p_{f}(s) \log p_{f}(s) \) in definition (4) is bounded from above by \( H(0) = \log N \), where N is known (i.e., recorded in the risk protocol) and bounded. Since the ratio Ks decreases with increasing s, the entropy H(s) defined by (4) decreases when s increases.

A similar result about the behavior of entropy in hierarchical systems was obtained and confirmed through experiments in [11, 13, 16, 20].

Stopping Rule.

Taking into account the entropy dynamics described above, we can identify the minimal number of steps sufficient for the cleaning process, assuming that the decision maker computes the entropy of the HDB tier by tier, starting with the root node. Specifically, such an iterative cleaning process can be stopped as soon as the entropy decrease gradient grad becomes sufficiently small upon moving to the next tier. Formally, we define an allowed inaccuracy level, δ, and the iterative cleaning process stops at a critical tier s0 when the following holds:

$$ grad = (H(s_{0}) - H(s_{0} + 1))/H(s_{0}) \le \delta. $$
(6)

This stopping rule determines the optimal (i.e. minimally required) number of tiers in the hierarchical database that need to be examined; it defines a more streamlined representation of the database, in which less informative tiers are eliminated.

As we shall see in the next section, there is no need to clean the entire large-size database; rather, it is sufficient to find the critical tier, s0, that satisfies (6) and then to perform the cleaning operations for the clusters found at this tier.
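
The rule can be checked tier by tier as the entropies become available; the sketch below (critical_tier is our name, and the entropy values in the example are hypothetical) applies condition (6) to a list of tier entropies. In practice, H(s + 1) needs to be computed only after H(s) has failed the test, so tiers beyond s0 + 1 are never processed.

```python
from typing import List

def critical_tier(H: List[float], delta: float) -> int:
    """Return the critical tier s0 of Eq. (6): the first tier at which the relative
    entropy decrease grad = (H[s0] - H[s0 + 1]) / H[s0] drops to delta or below.
    H[s] is assumed to hold the tier entropy H(s) for s = 0, 1, 2, ..."""
    for s in range(len(H) - 1):
        grad = (H[s] - H[s + 1]) / H[s]
        if grad <= delta:
            return s
    return len(H) - 1   # rule never triggered: all available tiers must be examined

# Example with hypothetical, decreasing entropy values:
print(critical_tier([2.10, 1.40, 1.05, 0.88, 0.80, 0.79], delta=0.03))   # -> 4
```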

5 Computational Application

We have applied the approach described above to two real-life HDB datasets. The first is a dataset from a real automobile HDB, similar to that presented in the example above (Fig. 2) and described in [13]. The initial dataset contained information on about 12,000 car components. We analyzed the entropy dynamics for a subset of the dataset containing 1,500 nodes. We set δ = 0.03 and observed that, for t = 5 and larger, the value of grad falls below δ, indicating that the cleaning process can be stopped after the fifth tier of the database (see Fig. 3). This stopping rule produces a truncated database structure with 5 tiers, containing 370 nodes at the 5th tier.

Fig. 3. Results of the entropy computations

The second dataset we used was a massive dataset from a medical data warehouse (DWH) run by Clalit Health Services (“Clalit”), the largest health maintenance organization in Israel [17, 18]. The Clalit DWH comprises historical and current health data for several million individuals over the last 25 years. The data include clinical, communicational and administrative data collected from primary care and specialty clinics, hospitals, diagnostic laboratories and medical centers, and national disease registries. The sociodemographic information related to patients and clinics is received periodically from national registries (e.g., the Israeli Central Bureau of Statistics).

In prior research [17, 18], we used the Clalit DWH in an investigation focused on patient profile discovery. In that analysis, hundreds of thousands of data records were extracted from the DWH. We observed that high data granularity gives access to a large number of details about each patient but is not efficient for summarizing the data at the sub-population level to support the development of new and efficient healthcare policies. Data classification and analysis were carried out using the clustering (k-means) method to reduce data complexity. More specifically, at the preprocessing step, clustering supported the discretization of dozens of administrative and socio-demographic attributes, e.g., location of residence, diagnosis granularity, details of drug treatment, communication channels used to access the Clalit website, etc. Hierarchical clustering during the data mining step then helped to discover a very limited number (around a dozen) of patient profiles. Complementing the studies [17, 18], the suggested entropy-based approach enabled us to clean the dataset of invalid data and to reduce the dimensionality of the subset extracted from the Clalit DWH.

Herein, in our computational experiments, we applied our entropy-based approach to a subset of DWH data related to patients with diabetes, comprising over 300,000 records, as in [18, 19]. However, in contrast to the latter studies, we divided the initial set into a different collection of homogeneous clusters. We iteratively used the following five attributes for dividing each tier into clusters (categories): (1) age (10 categories); (2) gender (2 categories); (3) illness acuteness (3 categories); (4) adherence to treatment (7 categories); and (5) the number of patient visits to healthcare practitioners (25 categories). We used the stopping rule described in Sect. 4 and set the stopping coefficient at δ = 0.02. The records in each of the clusters of the fourth tier were homogeneous with respect to attributes (1)–(4), whereas the records in each of the 5th-tier clusters were homogeneous with respect to all the attributes (1)–(5).
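
A minimal sketch of this attribute-by-attribute partition is given below. The field names (age, gender, acuteness, adherence, visit_bin) and the category functions are placeholders of our own and do not reflect the actual DWH schema; each successive tier is obtained by splitting every cluster of the previous tier on the next attribute in the list.

```python
from itertools import groupby
from typing import Callable, Dict, List

# Attributes used to split each successive tier, in the order listed above.
TIER_KEYS: List[Callable[[Dict], object]] = [
    lambda rec: rec["age"] // 10,     # (1) age group, ~10 categories
    lambda rec: rec["gender"],        # (2) gender, 2 categories
    lambda rec: rec["acuteness"],     # (3) illness acuteness, 3 categories
    lambda rec: rec["adherence"],     # (4) adherence to treatment, 7 categories
    lambda rec: rec["visit_bin"],     # (5) binned number of practitioner visits, 25 categories
]

def split_cluster(records: List[Dict], key: Callable[[Dict], object]) -> List[List[Dict]]:
    """Split one cluster into sub-clusters that are homogeneous in the given attribute."""
    ordered = sorted(records, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]
```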

The stopping rule indicated that five tiers are sufficient for reaching a stable entropy value in this experiment. Note that in the reduced graph the fourth tier contained about 500 clusters, each containing about 600 records, and the fifth tier contained about 3,000 clusters, each containing 90 to 100 records. The behavior of the entropy was observed to be similar to that shown in Fig. 3. As a result of the cleaning and clustering, instead of the original dataset with about 300,000 records, we obtained a reduced graph structure with about 3,000 nodes in the final tier.

6 Concluding Remarks

This paper developed an entropy-based approach for organizing data cleaning in an HDB. This approach enables the decision maker to identify the nodes and tiers in the HDB in which essential quantities of information are likely to have been lost, and to ignore nonessential tiers, thus making the cleaning process more efficient, and producing a reduced representation of the original HDB graph.

We note that combining the data processing (clustering) methodology described in [17, 18] with the entropic approach may enable researchers to streamline their datasets in large-scale investigations (e.g., epidemiological investigations). That is, prior to initiating such an investigation, it could be worthwhile to apply the entropy-based procedure proposed herein, and thus to eliminate redundant information (e.g., obtaining a dozen of the most informative attributes instead of a few hundred). We plan to adopt this approach to further investigate the Clalit DWH, focusing on practical issues of enhancing the influenza vaccination process.

In our future research, we intend to undertake a more in-depth analysis of links between entropy and costs in large-scale databases and to perform cost-benefit-risk analysis of the HDBs, considering the stochastic characteristics of data types.