
1 Introduction

As datasets have grown vastly larger in recent years, the risk that a database might contain “invalid” data—i.e., data that are erroneous, redundant or obsolete—has also escalated. This trend highlights the need for effective procedures for data cleaning, broadly defined as the detection and removal of errors and redundant or obsolete information from datasets of various structures in order to improve their quality.

In general, data cleaning entails three main processes, which have been discussed extensively in prior research (see, e.g. [1,2,3,4]): (i) identification of invalid records; (ii) validation, in which errors are corrected and invalid information is eliminated; and (iii) imputation, i.e., the replacement of invalid records.

In a very large database, it can be highly time-consuming and even computationally infeasible to execute each of these processes on the complete data structure. Herein, we develop a methodology for improving the efficiency of data cleaning in a hierarchical database (HDB)—the oldest and simplest type of database, in which the data are structured as a (tiered) hierarchical tree-type graph. In an HDB, data cleaning ultimately entails transforming the original database into a reduced graph model with a smaller set of tiers and nodes, in which only relevant information is retained.

The basic idea of our approach is to quantify the extent to which invalid data are likely to have caused information quantity loss in specific components of the database, thereby enabling the decision maker to prioritize these components for cleaning and to ignore components in which information loss is less severe. Though diverse definitions exist for information quantity in an HDB (see [5, 6] for in-depth discussions), we suggest that entropy—and specifically, Shannon’s information entropy [7, 8]—can serve as a useful tool for measuring the quantity of information in a database [9, 10] and for identifying where the most informative records are located. In general, entropy is a measure of uncertainty, chaos, and absence of knowledge. Thus, entropy taken with the opposite sign is equivalent to the information amount, or the knowledge available [7, 8].

To our knowledge, entropy-based approaches to information measurement and data cleaning in databases have received comparatively little attention in the literature. Yet, entropy is used in many domains as a means of evaluating system complexity (e.g., [11,12,13,14,15,16]). Several studies have explored the fact that a high level of entropy (or chaos) in a complex system impedes perfect system performance: the greater the entropy, the more uncertain is the state of the entire complex system and, consequently, a larger amount of information is required to monitor and manage the system. Adopting this perspective, herein we consider entropy as a complexity measure for estimating the amount of information contained in specific components of an HDB.

Prior studies [12, 15, 16] used entropy-based approaches in considering simple single-tier production systems. The present paper extends these approaches, in particular the approach suggested by Karp and Ronen [16], to the case of multi-layer hierarchical data structures. However, in contrast to the prior works, we investigate entropy in databases rather than in production systems, and hence the contents and measurement of the entropy differ.

Though a key objective in developing our approach is to provide an efficient means of cleaning the data in an HDB (towards improving data quality), our methodology enables us to achieve a broader objective: producing a more streamlined representation of an HDB. Specifically, as elaborated in what follows, our approach can be used to identify which data tiers in an HDB contain essential amounts of information. In effect, the decision maker might choose to eliminate subsequent nonessential tiers, thereby reducing the size of the HDB without losing important information.

The remainder of the paper is structured as follows. In the next section we provide a basic overview of the HDB architecture. Section 3 provides basic definitions used in our entropy-based approach, and Sect. 4 introduces the approach itself. In Sect. 5, we briefly describe the application of our approach to two real-life databases. Section 6 concludes.

2 Overview of the HDB Architecture

An HDB architecture is a tree-like graph structure in which the data are stored as records, where a record is defined as a collection of fields [2,3,4]. Each node in the graph represents a cluster of records. The nodes are organized in tiers that are sequentially linked to one another through graph arcs. Specifically, the “top” tier—tier zero, the root node—contains all initial records in the database. The root is linked to a first tier of nodes, representing clusters of these records. Each of these nodes is linked to a second tier of sub-clusters, and so on. We study an HDB by traversing its graph from the top down, starting from the root node.
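
For concreteness, such a tree of clusters can be represented in memory as follows. This is a minimal Python sketch under our own naming (HDBNode, tier); it is not tied to any particular database implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class HDBNode:
    """One node (cluster) of the hierarchical database: its records plus its sub-clusters."""
    records: List[Dict]                                      # data records grouped in this cluster
    children: List["HDBNode"] = field(default_factory=list)  # clusters of the next tier

def tier(root: HDBNode, s: int) -> List[HDBNode]:
    """Return tier L_s: all nodes at distance s from the root (s = 0 gives the root itself)."""
    level = [root]
    for _ in range(s):
        level = [child for node in level for child in node.children]
    return level
```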

HDB models are often used in situations in which the primary focus of information gathering is on a concrete system hierarchy, such as a list of components of complex equipment or a company organization chart. Examples of such databases are shown in Figs. 1 and 2, which, respectively, illustrate the hierarchical structures of a pharmaceutical materials database and an automobile component database. Specifically, Fig. 1 is an example of an HDB representing the distribution of pharmaceutical materials managed by a healthcare provider. The root node contains all records of all pharmaceutical materials that the provider (Distribution Center) holds. In the first tier, these records are clustered according to their distribution across the large geographical regions that the provider services. The second tier represents sub-regions, and the third tier represents the records clustered according to the individual clinics corresponding to each sub-region.

Fig. 1. A schematic structure of a database of pharmaceutical inventory held by a distribution center (adapted from Levner et al. [14])

Fig. 2. A structure of a database of car components produced by an automobile manufacturer (adjusted from Levner and Ptuskin [13])

The HDB illustrated in Fig. 2 represents the products produced by a car manufacturer. In this HDB, the root node contains all information about all car components produced by the manufacturer. The first tier contains clusters of components corresponding to large, ready-to-install subsystems, e.g., body assembly, seat assembly, engine module, etc. The second tier represents sub-components of the larger components represented by the first-tier nodes, and so on.

A third example is an HDB containing patient data for a healthcare provider. In this case, the root node contains all the available patient records (see [17, 18]). The nodes (clusters) of the first tier hold the records of patient subgroups (e.g., patients in specific age groups). For each of these subgroups, the second-tier nodes divide the records by gender. The third tier contains, for each of the latter clusters, sub-clusters of records grouped according to the number of times the patients visited doctors, and so on.

3 Basic Definitions

Our approach involves using previously-gathered statistical data regarding invalid records in the dataset to estimate the entropy—i.e., to quantify the information—associated with specific nodes in the HDB. In developing our approach, we conceptualize the invalid data records present in an HDB as a set of random events. Specifically, an invalid data record in a database is referred to as an adverse event.

We assume that, at some previous point in time and for a prespecified period of time (e.g., one year), an expert observed and made note of adverse events in the various components of the HDB, as well as potential causes for these events—referred to as risk drivers. For example, a risk driver for an erroneous data record in an HDB may be an operator’s error or malicious activity by a hacker. A typical risk driver for redundant data is integration of data from multiple sources.

The expert-produced list of adverse events and their corresponding risk drivers is referred to as the risk protocol [13]. Similarly to [13], for each node i in the HDB, we assume that the information stored for that node in the risk protocol—i.e., the adverse events that occurred in that node over the prespecified time period and their risk drivers—is recorded in a table, TBLi. Each row in the table corresponds to an individual adverse event occurring in node i. Given the tables TBLi for all the nodes of each HDB tier, the statistical data needed to apply our approach can be derived.
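
For illustration only, a row of TBLi can be held as a small record whose fields mirror the risk protocol; the identifiers, driver indices and notes below are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AdverseEvent:
    """One row of the risk-protocol table TBL_i for node i."""
    record_id: str     # identifier of the invalid data record
    risk_driver: int   # index f (1..N) of the risk driver the expert assigned to the event
    note: str = ""     # optional free-text comment from the risk protocol

# A hypothetical TBL_i with two observed adverse events:
TBL_i: List[AdverseEvent] = [
    AdverseEvent("rec-0173", risk_driver=2, note="operator typing error"),
    AdverseEvent("rec-0412", risk_driver=5, note="duplicate from merged sources"),
]
```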

Define a tier Ls (also denoted as tier s) as the set of nodes that are at the same distance s from the root node n0 in the graph of the HDB (see Figs. 1 and 2). Define a subtree Ts as a union of the tiers \( L_{0}, L_{1}, \ldots, L_{s} \).

We assume that, for each node (cluster) of the HDB, the list of risk drivers \( F = (f_{1}, f_{2}, \ldots, f_{N}) \) corresponding to the adverse events in that node is given. For simplicity but without loss of generality, we assume that N, the number of all possible risk drivers corresponding to the node, is the same for all tiers, and that any adverse event is caused by a single risk driver. We call a cluster invalid if some invalid records are found in it.

Introduce the following notation:

  • \( n_{s} \) – the total number of clusters in tier s;

  • \( w(s) \) – the number of invalid clusters in tier s registered in the risk protocol;

  • \( m(i, s) \) – the number of records in cluster i of tier s;

  • \( v_{f}(i, s) \) – an estimate of the percentage (frequency) of invalid records caused by risk driver f in cluster i of tier s. We assume that this value is either given by experts or calculated using one of the standard statistical methods of survey sampling (see, e.g., [19]); a simple frequency estimate is sketched below.
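
Where such estimates are derived from the risk protocol itself rather than supplied by experts, a naive computation is the observed fraction of affected records, as in the sketch below (the helper name estimate_frequencies is ours, and the example numbers are hypothetical).

```python
from collections import Counter
from typing import List

def estimate_frequencies(risk_drivers: List[int], m_is: int, N: int) -> List[float]:
    """Estimate v_f(i, s), f = 1..N, as the observed fraction of the m(i, s) records
    in cluster i of tier s that are invalid because of risk driver f."""
    counts = Counter(risk_drivers)              # risk-driver index of each row of TBL_i
    return [counts.get(f, 0) / m_is for f in range(1, N + 1)]

# Example: 3 adverse events among m(i, s) = 200 records, N = 5 possible risk drivers
print(estimate_frequencies([2, 5, 2], m_is=200, N=5))  # [0.0, 0.01, 0.0, 0.0, 0.005]
```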

Consider the following event Af(i, s):

$$ A_{f}(i,s) = \{\text{the risk driver } f \text{ is a source of an adverse event in cluster } i \text{ of tier } s\}, \quad f = 1, \ldots, N,\; s = 1, 2, \ldots $$

Denote the probability of the event Af(i, s) by pf(i, s). Assume that the frequency estimate vf(i, s) can serve as an estimate of the probability of the event Af(i, s), that is:

$$ p_{f}(i,s) = v_{f}(i,s). $$
(1)

We assume, for simplicity, that the considered adverse events are independent.

Now, we turn to determining the probabilities of adverse events in the entire tier.

Let pf(s) denote the probability of adverse events caused by risk factor f in tier s. We define pf(s) as a weighted average of the probabilities pf(i, s) as follows:

$$ p_{f}(s) = \left[ p_{f}(1,s) \cdot m(1,s) + \ldots + p_{f}(n_{s},s) \cdot m(n_{s},s) \right] / \left[ m(1,s) + \ldots + m(n_{s},s) \right], \quad f = 1, \ldots, N. $$
(2)

Notice that, for any f, the summation in pf(s) is taken over all the (invalid) clusters of tier s. This value is used for computing the entropy in the next section.
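
A direct transcription of (2) is given below as a minimal sketch; it assumes the per-cluster estimates are arranged as a nested list p, where row p[i] holds the values p_f(i, s) for f = 1..N, and m[i] holds m(i, s). The function name tier_probabilities is ours.

```python
from typing import List

def tier_probabilities(p: List[List[float]], m: List[int]) -> List[float]:
    """Eq. (2): combine per-cluster estimates into tier-level probabilities p_f(s),
    weighting each cluster i by its record count m(i, s).
    Row p[i] holds p_f(i, s) for f = 1..N; m[i] holds m(i, s)."""
    total_records = sum(m)
    N = len(p[0])
    return [
        sum(p[i][f] * m[i] for i in range(len(m))) / total_records
        for f in range(N)
    ]
```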

4 Entropy as a Measure of Information Quantity

We now proceed to show how previously-gathered statistical information about invalid data in an HDB can be used to quantify the information loss that is likely to be attributable to such adverse events. The larger the HDB, the larger this loss; moreover, the larger a cluster in the HDB, the larger the expected number of errors and redundancies and, hence, the greater the information loss.

We start by formally defining information entropy, as follows [9, 10]. Given a set of events \( E = \{ e_{1}, \ldots, e_{n} \} \) with a priori probabilities of event occurrence \( P = \{ p_{1}, \ldots, p_{n} \} \), \( p_{i} > 0 \), such that \( p_{1} + \ldots + p_{n} = 1 \), the entropy function H is defined by

$$ H = -K \sum_{i = 1, \ldots, n} p_{i} \cdot \log p_{i}, $$
(3)

where K is a constant corresponding to a choice of measurement units.
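
As a quick check on definition (3), the following sketch (the function name entropy is ours) uses the natural logarithm; choosing a different logarithm base only rescales the constant K.

```python
import math
from typing import List

def entropy(probs: List[float], K: float = 1.0) -> float:
    """Eq. (3): H = -K * sum_i p_i * log(p_i), for probabilities p_i > 0."""
    return -K * sum(p * math.log(p) for p in probs)

# The more concentrated the probabilities, the lower the entropy:
print(entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.386 (the maximum for four events)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.168
```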

The main idea of the suggested entropy-based approach is that the information entropy estimates the amount of information contained in a flow of adverse events registered in the risk protocol. Thus, the entropy characterizes the uncertainty, or the absence of our knowledge about the available information in the HDB. Recall that the lower the entropy, the more information (knowledge) is available in the HDB.

Using the above relations (1)–(3), we can compute the entropy H(s) of each tier s by applying the probabilities pf(s) defined by (2), as follows:

$$ H(s) = -K \sum_{f = 1, \ldots, N} p_{f}(s) \cdot \log p_{f}(s). $$
(4)

Although each tier contains its own set of associated clusters, the entropy H(s) of each tier s in fact estimates the entropy of the entire original HDB. This is because, by construction, the clusters of any tier s, s = 1, 2, …, jointly contain the whole set of records of the original database, which are “packed” into the corresponding clusters.
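
Pulling relations (2)–(4) together, the tier entropy can be computed directly from the per-cluster estimates; the sketch below (tier_entropy is our name) is self-contained and skips zero-probability terms, which contribute nothing to the sum.

```python
import math
from typing import List

def tier_entropy(p: List[List[float]], m: List[int], K: float) -> float:
    """Eq. (4): Shannon entropy of the tier-level probabilities p_f(s) of Eq. (2),
    scaled by the coefficient K. Row p[i] holds p_f(i, s) for f = 1..N; m[i] holds m(i, s)."""
    total = sum(m)
    p_s = [sum(p[i][f] * m[i] for i in range(len(m))) / total for f in range(len(p[0]))]
    return -K * sum(q * math.log(q) for q in p_s if q > 0)
```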

Entropy Dynamics.

Consider an arbitrary HDB and let the tier number s be s = 1, 2, … Consider a fragment of the HDB containing the tiers from 1 to t only, where t is some integer. Let us call such a structure t-truncated.

We believe that the normalizing coefficient K = Ks should take into account the amount of time spent on cleaning. Let the time for correcting/removing an adverse event caused by factor f in cluster i of tier s be Tf(i, s). Let the time for correcting/removing all the invalid clusters (nodes) in tier s be Ts, and let the time for the retrieval of all the records of tier s (including the corrected records) in the HDB be \( T_{s}^{ret} \), where \( T_{s} < T_{s}^{ret} \).

Assume that

$$ K_{s} = \, (T_{s}^{ret} - T_{s} )/T_{s}^{ret} . $$
(5)

The share of the cleaning time with respect to the entire retrieval time for tier s, 1 ≤ s ≤ t, is: \( T_{s} /T_{s}^{ret} \). The cleaning time share in the next tier, s + 1, is \( T_{s + 1} /T_{s + 1}^{ret} \).
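
To make the role of the coefficient concrete, the sketch below evaluates (5) on hypothetical timings (the numbers are purely illustrative): when cleaning takes a growing share of the retrieval time from one tier to the next, Ks shrinks accordingly.

```python
def K_s(T_clean: float, T_retrieve: float) -> float:
    """Eq. (5): K_s = (T_s^ret - T_s) / T_s^ret."""
    return (T_retrieve - T_clean) / T_retrieve

# Hypothetical timings in hours:
print(K_s(T_clean=2.0, T_retrieve=10.0))   # tier s:     0.8
print(K_s(T_clean=4.0, T_retrieve=10.0))   # tier s + 1: 0.6
```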

The key result is as follows. The entropy H(s), defined by (4) and (5) as a function of the tier number s, decreases with increasing s.

To prove this fact, we first observe that the ratio Ks in (5) decreases with increasing s. Indeed, as we noted in the previous section, the number of invalid records in any cluster is estimated by the standard statistical methods of survey sampling. For simplicity but without loss of generality, we may assume that the average cleaning time is approximately the same for any record in any cluster in any tier. However, the number of distinct clusters grows quickly with s; therefore, \( T_{s}^{ret} - T_{s} \) falls as s grows. Further, the percentage of invalid records is relatively small, so \( T_{s}^{ret} \) decreases only slightly as s grows. Therefore, Ks in (5) decreases when s increases.

The term \( -\sum_{f} p_{f}(s) \log p_{f}(s) \) in definition (4) is bounded from above by \( H(0) = \log N \), where N is known (i.e., recorded in the risk protocol) and bounded. Since the ratio Ks decreases with increasing s, the entropy H(s) defined by (4) decreases when s increases.

A similar result about the behavior of entropy in hierarchical systems was obtained and confirmed through experiments in [11, 13, 16, 20].

Stopping Rule.

Taking into account the entropy dynamics described above, we can identify the minimal number of steps sufficient for the cleaning process, assuming that the decision maker computes the entropy of the HDB tier by tier, starting with the root node. Specifically, such an iterative cleaning process can be stopped as soon as the entropy decrease gradient grad becomes sufficiently small upon moving to the next tier. Formally, we define an allowed inaccuracy level, δ, and the iterative cleaning process stops at a critical tier s0 when the following holds:

$$ grad = (H(s_{0}) - H(s_{0} + 1))/H(s_{0}) \le \delta. $$
(6)

This stopping rule determines the optimal (i.e. minimally required) number of tiers in the hierarchical database that need to be examined; it defines a more streamlined representation of the database, in which less informative tiers are eliminated.

As we shall see in the next section, there is no need to clean the entire large-size database; rather, it is sufficient to find the critical tier, s0, that satisfies (6) and then to perform the cleaning operations for the clusters found at this tier.
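
The rule can be checked tier by tier as the entropies become available; the sketch below (critical_tier is our name, and the entropy values in the example are hypothetical) applies condition (6) to a list of tier entropies. In practice, H(s + 1) needs to be computed only after H(s) has failed the test, so tiers beyond s0 + 1 are never processed.

```python
from typing import List

def critical_tier(H: List[float], delta: float) -> int:
    """Return the critical tier s0 of Eq. (6): the first tier at which the relative
    entropy decrease grad = (H[s0] - H[s0 + 1]) / H[s0] drops to delta or below.
    H[s] is assumed to hold the tier entropy H(s) for s = 0, 1, 2, ..."""
    for s in range(len(H) - 1):
        grad = (H[s] - H[s + 1]) / H[s]
        if grad <= delta:
            return s
    return len(H) - 1   # rule never triggered: all available tiers must be examined

# Example with hypothetical, decreasing entropy values:
print(critical_tier([2.10, 1.40, 1.05, 0.88, 0.80, 0.79], delta=0.03))   # -> 4
```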

5 Computational Application

We have applied the approach described above to two real-life HDB datasets. The first is a dataset from a real automobile HDB, similar to that presented in the example above (Fig. 2) and described in [13]. The initial dataset contained information on about 12,000 car components. We analyzed the entropy dynamics for a subset of the dataset containing 1,500 nodes. We set δ = 0.03 and observed that, for t = 5 and larger, the value of grad falls below δ, indicating that the cleaning process can be stopped after the fifth tier of the database (see Fig. 3). This stopping rule produces a truncated database structure with 5 tiers, containing 370 nodes at the 5th tier.

Fig. 3. Results of the entropy computations

The second dataset we used was a massive dataset from a medical data warehouse (DWH) run by Clalit Health Services (“Clalit”), the largest health maintenance organization in Israel [17, 18]. The Clalit DWH comprises historical and current health data for several million individuals over the last 25 years. The data include clinical, communicational and administrative data collected from primary care and specialty clinics, hospitals, diagnostic laboratories and medical centers, and national disease registries. The sociodemographic information related to patients and clinics is received periodically from national registries (e.g., the Israeli Central Bureau of Statistics).

In prior research [17, 18], we used the Clalit DWH in an investigation focused on patient profile discovery. In that analysis, hundreds of thousands of data records were extracted from the DWH. We observed that high data granularity gives access to a large number of details about each patient but is not efficient for summarizing the data at the sub-population level to support the development of new and efficient healthcare policies. Data classification and analysis were carried out using the clustering (k-means) method to reduce data complexity. More specifically, at the preprocessing step, clustering supported the discretization of dozens of administrative and socio-demographic attributes, e.g., location of residence, diagnosis granularity, details of drug treatment, communication channels used to access the Clalit website, etc. Hierarchical clustering during the data mining step then helped to discover a very limited number (around a dozen) of patient profiles. Complementing the studies [17, 18], the suggested entropy-based approach enabled us to clean the dataset of invalid data and to reduce the dimensionality of the subset extracted from the Clalit DWH.

Herein, in our computational experiments, we applied our entropy-based approach to a subset of DWH data related to patients with diabetes, comprising over 300,000 records, as in [18, 19]. However, in contrast to the latter studies, we divided the initial set into a different collection of homogeneous clusters. We iteratively used the following five attributes for dividing each tier into clusters (categories): (1) age (10 categories); (2) gender (2 categories); (3) illness acuteness (3 categories); (4) adherence to treatment (7 categories); and (5) the number of patient visits to healthcare practitioners (25 categories). We used the stopping rule described in Sect. 4 and set the stopping coefficient at δ = 0.02. The records in each of the clusters of the fourth tier were homogeneous with respect to attributes (1)–(4), whereas the records in each of the 5th-tier clusters were homogeneous with respect to all the attributes (1)–(5).
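
A minimal sketch of this attribute-by-attribute partition is given below. The field names (age, gender, acuteness, adherence, visit_bin) and the category functions are placeholders of our own and do not reflect the actual DWH schema; each successive tier is obtained by splitting every cluster of the previous tier on the next attribute in the list.

```python
from itertools import groupby
from typing import Callable, Dict, List

# Attributes used to split each successive tier, in the order listed above.
TIER_KEYS: List[Callable[[Dict], object]] = [
    lambda rec: rec["age"] // 10,     # (1) age group, ~10 categories
    lambda rec: rec["gender"],        # (2) gender, 2 categories
    lambda rec: rec["acuteness"],     # (3) illness acuteness, 3 categories
    lambda rec: rec["adherence"],     # (4) adherence to treatment, 7 categories
    lambda rec: rec["visit_bin"],     # (5) binned number of practitioner visits, 25 categories
]

def split_cluster(records: List[Dict], key: Callable[[Dict], object]) -> List[List[Dict]]:
    """Split one cluster into sub-clusters that are homogeneous in the given attribute."""
    ordered = sorted(records, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]
```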

The stopping rule indicated that five tiers are sufficient for reaching a stable entropy value in this experiment. Note that in the reduced graph the fourth tier contained about 500 clusters, each containing about 600 records, and the fifth tier contained about 3,000 clusters, each containing 90 to 100 records. The behavior of the entropy was observed to be similar to that shown in Fig. 3. As a result of the cleaning and clustering, instead of the original dataset with about 300,000 records, we obtained a reduced graph structure with about 3,000 nodes in the final tier.

6 Concluding Remarks

This paper developed an entropy-based approach for organizing data cleaning in an HDB. This approach enables the decision maker to identify the nodes and tiers in the HDB in which essential quantities of information are likely to have been lost, and to ignore nonessential tiers, thus making the cleaning process more efficient, and producing a reduced representation of the original HDB graph.

We note that combining the data processing (clustering) methodology described in [17, 18] with the entropic approach may enable researchers to streamline their datasets in large-scale investigations (e.g., epidemiological investigations). That is, prior to initiating such an investigation, it could be worthwhile to apply the entropy-based procedure proposed herein, and thus to eliminate redundant information (e.g., obtaining a dozen of the most informative attributes instead of a few hundred). We plan to adopt this approach to further investigate the Clalit DWH, focusing on practical issues of enhancing the influenza vaccination process.

In our future research, we intend to undertake a more in-depth analysis of links between entropy and costs in large-scale databases and to perform cost-benefit-risk analysis of the HDBs, considering the stochastic characteristics of data types.