Abstract
In statistical databases and data warehousing applications, aggregate views are commonly maintained as an underlying mechanism for summarising information. Where the databases or applications are distributed, or arise from independent data collections or system developments, incompatibility, heterogeneity, and data inconsistency may result. These challenges must be overcome if federations of aggregated databases are to be successfully incorporated into systems for database management, querying, retrieval, and knowledge discovery.
In this paper we address the issue of integrating aggregate views that have semantically heterogeneous classification schemes. In previous work we have developed a methodology that is efficient but that cannot easily handle data inconsistencies. Our previous approach is therefore not particularly well-suited to very large databases or federations of large numbers of databases. We now address these scalability issues by introducing a methodology for heterogeneous aggregate view integration that constructs a dynamic shared ontology to which each of the aggregate views can be explicitly related. A maximum likelihood technique, implemented using the EM (Expectation-Maximisation) algorithm, is used to inherently handle data inconsistencies in the computation of integrated aggregates that are described in terms of the dynamic shared ontology.
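The maximum likelihood step can be illustrated with a small sketch. It is not the implementation from the paper: it assumes that the shared ontology is already known as a list of fine-grained categories, that each aggregate view reports counts over groups of those categories, and that the counts follow a single multinomial distribution. Under those assumptions the standard EM update for grouped counts applies. The function name `em_integrate` and the data layout are hypothetical and chosen only for illustration.

```python
def em_integrate(views, n_shared, max_iter=500, tol=1e-10):
    """Estimate shared-category proportions from heterogeneous aggregate views.

    views: list of views; each view is a list of (members, count) pairs,
           where `members` is a set of shared-category indices covered by
           one category of that view (assumed layout, for illustration).
    n_shared: number of categories in the shared ontology.
    """
    total = sum(count for view in views for _, count in view)
    if total <= 0:
        raise ValueError("views must contain at least one positive count")

    # Start from a uniform distribution over the shared categories.
    pi = [1.0 / n_shared] * n_shared

    for _ in range(max_iter):
        expected = [0.0] * n_shared
        # E-step: split each observed count over the shared categories it
        # covers, in proportion to the current estimates.
        for view in views:
            for members, count in view:
                mass = sum(pi[k] for k in members)
                if count == 0 or mass == 0.0:
                    continue
                for k in members:
                    expected[k] += count * pi[k] / mass
        # M-step: renormalise the expected counts into proportions.
        new_pi = [e / total for e in expected]
        change = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if change < tol:
            break
    return pi


if __name__ == "__main__":
    # Shared categories 0, 1, 2; the two views group them differently.
    view_a = [({0}, 30), ({1, 2}, 70)]
    view_b = [({0, 1}, 60), ({2}, 40)]
    print(em_integrate([view_a, view_b], n_shared=3))
```

In this example the two views are jointly consistent with proportions 0.3, 0.3, and 0.4, so the estimate should approach those values. When the views disagree, the same loop still returns a proper distribution, which is the sense in which a likelihood-based estimate absorbs inconsistencies rather than failing on them.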
Additional information
Recommended by: Ahmed K. Elmagarmid.
About this article
Cite this article
McClean, S., Scotney, B., Morrow, P. et al. Integrating semantically heterogeneous aggregate views of distributed databases. Distrib Parallel Databases 24, 73–94 (2008). https://doi.org/10.1007/s10619-008-7031-6
DOI: https://doi.org/10.1007/s10619-008-7031-6