Abstract
Data fusion is the process of combining multiple representations of the same object, extracted from several external sources, into a single, clean representation. It is usually the last step of an integration process, executed after the schema matching and entity identification steps. More specifically, data fusion aims at resolving attribute value conflicts based on user-defined rules. Although the literature offers several approaches to fusing data, few of them focus on optimizing the process when new versions of the sources become available. In this paper, we propose a model for incremental data fusion. Our approach is based on storing provenance information in the form of a sequence of operations. These operations reflect the last fusion rules applied to the imported data. By keeping both the original source value and the new fused data in the operations repository, we are able to reliably detect source value updates and propagate them to the fusion process, which reapplies previously defined rules whenever possible. This approach reduces the number of data items affected by source updates and minimizes the amount of manual user intervention in future fusion processes.
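The idea of an operations repository can be illustrated with a small sketch. This is not the model defined in the chapter: the `Operation` record, the `fuse` and `refresh` functions, the rule names `prefer_s1` and `longest`, and the sample data are all hypothetical, chosen only to show how keeping the source values next to the fused value lets a refresh skip unchanged items, reapply a stored rule where a source changed, and flag an item for manual review when the rule can no longer be applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]            # (object id, attribute name); assumed shape
Values = Dict[str, str]          # source name -> value reported by that source
Rule = Callable[[Values], str]   # hypothetical shape of a user-defined fusion rule


@dataclass
class Operation:
    """One provenance record: the rule applied, its inputs, and its result."""
    key: Key
    rule_name: str
    source_values: Values        # original source values seen at fusion time
    fused_value: str             # value produced by the rule


def fuse(key: Key, values: Values, rule_name: str, rules: Dict[str, Rule]) -> Operation:
    """Apply a named rule and record the operation with a copy of its inputs."""
    return Operation(key, rule_name, dict(values), rules[rule_name](values))


def refresh(repository: List[Operation], new_sources: Dict[Key, Values],
            rules: Dict[str, Rule]) -> Tuple[List[Key], List[Key]]:
    """Reapply stored rules only where a source value changed.

    Returns the keys that were re-fused automatically and the keys that
    need manual intervention because the stored rule could not be applied.
    """
    reapplied: List[Key] = []
    manual: List[Key] = []
    for i, op in enumerate(repository):
        current = new_sources.get(op.key, {})
        if current == op.source_values:
            continue                     # no source update: keep the fused value
        try:
            repository[i] = fuse(op.key, current, op.rule_name, rules)
            reapplied.append(op.key)
        except (KeyError, ValueError):
            manual.append(op.key)        # rule no longer applicable
    return reapplied, manual


# Two illustrative rules (assumptions, not taken from the chapter).
RULES: Dict[str, Rule] = {
    "prefer_s1": lambda v: v["s1"],                 # trust source s1
    "longest": lambda v: max(v.values(), key=len),  # keep the longest value
}

repo = [
    fuse(("p1", "name"), {"s1": "J. Smith", "s2": "John Smith"}, "longest", RULES),
    fuse(("p1", "city"), {"s1": "Curitiba", "s2": "Sao Carlos"}, "prefer_s1", RULES),
]

# A new version of the sources: only the city reported by s1 has changed.
new_version = {
    ("p1", "name"): {"s1": "J. Smith", "s2": "John Smith"},
    ("p1", "city"): {"s1": "Londrina", "s2": "Sao Carlos"},
}
reapplied, manual = refresh(repo, new_version, RULES)
```

In this run only the `city` attribute is re-fused, since the stored source values for `name` match the new version. A real system would also have to handle new and deleted objects, which this sketch leaves out.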
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Hara, C.S., de Aguiar Ciferri, C.D., Ciferri, R.R. (2013). Incremental Data Fusion Based on Provenance Information. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, WC., Fourman, M. (eds) In Search of Elegance in the Theory and Practice of Computation. Lecture Notes in Computer Science, vol 8000. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41660-6_18
DOI: https://doi.org/10.1007/978-3-642-41660-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41659-0
Online ISBN: 978-3-642-41660-6
eBook Packages: Computer Science, Computer Science (R0)