$\partial u\partial u$ Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data

Kathiravelu, Pradeeban; Galhardas, Helena; Veiga, Luís

doi:10.1007/978-3-319-26148-5_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9415)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Near duplicate detection algorithms have been proposed and implemented to detect and eliminate duplicate entries from massive datasets. Because data representation (such as measurement units) differs across data sources, potential duplicates may not be textually identical even though they refer to the same real-world entity. As data warehouses typically contain data from several heterogeneous sources, detecting near duplicates in a data warehouse requires considerable memory and processing power.
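As an illustration of why textually different records can still be near duplicates, the following minimal sketch compares records by the Jaccard similarity of their token sets. The similarity measure, the `0.5` threshold, the function names, and the sample records are all assumptions chosen for illustration; the abstract does not state which measures or thresholds the framework uses.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def near_duplicates(records, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical records: the first two describe the same item in different units.
records = [
    "Acme Steel Rod, length 2 m, Lisbon",
    "ACME steel rod length 200 cm Lisbon",
    "Globex Copper Wire, 50 m, Porto",
]

print(near_duplicates(records))
```

The first two records share five of nine distinct tokens, so they pass the threshold despite the differing units, while the third record does not match either. The nested loop compares every pair, which is the quadratic cost that motivates distributing the work.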

Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. Parallel and distributed frameworks have recently been used to scale existing algorithms to larger datasets, but these efforts often focus on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of existing similarity join algorithms is still lacking.

In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of $\partial u\partial u$, a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. $\partial u\partial u$ leverages the distributed shared memory and execution model provided by IMDG to execute existing near duplicate detection algorithms in a parallel and multi-tenanted environment. As a unified near duplicate detection framework for big data, $\partial u\partial u$ efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
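The general idea of running an existing pairwise comparison in parallel can be sketched as follows. This is not the framework's implementation: a local thread pool stands in for the nodes of a data grid, the round-robin partitioning of candidate pairs is an assumed strategy, and all names and sample records are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def similarity(a, b):
    """Jaccard similarity over whitespace-separated lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def compare_chunk(records, pairs, threshold):
    """Work done by one simulated node: compare only its assigned pairs."""
    return [(i, j) for i, j in pairs
            if similarity(records[i], records[j]) >= threshold]


def distributed_near_duplicates(records, workers=3, threshold=0.5):
    """Split all candidate pairs round-robin across workers, then merge results."""
    all_pairs = list(combinations(range(len(records)), 2))
    chunks = [all_pairs[w::workers] for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compare_chunk, records, chunk, threshold)
                   for chunk in chunks]
        found = [pair for future in futures for pair in future.result()]
    return sorted(found)


# Hypothetical records used only to exercise the sketch.
records = [
    "near duplicate detection for big data",
    "near duplicate detection in big data",
    "in-memory data grids",
    "in-memory data grid platforms",
]

print(distributed_near_duplicates(records))
```

Because each chunk is compared independently and the merged result is sorted, the output does not depend on the number of workers. In an actual data grid the records would live in distributed shared memory rather than being passed to each task.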

Author information

Authors and Affiliations

INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
Pradeeban Kathiravelu, Helena Galhardas & Luís Veiga

Corresponding author

Correspondence to Pradeeban Kathiravelu.

Editor information

Editors and Affiliations

Trinity College Dublin, Dublin 2, Ireland
Christophe Debruyne
University of Lorraine, Vandoeuvre-les-Nancy Cedex, France
Hervé Panetto
TU Graz, Graz, Austria
Robert Meersman
La Trobe University, Melbourne, Australia
Tharam Dillon
PROFACTOR GmbH, Steyr-Gleink, Austria
Georg Weichhart
Drexel University, Philadelphia, Pennsylvania, USA
Yuan An
Università degli Studi di Milano, Crema, Italy
Claudio Agostino Ardagna

About this paper

Cite this paper

Kathiravelu, P., Galhardas, H., Veiga, L. (2015). $\partial u\partial u$ Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data. In: Debruyne, C., et al. On the Move to Meaningful Internet Systems: OTM 2015 Conferences. OTM 2015. Lecture Notes in Computer Science, vol 9415. Springer, Cham. https://doi.org/10.1007/978-3-319-26148-5_14

DOI: https://doi.org/10.1007/978-3-319-26148-5_14
Published: 28 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26147-8
Online ISBN: 978-3-319-26148-5
eBook Packages: Computer Science, Computer Science (R0)
