Abstract
Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focalized search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities, exploiting the availability of structured (and semantic) data in Web search, is to enrich the search results with entity mining over their full contents. Such services provide users with an initial overview of the information space, allowing them to restrict it gradually until they locate the desired hits, even if these are low ranked. This is especially important in areas of professional search such as medical search and patent search. In this paper we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions. A key contribution of our work is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. Finally, we report experimental results on the achieved speedup in various settings.
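To illustrate how entity mining over search hits might be factorized into map and reduce functions, the following is a minimal single-process sketch, not the implementation evaluated in this work. The gazetteer lookup is a hypothetical stand-in for a real entity recognizer, and the names `map_hit`, `shuffle`, `reduce_entity`, and `summarize`, as well as the sample hits, are assumptions made for this example only. A distributed framework would run the map calls on different nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical gazetteer standing in for a real named-entity recognizer.
GAZETTEER = {
    "Athens": "Location",
    "Crete": "Location",
    "Amazon": "Organization",
    "aspirin": "Drug",
}

def map_hit(doc_id, text):
    """Map: emit ((category, entity), doc_id) once per entity found in a hit."""
    tokens = {tok.strip(".,;:()") for tok in text.split()}
    for tok in tokens:
        category = GAZETTEER.get(tok)
        if category is not None:
            yield (category, tok), doc_id

def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_entity(key, doc_ids):
    """Reduce: merge the per-hit emissions into one record per entity."""
    category, entity = key
    return category, entity, sorted(set(doc_ids))

def summarize(hits):
    """Run map, shuffle, and reduce over all hits; rank entities by hit count."""
    pairs = chain.from_iterable(map_hit(d, t) for d, t in hits.items())
    records = [reduce_entity(k, v) for k, v in shuffle(pairs).items()]
    return sorted(records, key=lambda r: (-len(r[2]), r[0], r[1]))

# Illustrative hits; in a meta-service these would be fetched result pages.
hits = {
    "d1": "Amazon opened an office in Athens.",
    "d2": "Flights from Athens to Crete resume.",
    "d3": "Study links aspirin use to lower risk.",
}
summary = summarize(hits)
for category, entity, docs in summary:
    print(category, entity, docs)
```

Each record pairs an entity with the hits that mention it, which is the information a user would need to restrict the result list to the hits containing a chosen entity.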
Notes
In our implementation any system that supports OpenSearch [14] can straightforwardly be used.
By September 2011, datasets from Linked Open Data (http://linkeddata.org/) had grown to 31 billion RDF triples, interlinked by around 504 million RDF links.
We chose Bing because it does not limit the number of queries submitted, in contrast to Google, which blocks the account for one hour if more than 600 queries are submitted.
This size is chosen to ensure full utilization in all cases, as max(Reusability) × max(Split size) = 20 × 5 MB = 100 MB.
Even with the full functionality, the reduce phase constitutes only about 2% of the total job time when analyzing 300 MB-SET1 using 4 nodes.
In particular: Person, Location, Organization, Address, Date, Time, Money, Percent, Age, Drug.
References
Allocca, C., d’Aquin, M., Motta, E.: Impact of using relationships between ontologies to enhance the ontology search results. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) The Semantic Web: Research and Applications. Lecture Notes in Computer Science, vol. 7295, pp. 453–468. Springer, Berlin (2012)
Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities, pp. 483–485 (1967)
Apache Software Foundation: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. http://hadoop.apache.org/. Accessed: 03/05/2013
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M.: A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)
Assel, M., Cheptsov, A., Gallizo, G., Celino, I., Dell’Aglio, D., Bradeško, L., Witbrock, M., Della Valle, E.: Large knowledge collider—a service-oriented platform for large-scale semantic reasoning. In: Proceedings of the International Conference on Web Intelligence, Mining and Semantics (WIMS’11), pp. 41:1–41:9. ACM, New York (2011)
Bonino, D., Ciaramella, A., Corno, F.: Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Pat. Inf. 32(1), 30–38 (2010)
Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)
Callaghan, G., Moffatt, L., Szasz, S.: General architecture for text engineering. http://gate.ac.uk/. Accessed: 03/04/2013
Callan, J.: Distributed information retrieval. Advances in Information Retrieval 7, 127–150 (2002)
Caputo, A., Basile, P., Semeraro, G.: Boosting a semantic search engine by named entities. In: Proceedings of the 18th International Symposium on Foundations of Intelligent Systems (ISMIS’09), pp. 241–250. Springer, Berlin (2009)
Carpineto, C., D’Amico, M., Romano, G.: Evaluating subtopic retrieval methods: clustering versus diversification of search results. Inf. Process. Manag. 48(2), 358–373 (2012)
Chen, S., Schlosser, S.W.: Map-reduce meets wider varieties of applications. Technical report IRP-TR-08-05, Intel Research Pittsburgh (2008)
Cheng, T., Yan, X., Chang, K.: Supporting entity search: a large-scale prototype search engine. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD’07), pp. 1144–1146. ACM, New York (2007)
Clinton, D., Tesler, J., Fagan, M., Snell, J., Suave, A., et al.: OpenSearch is a collection of simple formats for the sharing of search results. http://www.opensearch.org/. Accessed: 03/05/2013
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02) (2002)
Das, D., Martins, A.: A survey on automatic text summarization. Literature Survey for the Language and Statistics II course at CMU 4, 192–195 (2007)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Ernde, B., Lebel, M., Thiele, C., Hold, A., Naumann, F., Barczyński, W., Brauer, F.: ECIR—a lightweight approach for entity-centric information retrieval. In: Proceedings of the 18th Text REtrieval Conference (TREC 2010) (2010)
Fafalios, P., Kitsos, I., Marketakis, Y., Baldassarre, C., Salampasis, M., Tzitzikas, Y.: Web searching with entity mining at query time. In: Proceedings of the 5th Information Retrieval Facility Conference (IRFC 2012), Vienna (2012)
Fafalios, P., Salampasis, M., Tzitzikas, Y.: Exploratory patent search with faceted search and configurable entity mining. In: Proceedings of the 1st International Workshop on Integrating IR Technologies for Professional Search (ECIR 2013) (2013)
Grossman, R.L., Gu, Y.: Data mining using high performance data clouds: experimental studies using sector and sphere. CoRR abs/0808.3019, 920–927 (2008)
Halevy, A.Y.: Answering queries using views: a survey. VLDB J. 10(4), 270–294 (2001)
Herzig, D.M., Tran, T.: Heterogeneous web data search using relevance-based on the fly data integration. In: Proceedings of the 21st International Conference on World Wide Web (WWW ’12), pp. 141–150. ACM, New York (2012)
Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large RDF graphs using cloud computing tools. In: 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD), pp. 1–10. IEEE Press, New York (2010)
Hwang, J.: IBM pattern modeling and analysis tool for Java garbage collector. https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=22d56091-3a7b-4497-b36e-634b51838e11. Accessed: 28/01/2013
Jaccard, P.: The distribution of the flora in the alpine zone. New Phytol. 11(2), 37–50 (1912)
Jestes, J., Yi, K., Li, F.: Building wavelet histograms on large data in mapreduce. Proc. VLDB Endow. 5(2), 109–120 (2011)
Jiménez-Ruiz, E., Grau, B.C., Horrocks, I., Berlanga, R.: Ontology integration using mappings: towards getting the right logical consequences. In: The Semantic Web: Research and Applications, pp. 173–187. Springer, Berlin (2009)
Joho, H., Azzopardi, L., Vanderbauwhede, W.: A survey of patent users: an analysis of tasks, behavior, search functionality and system requirements. In: Proc. of the 3rd Symposium on Information Interaction in Context, pp. 13–24. ACM, New York (2010)
Käki, M.: Findex: search result categories help users when document ranking fails. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 131–140. ACM, New York (2005)
Käki, M., Aula, A.: Findex: improving search result use through automatic filtering categories. Interact. Comput. 17(2), 187–206 (2005)
Kitsos, I., Papaioannou, A., Tsikoudis, N., Magoutis, K.: Adapting data-intensive workloads to generic allocation policies in cloud infrastructures. In: Proceedings of IEEE/IFIP Network Operations and Management Symposium (NOMS 2012), pp. 25–33. IEEE Press, New York (2012)
Kohn, A., Bry, F., Manta, A., Ifenthaler, D.: Professional search: requirements, prototype and preliminary experience report, pp. 195–202 (2008)
Kules, B., Capra, R., Banta, M., Sierra, T.: What do exploratory searchers look at in a faceted search interface? In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 313–322. ACM, New York (2009)
Kulkarni, P.: Distributed SPARQL query engine using MapReduce. Master’s thesis
Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P.: A platform for scalable one-pass analytics using mapreduce. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD’11), pp. 985–996. ACM, New York (2011)
Marketakis, Y., Tzanakis, M., Tzitzikas, Y.: Prescan: towards automating the preservation of digital objects. In: Proceedings of the International Conference on Management of Emergent Digital EcoSystems (MEDES’09), pp. 60:404–60:411. ACM, New York (2009)
Massie, M., Chun, B., Culler, D.: The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)
Massie, M., Li, B., Nicholes, B., Vuksan, V., Alexander, R., Buchbinder, J., Costa, F., Dean, A., Josephsen, D., Phaal, P., et al.: Monitoring with Ganglia. O’Reilly Media, Inc., Sebastopol (2012)
McCreadie, R., Macdonald, C., Ounis, I.: Comparing distributed indexing: to mapreduce or not? In: Proc. of LSDS-IR, pp. 41–48 (2009)
McCreadie, R., Macdonald, C., Ounis, I.: Mapreduce indexing strategies: studying scalability and efficiency. Inf. Process. Manag. 48(5), 873–888 (2012)
Mika, P., Tummarello, G.: Web semantics in the clouds. IEEE Intell. Syst. 23(5), 82–87 (2008)
Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Mining Text Data, pp. 43–76 (2012)
Papadimitriou, S., Sun, J.: Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: Eighth IEEE International Conference on Data Mining (ICDM’08), pp. 512–521. IEEE Press, New York (2008)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., Dewitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD’09), pp. 165–178. ACM, New York (2009)
Phaal, P.: SFlow is an industry standard technology for monitoring high speed switched networks. http://blog.sflow.com/. Accessed: 03/05/2013
Poosala, V., Haas, P., Ioannidis, Y., Shekita, E.: Improved histograms for selectivity estimation of range predicates, vol. 25, pp. 294–305. ACM, New York (1996)
Pratt, W., Fagan, L.: The usefulness of dynamically categorizing search results. J. Am. Med. Inform. Assoc. 7(6), 605–617 (2000)
Ramachandran, S.: Google developers: Web metrics. https://developers.google.com/speed/articles/web-metrics. Accessed: 03/05/2013
Sacco, G., Tzitzikas, Y.: Dynamic Taxonomies and Faceted Search. Springer, Berlin (2009)
Thakker, D., Osman, T., Lakin, P.: Java annotation patterns engine. http://en.wikipedia.org/wiki/JAPE_(linguistics). Accessed: 03/04/2013
White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2009)
Tzitzikas, Y., Meghini, C.: Ostensive automatic schema mapping for taxonomy-based peer-to-peer systems. In: Cooperative Information Agents VII, pp. 78–92. Springer, Berlin (2003)
Tzitzikas, Y., Spyratos, N., Constantopoulos, P.: Mediators over taxonomy-based information sources. VLDB J. 14(1), 112–136 (2005)
Urbani, J., Kotoulas, S., Oren, E., Van Harmelen, F.: Scalable distributed reasoning using Mapreduce, pp. 634–649 (2009)
van Zwol, R., Garcia Pueyo, L., Muralidharan, M., Sigurbjörnsson, B.: Machine learned ranking of entity facets. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’10), pp. 879–880. ACM, New York (2010)
Venner, J.: Pro Hadoop. Apress, Berkeley (2009)
White, R.W., Kules, B., Drucker, S.M., Schraefel, M.: Supporting exploratory search, introduction (special issue). Commun. ACM 49(4), 36–39 (2006)
Wilson, M., et al.: A longitudinal study of exploratory and keyword search. In: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’08), pp. 52–56. ACM, New York (2008)
Yahoo! Inc. Chaining jobs. http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining. Accessed: 09/05/2013
Zhai, K., Boyd-Graber, J., Asadi, N., Alkhouja, M.: Mr. LDA: a flexible large scale topic modeling package using variational inference in Mapreduce. In: Proceedings of the 21st International Conference on World Wide Web (WWW’12), pp. 879–888. ACM, New York (2012)
Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in Mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49. ACM, New York (2012)
Acknowledgements
Many thanks to Carlo Allocca and Pavlos Fafalios for their contributions. We gratefully acknowledge the support of the iMarine (FP7 Research Infrastructures, 2011–2014) and PaaSage (FP7 Integrated Project 317715, 2012–2016) EU projects and of Amazon Web Services through an Education Grant. We also acknowledge the interesting discussions we had in the context of the MUMIA COST action (IC1002, 2010–2014).
Additional information
Communicated by Feifei Li and Suman Nath.
Cite this article
Kitsos, I., Magoutis, K. & Tzitzikas, Y. Scalable entity-based summarization of web search results using MapReduce. Distrib Parallel Databases 32, 405–446 (2014). https://doi.org/10.1007/s10619-013-7133-7
DOI: https://doi.org/10.1007/s10619-013-7133-7