Abstract
In this paper, we present efficient, scalable, and portable parallel algorithms for the off-line clustering, the on-line retrieval and the update phases of the Text Retrieval (TR) problem based on the vector space model and using clustering to organize and handle a dynamic document collection. The algorithms are running on the Coarse-Grained Multicomputer (CGM) and/or the Bulk Synchronous Parallel (BSP) model which are two models that capture within a few parameters the characteristics of the parallel machine. To the best of our knowledge, our parallel retrieval algorithms are the first ones analyzed under these specific parallel models. For all the phases of the proposed algorithms, we analytically determine the relevant communication and computation cost thereby formally proving the efficiency of the proposed solutions. In addition, we prove that our technique for the on-line retrieval phase performs very well in comparison to other possible alternatives in the typical case of a multiuser information retrieval (IR) system where a number of user queries are concurrently submitted to an IR system. Finally, we discuss external memory issues and show how our techniques can be adapted to the case when processors have limited main memory but sufficient disk capacity for holding their local data.
Similar content being viewed by others
References
Aslam J, Pelekhov E, Rus D (2004) The star clustering algorithm for static and dynamic information organization. J Graph Algorithms Appl 8:95–129
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley, New York
Bilardi G, Fantozzi C, Pietracaprina A, Pucci G (2001) On the effectiveness of D-BSP as a bridging model of parallel computation. In: Proc of the int conf on comput sci (ICCS 2001), pp 579–588
Can F (1993) Incremental clustering for dynamic information processing. ACM Trans Inf Process Syst 11:143–164
Can F, Altingovde I, Demir E (2004) Efficiency and effectiveness of query processing in cluster-based retrieval. Inf Syst 29:697–717
Chan A, Dehne F (1999) A note on coarse grained parallel integer sorting. Parallel Process Lett 9(4):533–538
Chaudhri B (1994) Dynamic clustering for time incremental data. Pattern Recognit Lett 13:27–34
Chung S, Kwon H, Ryu K, Chung Y, Jang H, Choi C (2001) Information retrieval on an SCI-based PC cluster. J Supercomput 19:251–265
Cringean J, England R, Manson G, Willett P (1990) Parallel text searching in serial files using a processor farm. In: Proc of ACM SIGIR, pp 429–452
Dehne F, Fabri A, Rau-Chaplin A (1993) Scalable parallel geometric algorithms for coarse grained multicomputer. In: Proc of the ACM 9th symp on comput geom, pp 298–307
Dehne F, Fabri A, Kenyon C (1994) Scalable and architecture independent parallel geometric algorithms with high probability optimal time. In: Proc of the IEEE symp on parallel and distrib process, pp 586–593
Dehne F, Dittrich W, Hutchinson D (1997) Efficient external memory algorithms by simulating coarse-grained parallel algorithms. In: Proc of the 9th annual ACM symp on parallel algorithms and archit (SPAA), pp 106–115
Dehne F, Dittrich W, Hutchinson D (2003) Efficient external memory algorithms by simulating coarse-grained parallel algorithms. Algorithmica 36:97–122
Dehne F, Deng X, Dymond P, Fabri A, Kokhar A (1995) A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. In: Proc of the ACM symp on parallel algorithms and archit, pp 27–33
Dehne F, Hutchinson D, Maheshwari A, Dittrich W (1999) Reducing I/O complexity by simulating coarse grained parallel algorithms. In: Proc of the 13th Int and 10th symp on parallel and distrib process (IPPS/SPDP), pp 14–20
Dehne F, Dittrich W, Hutchinson D, Maheshwari A (2002) Bulk synchronous parallel algorithms for the external memory model. Theory of Comput Syst 35:567–597
Dehne F, Ferreira A, Caceres E, Wong S, Roncato A (2002) Efficient parallel graph algorithms for coarse-grained multicomputers and BSP. Algorithmica 33:183–200
Dobrynin V, Patterson D, Galushka M, Rooney N (2005) SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection. IEEE Trans Inf Technol Biomed 9(2):256–265
Efraimidis P, Glymidakis C, Mamalis B, Spirakis P, Tampakas B (1995) Parallel text retrieval on a high performance supercomputer using the vector space model. In: Proc of ACM SIGIR, pp 58–66
Gavalas D, Konstantopoulos C, Mamalis B, Pantziou G (2005) Efficient BSP/CGM algorithms for text retrieval. In: Proc of the int conf on parallel and distrib comput syst (IASTED PDCS), pp 301–306
Gerbessiotis A, Siniolakis C (1996) Primitive operations on the BSP model. Technical Report PRG-TR-23-96, Computing Laboratory, Oxford University
Gerbessiotis A, Valiant L (1994) Direct bulk synchronous parallel algorithms. J Parallel Distrib Comput 22:251–267
Goodrich M (1999) Communication-efficient parallel sorting. SIAM J Comput 29(2):416–432
Hawking D (1996) Document retrieval performance on parallel systems. In: Proc of the int conf on parallel and distrib process tech and appl (PDPTA ’96), pp 1354–1365
Ishikawa Y, Chen Y, Kitagawa H (2001) An on-line document clustering method based on forgetting factors. In: Proc of the 5th Eur conf on res and adv technol for digit libr, pp 325–339
Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Juurlink B, Wijshoff H (1996) Communication primitives for BSP computers. Inf Process Lett 58:303–310
Kanellakis P, Shvartsman A (1997) Fault-tolerant parallel computation. Kluwer Academic, Norwell
Kang J, Ahn H, Jung S, Ryu K, Kwon H, Chung S (2004) Improving load balance and fault tolerance for PC cluster-based parallel information retrieval. In: Proc of the 5th int conf on parallel process and appl math (PPAM 2003), pp 682–687
Karger D, Stein C, Wein J (1997) Scheduling algorithms. In: Atallah M (ed) Algorithms and theory of computation handbook. CRC Press, New York, pp 35-1–35-33
Katriel I, Meyer U (2003) Elementary graph algorithms in external memory. In: Meyer U, Sanders P, Sibeyn J (eds) Algorithms for memory hierarchies. LNCS, vol 2625. Springer, Berlin, pp 62–84
Kleinberg J, Tardos E (2006) Algorithm design. Addison-Wesley, Boston
Kunder M (2008) The size of the World Wide Web: estimated size of Google’s index. In: The size of the World Wide Web. Available via http://www.worldwidewebsize.com. Accessed 2 June 2008
Kurland O, Lee L (2004) Corpus structure, language models, and ad hoc information retrieval. In: Proc of SIGIR’04, pp 194–201
Levering R, Cutler M (2006) The portrait of a common HTML Web page. In: Proc of the 2006 ACM symp on doc eng, pp 198–204
Liu X, Croft W (2004) Cluster-based retrieval using language models. In: Proc of SIGIR’04, pp 186–193
Lu Z, McKinley K (2000) Partial replica selection based on relevance for information retrieval. In: Proc of the 22nd Int ACM SIGIR conf on res and develop in inf retr, pp 97–104
MacFarlane A, Robertson S, McCann J (2000) Pliers at TRECS. In: Proc of eighth text retr conf (TREC-8), pp 241–252
Mamalis B, Spirakis P, Tampakas B (2003) Parallel processing of multiple text queries on hypercube interconnection networks. Int J Comput Appl 10(1):115–132
Manning C, Raghavan P, Schutze H (2008) An introduction to information retrieval, preliminary draft. Cambridge University Press. Available via http://www.informationretrieval.org. Accessed 1 June 2008
Martin J, Tiskin A (2004) Dynamic BSP: towards a flexible approach to parallel computing over the grid. In: East I, Martin J, Welch P, Duce D, Green M (eds) Communicating process architectures. IOS Press, Amsterdam, pp 219–226
McColl W (1994) Scalable parallel computing: a grand unified theory and its practical development. In: Proc of IFIP world congr, pp 539–546
Rasmussen E (1992) Clustering algorithms. In: Frakes W, Baeza-Yates R (eds) Information retrieval: data structures and algorithms. Prentice-Hall, Englewood Cliffs, pp 419–442
Rungsawang A, Laohakanniyom A, Lertprasertkune M (2001) Low-cost parallel text retrieval using PC-cluster. In: Proc of Eur PVM/MPI, pp 419–426
Salton G, McGill M (1983) An introduction to modern information retrieval. McGraw-Hill, New York
Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Stanfill C (1989) Partitioned posting files: a parallel inverted file structure for information retrieval. In: Proc of ACM SIGIR, pp 413–428
Valiant L (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111
Valiant L (1991) General purpose parallel architectures. In: Leeuwen J (ed) Handbook of theoretical computer science, vol A. MIT Press, Cambridge, pp 943–971
van Rijsbergen C (1971) Information retrieval, 2nd edn. Butterworth-Heinemann, Newton
Vitter J (2001) External memory algorithms and data structures: dealing with massive data. ACM Comput Surv 33(2):209–271
Williams T (2000) A general-purpose model for heterogeneous computation. PhD Thesis, University of Central Florida, Orlando
Williams T, Parsons R (2000) The heterogeneous bulk synchronous parallel model. In: Proc of the IPDPS 2000 workshops on parallel and distrib process, pp 102–108
Witten I, Moffat A, Bell T (1999) Managing gigabytes: compressing and indexing documents and images, 2d edn. Morgan Kaufmann, San Francisco
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Konstantopoulos, C., Mamalis, B., Pantziou, G. et al. Efficient parallel Text Retrieval techniques on Bulk Synchronous Parallel (BSP)/Coarse Grained Multicomputers (CGM). J Supercomput 48, 286–318 (2009). https://doi.org/10.1007/s11227-008-0225-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-008-0225-x