Abstract
Effective data distribution techniques can significantly reduce the total execution time of a program in grid computing environments, especially for data mining applications. In this paper, we describe a linear programming formulation for the data distribution problem on grids. Furthermore, we propose a heuristic method, the Heuristic Data Distribution Scheme (HDDS), to solve this problem. We implement two types of data mining applications, Association Rule Mining and Decision Tree Construction, and conduct experiments on grid testbeds. Experimental results show that data mining programs that use the proposed HDDS to distribute data execute more efficiently than those that use traditional schemes.
Cite this article
Shih, WC., Yang, CT. & Tseng, SS. Performance-based data distribution for data mining applications on grid computing environments. J Supercomput 52, 171–198 (2010). https://doi.org/10.1007/s11227-009-0286-5
DOI: https://doi.org/10.1007/s11227-009-0286-5