Abstract
One of the concepts that attracts attention since entering of big data era is the graph-structured data. Suitable frameworks to handle such data would face several constraints, especially scalability, partitioning challenges, processing complexity and hardware configurations. Unfortunately, although several works deal with big data issues, there is a lack of literature review concerning the challenges related to query answering on large-scale graph data. In this survey paper, we review current problems related to the partitioning and processing of graph-structured data. We discuss existing graph processing systems and provide some insights to know how to choose the right system for parallel and distributed processing of large-scale graph data. Finally, we survey current open challenges in this field.
Similar content being viewed by others
References
Armstrong, T.G., Ponnekanti, V., Borthakur, D., Callaghan, M.: LinkBench: a database benchmark based on the facebook social graph. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pp. 1185–1196. ACM, New York (2013)
Watts, D.J., Strogatz, S.H.: Collective dynamics of ’small-world’ networks. Nature 393(6684), 440–442 (1998)
Travers, J., Milgram, S.: An experimental study of the small world problem. Sociometry 32(4), 425–443 (1969)
Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47 (2002)
Abeywickrama, T., Cheema, M.A., Taniar, D.: K-nearest neighbors on road networks: a journey in experimentation and in-memory implementation. Proc. VLDB Endow. 9(6), 492–503 (2016)
Beutel, A.: User behavior modeling with large-scale graph analysis. PhD thesis, University of Trento (2016)
Czerepicki, A.: Application of graph databases for transport purposes. Bull. Pol. Acad. Sci. Tech. Sci. 64(3), 457–466 (2016)
Miler, M., Medak, D., Odobašióc, D.: The shortest path algorithm performance comparison in graph and relational database on a transportation network. Promet Traffic Transp. 26(1), 75–82 (2014)
Have, C.T., Jensen, L.J.: Are graph databases ready for bioinformatics? Bioinformatics 29(24), 3107–3108 (2013)
Yoon, B.-H., Kim, S.-K., Kim, S.-Y.: Use of graph database for the integration of heterogeneous biological data. Genomics Inform. 15(1), 19–27 (2017)
Adoni, W.Y.H., Nahhal, T., Aghezzaf, B., Elbyed, A.: MRA*: parallel and distributed path in large-scale graph using MapReduce-A* based approach. In: Ubiquitous Networking, Lecture Notes in Computer Science, pp. 390–401. Springer, Cham (2017)
Aridhi, S., d’Orazio, L., Maddouri, M., Mephu, N.E.: Density-based data partitioning strategy to approximate large-scale subgraph mining. Inf. Syst. 48, 213–223 (2015)
Lakhotia, K., Kannan, R., Prasanna, V.: Accelerating pagerank using partition-centric processing. In: 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston (2018)
Plimpton, S.J., Devine, K.D.: MapReduce in MPI for large-scale graph algorithms. Parallel Comput. 37(9), 610–632 (2011)
Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.S.: The web as a graph: measurements, models, and methods. In: Computing and Combinatorics, Lecture Notes in Computer Science, pp. 1–17. Springer, Berlin (1999)
Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl. Based Syst. 117, 3–15 (2016)
Guo, K., Guo, W., Chen, Y., Qiu, Q., Zhang, Q.: Community discovery by propagating local and global information based on the MapReduce model. Inf. Sci. 323, 73–93 (2015)
Moon, S., Lee, J.-G., Kang, M., Choy, M., Lee, J.-W.: Parallel community detection on large graphs with MapReduce and GraphChi. Data Knowl. Eng. 104, 17–31 (2016)
Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 269–274. ACM, New York (2001)
Junghanns, M., Petermann, A., Neumann, M., Rahm, E.: Management and analysis of big graph data: current systems and open challenges. In: Handbook of Big Data Technologies, pp. 457–505. Springer (2017)
Skhiri, S., Jouili, S.: Large graph mining: recent developments, challenges and potential solutions. In: European Business Intelligence Summer School, pp. 103–124. Springer (2012)
Adoni, W.Y.H., Nahhal, T., Aghezzaf, B., Elbyed, A.: The MapReduce-based approach to improve the shortest path computation in large-scale road networks: the case of A* algorithm. J. Big Data 5(1), 16 (2018)
Cossalter, M., Mengshoel, O., Selker, T.: Visualizing and understanding large-scale Bayesian networks. In: Proceedings of the 17th AAAI Conference on Scalable Integration of Analytics and Visualization, AAAIWS’11-17, pp. 12–21. AAAI Press, Menlo Park (2011)
Gantz, J., Reinsel, D.: Extracting value from chaos. IDC iView 1142(2011), 1–12 (2011)
Alekseev, V.E., Boliac, R., Korobitsyn, D.V., Lozin, V.V.: NP-hard graph problems and boundary classes of graphs. Theor. Comput. Sci. 389(1), 219–236 (2007)
Cameron, K., Eschen, E.M., Hoáng, C.T., Sritharan, R.: The complexity of the list partition problem for graphs. SIAM J. Discret. Math. 21(4), 900–929 (2008)
Yan, D., Tian, Y., Cheng, J.: Systems for Big Graph Analytics. Springer Briefs in Computer Science. Springer, Cham (2017)
Goel, A.: Neo4j Cookbook Harness the Power of Neo4j to Perform Complex Data Analysis over the Course of 75 Easy-to-Follow Recipes. Packt Publishing, Birmingham (2015)
Guerrieri, A.: Distributed computing for large-scale graphs. PhD thesis, University of Trento (2015)
Yan, D., Huang, L., Jordan, M.I.: Fast approximate spectral clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 907–916. ACM, New York (2009)
Martin, C.H.: Spectral clustering: a quick overview. PhD thesis (2012)
Filippone, M., Camastra, F., Masulli, F., Rovetta, S.: A survey of kernel and spectral methods for clustering. Pattern Recognit. 41(1), 176–190 (2008)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pp. 849–856. MIT Press, Cambridge (2001)
Kong, H., Akakin, H.C., Sarma, S.E.: A generalized Laplacian of Gaussian filter for Blob detection and its applications. IEEE Trans. Cybern. 43(6), 1719–1733 (2013)
Kamvar, S.D., Klein, D., Manning, C.D.: Spectral learning. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI’03, pp. 561–566. Morgan Kaufmann Publishers Inc., San Francisco (2003)
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pp. 551–556. ACM, New York (2004)
Qiu, Y., Li, R., Li, J., Qiao, S., Wang, G., Yu, J.X., Mao, R.: Efficient structural clustering on probabilistic graphs. IEEE Trans. Knowl. Data Eng. 31, 1555–1568 (2018)
Aggarwal, C.C., Wang, H.: A survey of clustering algorithms for graph data. In: Aggarwal, C.C., Wang, H. (eds.) Managing and Mining Graph Data, vol. 40, pp. 275–301. Springer, Boston (2010)
Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49(2), 291–307 (1970)
Fiduccia, C.M., Mattheyses, R.M.: A linear-time heuristic for improving network partitions. In: Proceedings of the 19th Design Automation Conference, DAC ’82, pp. 175–181. IEEE Press, Piscataway (1982)
Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 359–392 (1998)
Karypis, G., Kumar, V.: Multilevel algorithms for multi-constraint graph partitioning. In: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC ’98, pp. 1–13. IEEE Computer Society, Washington, DC (1998)
Karypis, G., Kumar, V.: Multilevel K-way hypergraph partitioning. In: Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC ’99, pp. 343–348. ACM, New York (1999)
Schloegel, K., Karypis, G., Kumar, V.: Parallel multilevel algorithms for multi-constraint graph partitioning. In: Euro-Par 2000 Parallel Processing. Lecture Notes in Computer Science, pp. 296–310. Springer, Berlin (2000)
Apache Spark-Lightning-Fast Cluster Computing. https://spark.apache.org/
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10(10), 95 (2010)
Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pp. 31–46. USENIX Association, Berkeley (2012)
Johnson, D.S., Aragon, C.R., McGeoch, L.A., Schevon, C.: Optimization by simulated annealing: an experimental evaluation; part I, graph partitioning. Oper. Res. 37(6), 865–892 (1989)
Rolland, E., Pirkul, H., Glover, F.: Tabu search for graph partitioning. Ann. Oper. Res. 63, 209–232 (1996)
Bui, T.N., Strite, L.C.: An ant system algorithm for graph bisection. In: Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, GECCO’02, pp. 43–51. Morgan Kaufmann Publishers Inc., San Francisco (2002)
Maini, H., Mehrotra, K., Mohan, C., Ranka, S.: Genetic algorithms for graph partitioning and incremental graph partitioning. In: Proceedings of the 1994 ACM/IEEE Conference on Supercomputing, Supercomputing ’94, pp. 449–457. IEEE Computer Society Press, Los Alamitos (1994)
Kim, J., Hwang, I., Kim, Y.-H., Moon, B.-R.: Genetic approaches for graph partitioning: a survey. In: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO ’11, pp. 473–480. ACM, New York (2011)
Chen, R., Weng, X., He, B., Choi, B., Yang, M.: Network Performance Aware Graph Partitioning for Large Graph Processing Systems in the Cloud. Nanyang Technological University, Singapore (2014)
Aggarwal, C.C., Zhao, Y., Yu, P.S.: A framework for clustering massive graph streams. Stat. Anal. Data Min. 3(6), 399–416 (2010)
Tsourakakis, C., Gkantsidis, C., Radunovic, B., Vojnovic, M.: FENNEL: streaming graph partitioning for massive scale graphs. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pp. 333–342. ACM, New York (2014)
Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: Distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pp. 17–30. USENIX Association, Berkeley (2012)
Rahimian, F., Payberah, A.H., Girdzijauskas, S., Jelasity, M., Haridi, S.: A distributed algorithm for large-scale graph partitioning. ACM Trans. Auton. Adapt. Syst. 10(2), 1–24 (2015)
Rahimian, F., Payberah, A.H., Girdzijauskas, S., Haridi, S.: Distributed vertex-cut partitioning. In: IFIP International Conference on Distributed Applications and Interoperable Systems, pp. 186–200. Springer (2014)
Stanton, I., Kliot, G.: Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 1222–1230. ACM, New York (2012)
Tashkova, K., Koros̆ec, P., S̆ilc, J.: A distributed multilevel ant-colony algorithm for the multi-way graph partitioning. Int. J. Bio-Inspired Comput. 3(5), 286–296 (2011)
White, T.: Hadoop: The Definitive Guide, 3rd edn. O’Reilly, Beijing (2012)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Vavilapalli, V.K., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H.: Apache Hadoop YARN: Yet Another Resource Negotiator, pp. 1–16. ACM Press, Santa Clara (2013)
Al hajj Hassan, M., Bamha, M.: Handling Limits of High Degree Vertices in Graph Processing Using MapReduce and Pregel, Research Report. Université Orléans, INSA Centre Val de Loire, LIFO EA 4022, Orléans (2017)
Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)
Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From think like a vertex to think like a graph. Proc. VLDB Endow. 7(3), 193–204 (2013)
Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 135–146. ACM, New York (2010)
Sagharichian, M., Naderi, H., Haghjoo, M.: ExPregel: a new computational model for large-scale graph processing. Concur. Comput. Pract. Exp. 27(17), 4954–4969 (2015)
Ching, A.: Giraph: large-scale graph processing infrastructure on Hadoop. In: Proceedings of the Hadoop Summit, Vol. 11 of 3, Santa Clara, pp. 5–9 (2011)
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.: GraphLab: A new framework for parallel machine learning. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI’10, pp. 340–349. AUAI Press, Arlington (2010)
Salihoglu, S., Widom, J.: Gps: a graph processing system. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management, SSDBM, vol. 22, pp. 1–12. ACM, New York (2013)
Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: a block-centric framework for distributed computation on real-world graphs. Proc. VLDB Endow. 7(14), 1981–1992 (2014)
Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: a resilient distributed graph system on spark. In: First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pp. 1–6. ACM, New York (2013)
Junghanns, M., Petermann, A., Gómez, K., Rahm, E.: GRADOOP: Scalable Graph Data Management and Analytics with Hadoop. CoRR abs/1506.00548
Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: ACM SIGOPS Operating Systems Review, vol. 37, pp. 29–43. ACM, New York (2003)
Schank, T., Wagner, D.: Finding, counting and listing all triangles in large graphs, an experimental study. In: Nikoletseas, S.E. (ed.) Experimental and Efficient Algorithms. Lecture Notes in Computer Science, pp. 606–609. Springer, Berlin (2005)
Jain, N., Liao, G., Willke, T.L.: Graphbuilder: scalable graph ETL framework. In: First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pp. 1–6. ACM, New York (2013)
Chonbodeechalermroong, A., Hewett, R.: Towards visualizing big data with large-scale edge constraint graph drawing. Big Data Res. 10, 21–32 (2017)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Adoni, H.W.Y., Nahhal, T., Krichen, M. et al. A survey of current challenges in partitioning and processing of graph-structured data in parallel and distributed systems. Distrib Parallel Databases 38, 495–530 (2020). https://doi.org/10.1007/s10619-019-07276-9
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-019-07276-9