Abstract
Frequent itemset mining is considered a popular tool to discover knowledge from transactional datasets. It also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns in which the apriori algorithm is considered as the earliest proposed. Apriori has two significant bottlenecks associated with it: first, repeated scanning of input dataset and second, the requirement of generation of all the candidate itemsets before counting its support value. These bottlenecks reduce the effectiveness of apriori for large-scale datasets. Reasonable efforts have been made to diminish these bottlenecks so that efficiency can be improved. Especially, when the data size is larger, even distributed and parallel environments like MapReduce does not perform well due to the iterative nature of the algorithm that incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme named efficient apriori-based frequent itemset mining (EAFIM) presents two novel methods to improve the efficiency further. Unlike apriori, it generates the candidates ‘on-the-fly,’ i.e., candidate generation, and count of its support values go simultaneously when the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it calculates the updated input dataset by removing useless items and transactions. Reduction in size of the input dataset for higher iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.





Similar content being viewed by others
References
Aggarwal CC (2015) Data mining: the textbook. Springer, Berlin
Aggarwal CC, Bhuiyan MA, Al Hasan M (2014) Frequent pattern mining algorithms: a survey. In: Aggarwal CC, Han J (eds) Frequent pattern mining. Springer, Berlin, pp 19–64
Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases, VLDB. vol 1215
Anshu C, Raghuvanshi CS (2010) An algorithm for frequent pattern mining based on apriori. (IJCSE) Int J Comput Sci Eng 2(04):942–947
Borgelt C (2003) Efficient implementations of apriori and eclat. In: FIMI’03: Proceedings of the IEEE ICDM workshop on frequent itemset mining implementations
Lin X (2014) MR-apriori: association rules algorithm based on mapreduce. In: 2014 5th IEEE international conference on software engineering and service science (ICSESS). IEEE
Woo J (2012) Apriori-map/reduce algorithm. In: Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA). The steering committee of the world congress in computer science, computer engineering and applied computing (WorldComp)
Lin M-Y, Lee P-Y, Hsueh S-C (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. ACM
Li N et al (2012) Parallel implementation of apriori algorithm based on mapreduce. In: 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing. IEEE
Yahya O, Hegazy O, Ezat E (2012) An efficient implementation of Apriori algorithm based on Hadoop–Mapreduce model. Int J Rev Comput 12:1–9
Apache hadoop (2013) http://hadoop.apache.org/
Apache Spark (2016) Lightning-fast cluster computing. The Apache Software Foundation, http://spark.apache.org/. Spark1.6.1, 9 March 2016, Web
Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning spark: lightning-fast big data analysis. O’Reilly Media Inc., Sebastopol
Moens S, Aksehirli E, Goethals B (2013) Frequent itemset mining for big data. In: BigData conference
Yu K-M et al (2010) A load-balanced distributed parallel mining algorithm. Expert Syst Appl 37(3):2459–2464
Thabtah F, Hammoud S (2012) Mr-arm: a map-reduce association rule mining framework. Parallel Process Lett 23(03):1350012
Aouad LM, Le-Khac N-A, Kechadi TM (2010) Performance study of distributed Apriori-like frequent itemsets mining. Knowl Inf Syst 23(1):55–72
Chen Z, Cai S, Song Q, Zhu C (2011) An improved Apriori algorithm based on pruning optimization and transaction reduction. In: 2nd international conference on artificial intelligence, Management Science and Electronic Commerce (AIMSEC), Dengleng, 2011, pp 1908–1911
Zhang F et al (2015) A distributed frequent itemset mining algorithm using Spark for big data analytics. Clust Comput 18(4):1493–1501
Qiu H et al (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In: Parallel and distributed processing symposium workshops (IPDPSW), 2014 IEEE international. IEEE
Rathee S, Kaul M, Kashyap A (2015) R-Apriori: an efficient apriori based algorithm on spark. In: Proceedings of the 8th workshop on Ph.D. workshop in information and knowledge management. ACM
Dharavath R, Raj S (2017) Quantitative analysis of frequent itemsets using Apriori algorithm on Apache Spark framework. In: Behera HS, Mohapatra DP (eds) Computational intelligence in data mining. Springer, Singapore, pp 261–272
Sethi KK, Ramesh D (2017) HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing. J Supercomput 73(8):3652–3668
Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proceedings 2001 IEEE international conference on data mining. IEEE
Dataset (2018) http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
Dataset (2018) http://fimi.ua.ac.be/data/
Acknowledgements
The authors would like to express their sincere thanks to the editor and anonymous reviewers for their valuable suggestions, which helped us to improve the paper significantly. The authors would like to express their heartiest thanks to the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for providing their research support.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Raj, S., Ramesh, D., Sreenu, M. et al. EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowl Inf Syst 62, 3565–3583 (2020). https://doi.org/10.1007/s10115-020-01464-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-020-01464-1