Abstract
The fuzzy c-means (FCM) algorithm computes the membership degree of each data point with respect to each cluster center. This computation requires the distance matrix between the cluster centers and the data points, and computing the membership matrix for all data points is the main bottleneck of FCM. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means). The algorithm is based on the original FCM proposal, adapted to detect and remove the boundary regions of clusters. Our implementation efforts target two goals: processing large datasets in less time, and reducing the data volume while maintaining the quality of the clusters. Experiments on a significant volume of real data (> 10^6 records) indicate that the bdrFCM implementation scales well to datasets with millions of data points.
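To make the membership computation concrete, the following minimal sketch evaluates the standard FCM membership formula, u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)), from the point-to-center distances. The function names, the fuzzifier default m = 2 and the toy data are illustrative assumptions. In particular, `boundary_mask` and its `gap` threshold are a hypothetical stand-in for boundary detection, not the criterion used by bdrFCM, which the abstract does not specify.

```python
import math


def memberships(points, centers, m=2.0):
    """Return the FCM membership matrix U.

    U[i][j] is the degree to which points[i] belongs to the cluster
    centred at centers[j]; each row sums to 1. The fuzzifier m must be > 1.
    """
    exponent = 2.0 / (m - 1.0)
    U = []
    for x in points:
        # One row of the distance matrix: distances from x to every center.
        dists = [math.dist(x, c) for c in centers]
        zeros = [j for j, d in enumerate(dists) if d == 0.0]
        if zeros:
            # x coincides with one or more centers: share full membership.
            row = [1.0 / len(zeros) if j in zeros else 0.0
                   for j in range(len(centers))]
        else:
            row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                   for d_j in dists]
        U.append(row)
    return U


def boundary_mask(U, gap=0.2):
    """Hypothetical boundary test (an assumption, not the paper's rule).

    A point is flagged when its two largest memberships differ by less
    than `gap`, i.e. when it is ambiguous between two clusters.
    Requires at least two clusters.
    """
    mask = []
    for row in U:
        top = sorted(row, reverse=True)
        mask.append(top[0] - top[1] < gap)
    return mask


# Toy one-dimensional layout: two tight groups and one point midway.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (9.5, 0.0)]

U = memberships(points, centers)
mask = boundary_mask(U)

# Data reduction step of the sketch: keep only the non-boundary points.
reduced = [p for p, is_boundary in zip(points, mask) if not is_boundary]
```

In this toy example the midpoint (5, 0) is equidistant from both centers, so its memberships are equal and it is the only point dropped. The nested loop over points and centers also shows why the membership matrix dominates the cost on large datasets: the work grows with the number of points times the square of the number of clusters.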

Acknowledgements
This work has been supported by the Brazilian agencies CAPES, CNPq and FAPEMIG.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Luiz C. B. Torres is a fellow of CNPq, Brazil (No. 150254/2016-4).
Cite this article
Silva, G.R.L., Neto, P.C., Torres, L.C.B. et al. A fuzzy data reduction cluster method based on boundary information for large datasets. Neural Comput & Applic 32, 18059–18068 (2020). https://doi.org/10.1007/s00521-019-04049-4
DOI: https://doi.org/10.1007/s00521-019-04049-4