Abstract
The search for a good machine learning (ML) model is time-consuming and requires weighing many alternatives, including data preprocessing, algorithm selection, and hyperparameter tuning; an exhaustive search therefore faces a combinatorial explosion. In this work, we build a new automated machine learning (AutoML) system called CF-DAML, a distributed automated system based on collaborative filtering (CF), to address these challenges by recommending and training suitable models for supervised learning tasks. CF-DAML first computes a set of informative meta-features for a new dataset, then uses a weighted \(l_1\)-norm (W1-norm) to accurately identify the k nearest neighbors (kNN) of the new dataset, and finally recommends to the new dataset the top-N models that perform well on each of its neighbors. We also design a distributed system (DSTM) for training the models, which substantially reduces the training time. In addition, we develop a multilayer selective stacked ensemble system (MSSE), whose base models are selected from the suitable candidates according to their runtimes, classification accuracies, and diversity, to enhance the stability of CF-DAML. To our knowledge, this is the first work to combine memory-based CF with a selective stacked ensemble to solve the AutoML problem. Extensive experiments on many UCI datasets show that our approach outperforms current state-of-the-art methods.
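The following is a minimal sketch of the recommendation step described above, assuming hypothetical meta-feature vectors and a per-dataset model-performance table; the function names and data layout are illustrative, not CF-DAML's actual interface.

```python
import numpy as np

def w1_distance(x, y, w):
    """Weighted l1-norm (W1-norm) between two meta-feature vectors."""
    return np.sum(w * np.abs(x - y))

def recommend_models(new_mf, train_mfs, perf_table, weights, k=3, n=5):
    """Return the top-n models from each of the k nearest training datasets.

    new_mf     : meta-feature vector of the new dataset
    train_mfs  : (num_datasets, num_meta_features) meta-feature matrix
    perf_table : list of {model_id: accuracy} dicts, one per training dataset
    weights    : per-meta-feature weights of the W1-norm
    """
    dists = [w1_distance(new_mf, mf, weights) for mf in train_mfs]
    neighbors = np.argsort(dists)[:k]           # k nearest historical datasets
    recommended = []
    for i in neighbors:
        # models that performed best on this neighbor, best first
        ranked = sorted(perf_table[i], key=perf_table[i].get, reverse=True)
        recommended.extend(ranked[:n])          # top-N per neighbor
    return list(dict.fromkeys(recommended))     # de-duplicate, keep order

# Toy usage: 3 historical datasets, 2 meta-features, 4 candidate models
mfs = np.array([[0.1, 2.0], [0.5, 1.0], [0.9, 3.0]])
perf = [{"svm": 0.91, "rf": 0.88}, {"rf": 0.93, "knn": 0.80}, {"mlp": 0.85}]
print(recommend_models(np.array([0.2, 1.8]), mfs, perf,
                       weights=np.array([1.0, 0.5]), k=2, n=1))
```

The recommended candidates would then be trained in parallel (the role of DSTM) and the survivors stacked by MSSE.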
Acknowledgements
This work was supported in part by the National Key R&D Program of China under Grant No. 2019YFB1706202.
Appendices
Appendix A: The models and their hyperparameters in the DataBase
Each number in brackets in the first column indicates the number of times the associated model appears in the DataBase. The second column lists the hyperparameters of the models, where \(\{a, b, \ldots\}\) denotes a discrete set of values; \([a, b]\) denotes the integer values \(a, a+1, a+2, \ldots, b\); \(U(a, b)\) denotes a uniform distribution over \((a, b)\); and \(e^{\lambda}(a, b)\) denotes a value drawn from an exponential distribution with parameter \(\lambda\), restricted to the range \((a, b)\). Preprocessors and their hyperparameter configurations are shown in Table 8, and classifiers and their hyperparameter configurations are shown in Table 9.
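As a concrete illustration of these four notations, the sketch below samples one configuration from a search space; the specification format and the SVM space are hypothetical, not the DataBase's actual encoding.

```python
import random

def sample(spec):
    """Draw one value from a hyperparameter specification.

    ("choice", [a, b, ...])   -- the discrete set {a, b, ...}
    ("int", a, b)             -- the integer range [a, b]
    ("uniform", a, b)         -- U(a, b)
    ("exp", lam, a, b)        -- exponential with rate lam, restricted to (a, b)
    """
    kind = spec[0]
    if kind == "choice":
        return random.choice(spec[1])
    if kind == "int":
        return random.randint(spec[1], spec[2])
    if kind == "uniform":
        return random.uniform(spec[1], spec[2])
    if kind == "exp":
        lam, a, b = spec[1:]
        while True:                      # rejection sampling into (a, b)
            v = random.expovariate(lam)
            if a < v < b:
                return v
    raise ValueError(f"unknown spec: {spec}")

# Example: a hypothetical SVM configuration
svm_space = {"kernel": ("choice", ["rbf", "poly"]),
             "degree": ("int", 2, 5),
             "C":      ("exp", 1.0, 0.01, 100.0),
             "gamma":  ("uniform", 1e-4, 1e-1)}
print({name: sample(spec) for name, spec in svm_space.items()})
```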
Appendix B: The distance correlation coefficients between selected meta-features
Here, we present the Dcc values for DatasetRatio (DR), InverseDatasetRatio (IDR), LogDatasetRatio (LDR), and LogInverseDatasetRatio (LIDR), in Table 10; for LogNumberOfFeatures (LOF) and NumberOfFeatures (NOF), in Table 11; for LogNumberOfInstances (LONI) and NumberOfInstances (NOI), in Table 12; for RatioNominalToNumerical (RNoTNu) and RatioNumericalToNominal (RNuTNo), in Table 13; for NumberOfClasses (NOC) and ClassEntropy (CE), in Table 14; for NumberOfCategoricalFeatures (NOCF) and SymbolsSum (SS), in Table 15; for NumberOfNumericFeatures (NuONuF) and LogNumberOfFeatures (LNuOF), in Table 16; and for PCAKurtosisFirstPC (PCAKFPC) and KurtosisMin (KM), in Table 17.
The physical meanings of DR, LDR, IDR, and LIDR all involve comparing the number of rows with the number of columns of a dataset. Because the distance correlation coefficients between LDR and the other three meta-features are very large, LDR is retained and the other three meta-features are discarded (Tables 18 and 19).
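For completeness, here is a self-contained sketch of the distance correlation coefficient (in the sense of Székely and Rizzo) and of pruning one meta-feature from each highly correlated pair; the 0.9 threshold and the toy meta-features are illustrative, not the values used in the paper.

```python
import numpy as np

def dcc(x, y):
    """Distance correlation between two 1-D samples (Szekely & Rizzo)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])             # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                          # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def prune(meta, names, threshold=0.9):
    """Keep a meta-feature only if it is not redundant with any kept one."""
    keep = []
    for j in range(meta.shape[1]):
        if all(dcc(meta[:, i], meta[:, j]) <= threshold for i in keep):
            keep.append(j)
    return [names[i] for i in keep]

rng = np.random.default_rng(0)
nof = rng.uniform(2, 200, size=50)                  # NumberOfFeatures
meta = np.column_stack([nof, np.log(nof), rng.normal(size=50)])
print(prune(meta, ["NOF", "LOF", "Noise"]))         # LOF is dropped as redundant
```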
Appendix C: Detailed attributes of the experimental datasets
Cite this article
Liu, P., Pan, F., Zhou, X. et al. CF-DAML: Distributed automated machine learning based on collaborative filtering. Appl Intell 52, 17145–17169 (2022). https://doi.org/10.1007/s10489-021-03049-z