Abstract
The search for a good machine learning (ML) model is time-consuming and requires weighing many alternatives, including data preprocessing, algorithm selection, and hyperparameter tuning; an exhaustive search therefore faces a combinatorial explosion. In this work, we build a new automated machine learning (AutoML) system called CF-DAML, a distributed automated system based on collaborative filtering (CF), to address these challenges by recommending and training suitable models for supervised learning tasks. CF-DAML first computes a set of informative meta-features for a new dataset, then uses a weighted \(l_1\)-norm (W1-norm) to accurately identify the k nearest neighbors (kNN) of the new dataset, and finally recommends to the new dataset the top-N models that perform well on each of its neighbors. We also design a distributed system (DSTM) for training the models, which substantially reduces the training time. In addition, we develop a multilayer selective stacked ensemble system (MSSE), whose base models are selected from the suitable candidates according to their runtimes, classification accuracies, and diversity, to enhance the stability of CF-DAML. To our knowledge, this is the first work to combine memory-based CF with a selective stacked ensemble to solve the AutoML problem. Extensive experiments on many UCI datasets show that our approach outperforms current state-of-the-art methods.
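The following is a minimal sketch of the recommendation step described above, assuming hypothetical meta-feature vectors and a per-dataset model-performance table; the function names and data layout are illustrative, not CF-DAML's actual interface.

```python
import numpy as np

def w1_distance(x, y, w):
    """Weighted l1-norm (W1-norm) between two meta-feature vectors."""
    return np.sum(w * np.abs(x - y))

def recommend_models(new_mf, train_mfs, perf_table, weights, k=3, n=5):
    """Return the top-n models from each of the k nearest training datasets.

    new_mf     : meta-feature vector of the new dataset
    train_mfs  : (num_datasets, num_meta_features) meta-feature matrix
    perf_table : list of {model_id: accuracy} dicts, one per training dataset
    weights    : per-meta-feature weights of the W1-norm
    """
    dists = [w1_distance(new_mf, mf, weights) for mf in train_mfs]
    neighbors = np.argsort(dists)[:k]           # k nearest historical datasets
    recommended = []
    for i in neighbors:
        # models that performed best on this neighbor, best first
        ranked = sorted(perf_table[i], key=perf_table[i].get, reverse=True)
        recommended.extend(ranked[:n])          # top-N per neighbor
    return list(dict.fromkeys(recommended))     # de-duplicate, keep order

# Toy usage: 3 historical datasets, 2 meta-features, 4 candidate models
mfs = np.array([[0.1, 2.0], [0.5, 1.0], [0.9, 3.0]])
perf = [{"svm": 0.91, "rf": 0.88}, {"rf": 0.93, "knn": 0.80}, {"mlp": 0.85}]
print(recommend_models(np.array([0.2, 1.8]), mfs, perf,
                       weights=np.array([1.0, 0.5]), k=2, n=1))
```

The recommended candidates would then be trained in parallel (the role of DSTM) and the survivors stacked by MSSE.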
Acknowledgements
This work was supported in part by the National Key R&D Program of China under Grant No. 2019YFB1706202.
Appendices
Appendix A: The models and their hyperparameters in the DataBase
Each number in brackets in the first column indicates the number of times the associated model appears in the DataBase. The second column lists the hyperparameters of the models, where \(\{a, b, \ldots\}\) denotes a discrete set of values; \([a, b]\) denotes the integer values \(a, a+1, a+2, \ldots, b\); \(U(a, b)\) denotes a uniform distribution over \((a, b)\); and \(e^{\lambda}(a, b)\) denotes a value drawn from an exponential distribution with parameter \(\lambda\), restricted to the range \((a, b)\). Preprocessors and their hyperparameter configurations are shown in Table 8, and classifiers and their hyperparameter configurations are shown in Table 9.
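As a concrete illustration of these four notations, the sketch below samples one configuration from a search space; the specification format and the SVM space are hypothetical, not the DataBase's actual encoding.

```python
import random

def sample(spec):
    """Draw one value from a hyperparameter specification.

    ("choice", [a, b, ...])   -- the discrete set {a, b, ...}
    ("int", a, b)             -- the integer range [a, b]
    ("uniform", a, b)         -- U(a, b)
    ("exp", lam, a, b)        -- exponential with rate lam, restricted to (a, b)
    """
    kind = spec[0]
    if kind == "choice":
        return random.choice(spec[1])
    if kind == "int":
        return random.randint(spec[1], spec[2])
    if kind == "uniform":
        return random.uniform(spec[1], spec[2])
    if kind == "exp":
        lam, a, b = spec[1:]
        while True:                      # rejection sampling into (a, b)
            v = random.expovariate(lam)
            if a < v < b:
                return v
    raise ValueError(f"unknown spec: {spec}")

# Example: a hypothetical SVM configuration
svm_space = {"kernel": ("choice", ["rbf", "poly"]),
             "degree": ("int", 2, 5),
             "C":      ("exp", 1.0, 0.01, 100.0),
             "gamma":  ("uniform", 1e-4, 1e-1)}
print({name: sample(spec) for name, spec in svm_space.items()})
```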
Appendix B: The distance correlation coefficients between selected meta-features
Here, we present the Dcc values for DatasetRatio (DR), InverseDatasetRatio (IDR), LogDatasetRatio (LDR), and LogInverseDatasetRatio (LIDR), in Table 10; for LogNumberOfFeatures (LOF) and NumberOfFeatures (NOF), in Table 11; for LogNumberOfInstances (LONI) and NumberOfInstances (NOI), in Table 12; for RatioNominalToNumerical (RNoTNu) and RatioNumericalToNominal (RNuTNo), in Table 13; for NumberOfClasses (NOC) and ClassEntropy (CE), in Table 14; for NumberOfCategoricalFeatures (NOCF) and SymbolsSum (SS), in Table 15; for NumberOfNumericFeatures (NuONuF) and LogNumberOfFeatures (LNuOF), in Table 16; and for PCAKurtosisFirstPC (PCAKFPC) and KurtosisMin (KM), in Table 17.
The physical meanings of DR, LDR, IDR, and LIDR all involve comparing the number of rows with the number of columns of a dataset. Because the distance correlation coefficients between LDR and the other three meta-features are very large, LDR is retained and the other three meta-features are discarded (Tables 18 and 19).
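For completeness, here is a self-contained sketch of the distance correlation coefficient (in the sense of Székely and Rizzo) and of pruning one meta-feature from each highly correlated pair; the 0.9 threshold and the toy meta-features are illustrative, not the values used in the paper.

```python
import numpy as np

def dcc(x, y):
    """Distance correlation between two 1-D samples (Szekely & Rizzo)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])             # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                          # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def prune(meta, names, threshold=0.9):
    """Keep a meta-feature only if it is not redundant with any kept one."""
    keep = []
    for j in range(meta.shape[1]):
        if all(dcc(meta[:, i], meta[:, j]) <= threshold for i in keep):
            keep.append(j)
    return [names[i] for i in keep]

rng = np.random.default_rng(0)
nof = rng.uniform(2, 200, size=50)                  # NumberOfFeatures
meta = np.column_stack([nof, np.log(nof), rng.normal(size=50)])
print(prune(meta, ["NOF", "LOF", "Noise"]))         # LOF is dropped as redundant
```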
Appendix C: Detailed attributes of the experimental datasets
Cite this article
Liu, P., Pan, F., Zhou, X. et al. CF-DAML: Distributed automated machine learning based on collaborative filtering. Appl Intell 52, 17145–17169 (2022). https://doi.org/10.1007/s10489-021-03049-z