Abstract
Automatic feature engineering aims to construct informative features automatically, reducing the manual labor required for machine learning applications. Most existing approaches are designed for tasks with a single data source, which makes them less applicable to real-world scenarios. In this paper, we present DAFEE, a distributed automatic feature engineering algorithm that generates features across multiple large-scale relational datasets. Starting from the target table, DAFEE performs a breadth-first search to find related tables and constructs advanced high-order features that are remarkably effective in practical applications. Moreover, DAFEE incorporates a feature selection method to reduce computational cost and improve predictive performance, and it is highly optimized for processing massive volumes of data. Experimental results demonstrate that it improves predictive performance by 7% compared with state-of-the-art algorithms.
W. Zhao and X. Li—These authors contributed equally to this work.
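To make the traversal idea in the abstract concrete, below is a minimal PySpark sketch of BFS-style feature generation over related tables. All table names, join keys, the `relations`/`numeric_cols` structures, and the `generate_features` helper are hypothetical illustrations rather than the paper's implementation; for brevity, the sketch also assumes each join key is already present in the target table, whereas deeper relations would in general require joining along the full path of tables.

```python
# Hypothetical sketch of BFS-style relational feature generation, loosely
# following the idea described in the abstract. Table names, join keys, and
# the data structures below are illustrative assumptions, not the DAFEE
# implementation.
from collections import deque

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bfs-feature-sketch").getOrCreate()


def generate_features(target: DataFrame,
                      tables: dict,        # {table_name: DataFrame}
                      relations: dict,     # {table_name: [(related_table, join_key), ...]}
                      numeric_cols: dict,  # {table_name: [columns to aggregate]}
                      max_depth: int = 2) -> DataFrame:
    """Traverse related tables breadth-first and join aggregation features
    back onto the target table. Assumes every join key also exists in the
    target table (a simplification for this sketch)."""
    result = target
    queue = deque([("target", 0)])
    visited = {"target"}
    while queue:
        name, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for related, key in relations.get(name, []):
            if related in visited:
                continue
            visited.add(related)
            # Aggregate the numeric columns of the related table per join key;
            # tables reached at deeper levels yield higher-order features.
            aggs = [f(F.col(c)).alias(f"{related}_{c}_{f.__name__}")
                    for c in numeric_cols.get(related, [])
                    for f in (F.sum, F.avg, F.max)]
            if aggs:
                agg_df = tables[related].groupBy(key).agg(*aggs)
                result = result.join(agg_df, on=key, how="left")
            queue.append((related, depth + 1))
    return result
```

In a full pipeline such as the one the abstract describes, a feature selection step would then prune the generated columns before model training to keep the computational cost manageable.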
Cite this paper
Zhao, W., Li, X., Rong, G., Lin, M., Lin, C., Yang, Y.: DAFEE: a scalable distributed automatic feature engineering algorithm for relational datasets. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol. 12453. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60239-0_3