Abstract
Incompleteness is a common data quality problem in real-world machine learning tasks. Many studies have addressed it, but most focus on classification, and only a few consider symbolic regression with missing values. This work proposes a new imputation method for symbolic regression with incomplete data that aims to improve both the effectiveness and the efficiency of imputing missing values. The method is based on genetic programming (GP) and weighted K-nearest neighbors (KNN). It constructs GP-based models that predict the missing values of incomplete features from the other available features, and it uses weighted KNN to select the instances on which these models are built. Experimental results on real-world data sets show that the proposed method outperforms a number of state-of-the-art methods in imputation accuracy, symbolic regression performance, and imputation time.
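The instance-selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the distance or the weighting scheme, so the sketch assumes a Euclidean distance over the features observed in both instances and inverse-distance weights. The GP-based model is replaced by a plain weighted mean of the neighbours' values as a placeholder predictor, and all function names are hypothetical.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over features observed (not None) in both instances,
    rescaled by the fraction of features that could be compared (an assumption)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

def select_neighbours(target, data, missing_idx, k):
    """Return up to k (distance, instance) pairs nearest to target,
    restricted to instances where feature missing_idx is observed."""
    candidates = []
    for row in data:
        if row is target or row[missing_idx] is None:
            continue
        d = partial_distance(target, row)
        if math.isfinite(d):
            candidates.append((d, row))
    candidates.sort(key=lambda pair: pair[0])  # stable sort on distance only
    return candidates[:k]

def impute(target, data, missing_idx, k=3):
    """Placeholder predictor: inverse-distance weighted mean of the neighbours.
    In the proposed method a GP-evolved model would be built on these instances."""
    neighbours = select_neighbours(target, data, missing_idx, k)
    if not neighbours:
        raise ValueError("no usable neighbours for this feature")
    weights = [1.0 / (d + 1e-9) for d, _ in neighbours]
    weighted_sum = sum(w * row[missing_idx] for w, (_, row) in zip(weights, neighbours))
    return weighted_sum / sum(weights)

# Toy data: the third feature of `target` is missing.
data = [[1.0, 2.0, 3.0], [1.0, 2.0, 5.0], [10.0, 20.0, 30.0]]
target = [1.0, 2.0, None]
estimate = impute(target, data, missing_idx=2, k=2)
```

With `k=2` the two instances that match the target on its observed features are selected, and the estimate lies midway between their values for the missing feature. Restricting model construction to such neighbours is what keeps the training set small, which is one plausible source of the reduced imputation time reported in the abstract.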




Availability of data and materials
The data used are obtained from the freely available public repositories UCI and OpenML.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Code availability
The methods are implemented in Python, mainly using the open-source package DEAP.
Cite this article
Al-Helali, B., Chen, Q., Xue, B. et al. A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data. Soft Comput 25, 5993–6012 (2021). https://doi.org/10.1007/s00500-021-05590-y
DOI: https://doi.org/10.1007/s00500-021-05590-y