Abstract
In this chapter we present a case study to demonstrate how current state-of-the-art Genetic Programming (GP) fares as a tool for the emerging field of Data Science. Data Science refers to the practice of extracting knowledge from data, often Big Data, to glean insights useful for predicting business, political, or societal outcomes. Data Science tools are important to the practice because they allow Data Scientists to be productive and accurate. GP has many features that make it well suited as a tool for Data Science, but it is not yet widely considered a Data Science method. We therefore performed a real-world comparison of GP with a popular Data Science method to understand its strengths and weaknesses. GP found equally strong solutions, leveraged the new Big Data infrastructure, and provided several benefits, such as direct feature importance and solution confidence. However, GP lacked the ability to quickly build and test models, required much more intensive computing power, and, due to its lack of commercial maturity, created some challenges for productization as well as for integration with data management and visualization capabilities. The lessons learned lead to several recommendations that provide a path for future research to focus on key areas to improve GP as a Data Science tool.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Gustafson, S., Narasimhan, R., Palla, R., Yousuf, A. (2016). Using Genetic Programming for Data Science: Lessons Learned. In: Riolo, R., Worzel, W., Kotanchek, M., Kordon, A. (eds) Genetic Programming Theory and Practice XIII. Genetic and Evolutionary Computation. Springer, Cham. https://doi.org/10.1007/978-3-319-34223-8_7
DOI: https://doi.org/10.1007/978-3-319-34223-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34221-4
Online ISBN: 978-3-319-34223-8
eBook Packages: Computer Science, Computer Science (R0)