{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,6]],"date-time":"2024-09-06T20:48:07Z","timestamp":1725655687347},"reference-count":71,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2020,5]]},"abstract":"Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal \"human-in-the-loop\" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. 
We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.","DOI":"10.14778\/3397230.3397235","type":"journal-article","created":{"date-parts":[[2020,6,29]],"date-time":"2020-06-29T11:46:24Z","timestamp":1593431184000},"page":"1373-1387","source":"Crossref","is-referenced-by-count":39,"title":["ARDA"],"prefix":"10.14778","volume":"13","author":[{"given":"Nadiia","family":"Chepurko","sequence":"first","affiliation":[{"name":"MIT CSAIL"}]},{"given":"Ryan","family":"Marcus","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]},{"given":"Emanuel","family":"Zgraggen","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]},{"given":"Raul Castro","family":"Fernandez","sequence":"additional","affiliation":[{"name":"University of Chicago"}]},{"given":"Tim","family":"Kraska","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]},{"given":"David","family":"Karger","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]}],"member":"320","published-online":{"date-parts":[[2020,6,26]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Microsoft Azure Services http:\/\/www.microsoft.com\/azure\/."},{"key":"e_1_2_1_2_1","unstructured":"NYU Auctus https:\/\/datamart.d3m.vida-nyu.org\/."},{"key":"e_1_2_1_3_1","unstructured":"A. Athalye L. Engstrom A. Ilyas and K. Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397 2017."},{"key":"e_1_2_1_4_1","unstructured":"B. Baker O. Gupta N. Naik and R. Raskar. 
Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 2016."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305429"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for unbounded searching. Information processing letters 5(SLAC-PUB-1679) 1976.","DOI":"10.1016\/0020-0190(76)90071-5"},{"key":"e_1_2_1_7_1","unstructured":"A. Bhardwaj S. Bhattacherjee A. Chavan A. Deshpande A. J. Elmore S. Madden and A. G. Parameswaran. Datahub: Collaborative data science & dataset version management at scale. arXiv preprint arXiv:1409.0798 2014."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824100"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824035"},{"key":"e_1_2_1_10_1","unstructured":"L. Breiman. Bias Variance and Arcing Classifiers. 
Technical report 1996."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313685"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687750"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2009.2035722"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10208-009-9045-5"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/tit.2005.862083"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3128572.3140444"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3058740"},{"key":"e_1_2_1_18_1","first-page":"209","volume-title":"Advances in neural information processing systems","author":"Cawley G. C.","year":"2007"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compeleceng.2013.11.024"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2488608.2488620"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"A. C. Davison. Statistical models volume 11. Cambridge University Press 2003.","DOI":"10.1017\/CBO9780511815850"},{"key":"e_1_2_1_22_1","volume-title":"University of Stavanger","author":"Deng L.","year":"2018"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"P. Domingos. The role of occam's razor in knowledge discovery. Data mining and knowledge discovery 3(4):409--425 1999.","DOI":"10.1023\/A:1009868929893"},{"key":"e_1_2_1_24_1","unstructured":"B. Donovan and D. Work. 
New york city taxi trip data (2010-2013) 2016."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2008.2005601"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00094"},{"key":"e_1_2_1_27_1","first-page":"2962","volume-title":"Advances in neural information processing systems","author":"Feurer M.","year":"2015"},{"key":"e_1_2_1_28_1","first-page":"3348","volume-title":"Advances in Neural Information Processing Systems","author":"Fusi N.","year":"2018"},{"key":"e_1_2_1_29_1","first-page":"1313","volume-title":"Proceedings of the National Conference on Artificial Intelligence","volume":"21","author":"Gatterbauer W."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807128.1807158"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"I. Guyon J. Weston S. Barnhill and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine learning 46(1--3):389--422 2002.","DOI":"10.1023\/A:1012487302797"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903730"},{"key":"e_1_2_1_33_1","volume-title":"CIDR","author":"Halevy A. Y.","year":"2013"},{"issue":"3","key":"e_1_2_1_34_1","first-page":"5","article-title":"Managing google's data lake: an overview of the goods system","volume":"39","author":"Halevy A. Y.","year":"2016","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_35_1","first-page":"235","volume-title":"Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference","author":"Hall M. A.","year":"1999"},{"key":"e_1_2_1_36_1","unstructured":"X. He K. Zhao and X. Chu. Automl: A survey of the state-of-the-art. 
arXiv preprint arXiv:1908.00709 2019."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-1-55860-335-6.50023-4"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/SAI.2014.6918213"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-1-55860-247-2.50037-1"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial intelligence 97(1--2):273--324 1997.","DOI":"10.1016\/S0004-3702(97)00043-X"},{"issue":"12","key":"e_1_2_1_41_1","first-page":"2150","volume":"11","author":"Kraska T.","year":"2018","journal-title":"An Interactive Data Science System. PVLDB"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882952"},{"key":"e_1_2_1_43_1","first-page":"1188","volume-title":"International Conference on Machine Learning","author":"Le Q.","year":"2014"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3242042"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2862435"},{"key":"e_1_2_1_46_1","first-page":"4","volume-title":"Feature Selection in Data Mining","author":"Liu H.","year":"2010"},{"key":"e_1_2_1_47_1","article-title":"Toward integrating feature selection algorithms for classification and clustering","author":"Liu H.","year":"2005","journal-title":"IEEE Transactions on Knowledge & Data Engineering, pages 491--502"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2016.2616161"},{"key":"e_1_2_1_49_1","unstructured":"A. Madry A. Makelov L. Schmidt D. Tsipras and A. Vladu. Towards deep learning models resistant to adversarial attacks. 
arXiv:1706.06083 2017."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2886471"},{"key":"e_1_2_1_51_1","unstructured":"P. McCullagh. What is a statistical model? Annals of statistics pages 1225--1267 2002."},{"key":"e_1_2_1_52_1","first-page":"3111","volume-title":"Advances in neural information processing systems","author":"Mikolov T.","year":"2013"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2013.21"},{"issue":"5","key":"e_1_2_1_54_1","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1017\/S1930297500002205","article-title":"Running experiments on amazon mechanical turk","volume":"5","author":"Paolacci G.","year":"2010","journal-title":"Judgment and Decision making"},{"key":"e_1_2_1_55_1","unstructured":"J. M. Phillips. Coresets and sketches. arXiv preprint arXiv:1601.00617 2016."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.5555\/2540128.2540361"},{"key":"e_1_2_1_57_1","doi-asserted-by":"crossref","unstructured":"M. Robnik-\u0160ikonja and I. Kononenko. Theoretical and empirical analysis of relieff and rrelieff. Machine learning 53(1--2):23--69 2003.","DOI":"10.1023\/A:1025667309714"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157804"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319863"},{"key":"e_1_2_1_60_1","unstructured":"E. R. Sparks A. Talwalkar M. J. Franklin M. I. Jordan and T. Kraska. 
Tupaq: An efficient planner for large-scale predictive analytic queries. arXiv preprint arXiv:1502.00068 2015."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2009.2039172"},{"key":"e_1_2_1_62_1","doi-asserted-by":"crossref","unstructured":"X. Sun and B. Bischl. Tutorial and survey on probabilistic graphical model and variational inference in deep reinforcement learning. arXiv preprint arXiv:1908.09381 2019.","DOI":"10.1109\/SSCI44817.2019.9003114"},{"key":"e_1_2_1_63_1","doi-asserted-by":"crossref","unstructured":"Y. Sun. Iterative relief for feature weighting: algorithms theories and applications. IEEE transactions on pattern analysis and machine intelligence 29(6):1035--1051 2007.","DOI":"10.1109\/TPAMI.2007.1093"},{"key":"e_1_2_1_64_1","unstructured":"C. Szegedy W. Zaremba I. Sutskever J. Bruna D. Erhan I. Goodfellow and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 2013."},{"key":"e_1_2_1_65_1","volume-title":"CIDR","author":"Terrizzano I. G.","year":"2015"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2487629"},{"key":"e_1_2_1_67_1","unstructured":"E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. 
arXiv preprint arXiv:1711.00851 2017."},{"key":"e_1_2_1_68_1","doi-asserted-by":"crossref","unstructured":"D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends\u00ae in Theoretical Computer Science 10(1--2):1--157 2014.","DOI":"10.1561\/0400000060"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.3934\/jimo.2012.8.1057"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213848"},{"key":"e_1_2_1_71_1","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1007\/978-1-4615-5725-8_8","volume-title":"Feature extraction, construction and selection","author":"Yang J.","year":"1998"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3397230.3397235","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,10,3]],"date-time":"2023-10-03T02:52:40Z","timestamp":1696301560000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3397230.3397235"}},"subtitle":["automatic relational data augmentation for machine learning"],"short-title":[],"issued":{"date-parts":[[2020,5]]},"references-count":71,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2020,5]]}},"alternative-id":["10.14778\/3397230.3397235"],"URL":"https:\/\/doi.org\/10.14778\/3397230.3397235","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2020,5]]}}}