{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,5]],"date-time":"2024-09-05T00:33:50Z","timestamp":1725496430353},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,7]]},"abstract":"Efficient data discovery is crucial in the era of data-driven decision-making. However, current practices face significant challenges due to the intricacies of identifying datasets with specific distributional characteristics, such as percentiles, when data repositories are decentralized. Traditional keyword-based search methods are insufficient for these complex requirements, often resulting in suboptimal dataset search results. To address these challenges, this paper presents Fainder, a fast and accurate index for \"percentile predicates\" on histogram-based data summaries, which streamlines the search process for datasets with specific distributional requirements. Fainder can be constructed on heterogeneous histogram collections and employs binary search in conjunction with multi-step pruning techniques to efficiently identify search results for percentile predicates. Thereby, it simplifies data provisioning and improves the effectiveness of dataset discovery. 
Empirical evaluation of our solution on three large-scale data repositories shows that Fainder is effective for distribution-aware dataset search and provides order-of-magnitude efficiency gains over baselines.","DOI":"10.14778\/3681954.3681999","type":"journal-article","created":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T16:23:36Z","timestamp":1725035016000},"page":"3269-3282","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search"],"prefix":"10.14778","volume":"17","author":[{"given":"Lennart","family":"Behme","sequence":"first","affiliation":[{"name":"BIFOLD & TU Berlin"}]},{"given":"Sainyam","family":"Galhotra","sequence":"additional","affiliation":[{"name":"Cornell University"}]},{"given":"Kaustubh","family":"Beedkar","sequence":"additional","affiliation":[{"name":"IIT Delhi"}]},{"given":"Volker","family":"Markl","sequence":"additional","affiliation":[{"name":"BIFOLD, TU Berlin & DFKI"}]}],"member":"320","published-online":{"date-parts":[[2024,8,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0389-y"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551858"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3572751.3572755"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 23rd International Conference on Extending Database Technology (EDBT '20)","author":"Behar Rachel","year":"2020","unstructured":"Rachel Behar and Sara Cohen. 2020. Optimal Histograms with Outliers. 
Proceedings of the 23rd International Conference on Extending Database Technology (EDBT '20), 181--192."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00077"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1751-5823.2010.00112.x"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457403"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MNET.2021.9387709"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-37456-2_14"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476346"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE '18)","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, et al. 2018. Aurum: A Data Discovery System. Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE '18), 1001--1012."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523223"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-019-00564-x"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Graham Cormode Minos Garofalakis Peter J. Haas and Chris Jermaine. 2011. Synopses for Massive Data: Samples Histograms Wavelets Sketches. Foundations and Trends\u00ae in Databases 4 1 1--294.","DOI":"10.1561\/1900000004"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the Joint Statistical Meetings (JSM '14)","author":"Culotta Aron","year":"2014","unstructured":"Aron Culotta. 2014. Reducing Sampling Bias in Social Media Data for County Health Inference. 
Proceedings of the Joint Statistical Meetings (JSM '14)."},{"volume-title":"Ethics of Data and Analytics","author":"Dastin Jeffrey","key":"e_1_2_1_18_1","unstructured":"Jeffrey Dastin. 2022. Amazon Scraps Secret AI Recruiting Tool that Showed Bias against Women. In Ethics of Data and Analytics. Kirsten Martin, (Ed.) Auerbach Publications, 296--299."},{"volume-title":"Find the right data, effortlessly. Retrieved","year":"2024","key":"e_1_2_1_19_1","unstructured":"Datarade. 2024. Find the right data, effortlessly. Retrieved Feb. 6, 2024 from https:\/\/datarade.ai\/."},{"volume-title":"Data Hubs & Data Spaces. Retrieved","year":"2024","key":"e_1_2_1_20_1","unstructured":"Dawex. 2024. Data Marketplaces, Data Hubs & Data Spaces. Retrieved Feb. 6, 2024 from https:\/\/www.dawex.com\/en\/."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/3529337.3529353"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517864"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00213"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626246.3654747"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00045"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.4236\/ojs.2016.63035"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903730"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.43"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588710"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/645924.671191"},{"volume-title":"Measurement, Simulation, and Modeling","author":"Jain Raj K.","key":"e_1_2_1_31_1","unstructured":"Raj K. Jain. 1991. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. 
Wiley Computer Publishing."},{"key":"e_1_2_1_32_1","volume-title":"Retrieved","author":"Kaggle Inc.","year":"2024","unstructured":"Kaggle Inc. 2024. Kaggle Datasets. Retrieved Jan. 29, 2024 from https:\/\/www.kaggle.com\/datasets."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3572751.3572757"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00047"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the BTW, 995--1008","author":"Langenecker Sven","year":"2023","unstructured":"Sven Langenecker, Christoph Sturm, Christian Schalles, and Carsten Binnig. 2023. SportsTables: A New Corpus for Semantic Type Detection. Proceedings of the BTW, 995--1008."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3022860.3022863"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982.1056489"},{"volume-title":"Introduction to Information Retrieval","author":"Manning Christopher D.","key":"e_1_2_1_38_1","unstructured":"Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch\u00fctze. 2008. Introduction to Information Retrieval. Cambridge University Press."},{"key":"e_1_2_1_39_1","first-page":"59","article-title":"Making Open Data Transparent: Data Discovery on Open Data","volume":"41","author":"Miller Renee J.","year":"2018","unstructured":"Renee J. Miller, Fatemeh Nargesian, Erkang Zhu, et al. 2018. Making Open Data Transparent: Data Discovery on Open Data. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 41, 2, 59--70.","journal-title":"Bulletin of the IEEE Computer Society Technical Committee on Data Engineering"},{"key":"e_1_2_1_40_1","volume-title":"A major flaw in Google's algorithm allegedly tagged two black people's faces with the word 'gorillas'. Business Insider. Retrieved","author":"Mulshine Molly","year":"2024","unstructured":"Molly Mulshine. 2015. 
A major flaw in Google's algorithm allegedly tagged two black people's faces with the word 'gorillas'. Business Insider. Retrieved Feb. 6, 2024 from https:\/\/www.businessinsider.com\/google-tags-black-people-as-gorillas-2015-7."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476299"},{"key":"e_1_2_1_42_1","first-page":"237","article-title":"Data Lake Organization","volume":"35","author":"Nargesian Fatemeh","year":"2023","unstructured":"Fatemeh Nargesian, Ken Pu, Bahar Ghadiri-Bashardoost, et al. 2023. Data Lake Organization. IEEE Transactions on Knowledge and Data Engineering, 35, 1, 237--250.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380605"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964909"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the World Wide Web Conference (WWW '19)","author":"Noy Natasha","year":"2019","unstructured":"Natasha Noy, Matthew Burgess, and Dan Brickley. 2019. Google Dataset Search: Building a Search Engine for Datasets in an Open Web Ecosystem. Proceedings of the World Wide Web Conference (WWW '19), 1365--1375."},{"key":"e_1_2_1_46_1","volume-title":"Retrieved","author":"Foundation Open Knowledge","year":"2022","unstructured":"Open Knowledge Foundation. 2022. CKAN. Retrieved Aug. 28, 2022 from http:\/\/ckan.org\/."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476364"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.5555\/3225634.3225774"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476353"},{"key":"e_1_2_1_51_1","volume-title":"Are Face-Detection Cameras Racist? Time. Retrieved","author":"Rose Adam","year":"2024","unstructured":"Adam Rose. 2010. Are Face-Detection Cameras Racist? Time. Retrieved Feb. 
6, 2024 from https:\/\/time.com\/archive\/6906847\/are-face-detection-cameras-racist\/."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00264"},{"key":"e_1_2_1_53_1","volume-title":"Most engineers are white --- and so are the faces they use to train software. Vox. Retrieved","author":"Townsend Tess","year":"2024","unstructured":"Tess Townsend. 2017. Most engineers are white --- and so are the faces they use to train software. Vox. Retrieved Feb. 6, 2024 from https:\/\/www.vox.com\/2017\/1\/18\/14304964\/data-facial-recognition-trouble-recognizing-black-white-faces-diversity."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3456859.3456861"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462909"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626756"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.5555\/2481562.2481564"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00154-4"},{"key":"e_1_2_1_59_1","first-page":"67","article-title":"Keyword Search in Relational Databases: A Survey","volume":"33","author":"Yu Jeffrey Xu","year":"2010","unstructured":"Jeffrey Xu Yu, Lu Qin, and Lijun Chang. 2010. Keyword Search in Relational Databases: A Survey. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 33, 1, 67--78.","journal-title":"Bulletin of the IEEE Computer Society Technical Committee on Data Engineering"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482427"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178876.3186067"},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the ACM International Conference on Management of Data (SIGMOD '20), 1951","author":"Zhang Yi","year":"1966","unstructured":"Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. 
Proceedings of the ACM International Conference on Management of Data (SIGMOD '20), 1951--1966."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611498"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3681954.3681999","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T18:34:46Z","timestamp":1725474886000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3681954.3681999"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7]]},"references-count":63,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["10.14778\/3681954.3681999"],"URL":"https:\/\/doi.org\/10.14778\/3681954.3681999","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2024,7]]},"assertion":[{"value":"2024-08-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}