{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,8]],"date-time":"2024-08-08T00:14:30Z","timestamp":1723076070887},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,5,29]]},"abstract":"Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas.<\/jats:p>\n In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, seven million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of \\corpus to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.<\/jats:p>","DOI":"10.1145\/3654975","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T13:44:53Z","timestamp":1717076693000},"page":"1-25","source":"Crossref","is-referenced-by-count":1,"title":["SchemaPile: A Large Collection of Relational Database Schemas"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"http:\/\/orcid.org\/0000-0001-7483-0099","authenticated-orcid":false,"given":"Till","family":"D\u00f6hmen","sequence":"first","affiliation":[{"name":"University of Amsterdam, Amsterdam, NL"}]},{"ORCID":"http:\/\/orcid.org\/0009-0009-6728-3379","authenticated-orcid":false,"given":"Radu","family":"Geacu","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, NL"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-0949-7290","authenticated-orcid":false,"given":"Madelon","family":"Hulsebos","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, USA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-4722-5840","authenticated-orcid":false,"given":"Sebastian","family":"Schelter","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, NL"}]}],"member":"320","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_2_1","unstructured":"Andi Albrecht. 2023. python-sqlparse -- a non-validating SQL parser for Python. https:\/\/github.com\/andialbrecht\/sqlparse"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453916"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00094"},{"key":"e_1_2_1_5_1","volume-title":"SchemaDB: Structures in relational datasets. arXiv preprint arXiv:2111.12835","author":"Christopher Cody James","year":"2021","unstructured":"Cody James Christopher, Kristen Moore, and David Liebowitz. 2021. SchemaDB: Structures in relational datasets. arXiv preprint arXiv:2111.12835 (2021)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3085504.3085520"},{"key":"e_1_2_1_7_1","volume-title":"Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. arXiv preprint arXiv:2201.09745","author":"Dong Haoyu","year":"2022","unstructured":"Haoyu Dong, Zhoujun Cheng, Xinyi He, Mengyu Zhou, Anda Zhou, Fan Zhou, Ao Liu, Shi Han, and Dongmei Zhang. 2022. Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. arXiv preprint arXiv:2201.09745 (2022)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i11.26499"},{"key":"e_1_2_1_9_1","unstructured":"Will Eaton. 2020. sqloxide. https:\/\/github.com\/wseaton\/sqloxide."},{"key":"e_1_2_1_10_1","volume-title":"On the Naming of Database Objects in the SQL Databases of Some Existing Software. In Computer Science On-line Conference. Springer, 534--550","author":"Eessaar Erki","year":"2023","unstructured":"Erki Eessaar. 2023. On the Naming of Database Objects in the SQL Databases of Some Existing Software. In Computer Science On-line Conference. Springer, 534--550."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611527"},{"key":"e_1_2_1_12_1","unstructured":"Common Crawl Foundation. 2023. Common Crawl -- a free open repository of web crawl data that can be used by anyone. https:\/\/commoncrawl.org"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","unstructured":"Bogdan Ghita Peter Boncz and Diego Tom\u00e9. 2019. Public BI benchmark - part 1. https:\/\/doi.org\/10.5281\/zenodo.6277287","DOI":"10.5281\/zenodo.6277287"},{"key":"e_1_2_1_14_1","unstructured":"Google. 2023. Google Natural Language AI. https:\/\/cloud.google.com\/natural-language"},{"key":"e_1_2_1_15_1","first-page":"1","article-title":"Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning","volume":"1","author":"Gu Zihui","year":"2023","unstructured":"Zihui Gu, Ju Fan, Nan Tang, Lei Cao, Bowen Jia, Sam Madden, and Xiaoyong Du. 2023. Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. Proceedings of the ACM on Management of Data, Vol. 1, 2 (2023), 1--28.","journal-title":"Proceedings of the ACM on Management of Data"},{"key":"e_1_2_1_16_1","volume-title":"Lora: Low-rank adaptation of large language models. ICLR","author":"Hu Edward J","year":"2022","unstructured":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. ICLR (2022)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588710"},{"key":"e_1_2_1_18_1","unstructured":"GitHub Inc. 2022. The GitHub Search API. https:\/\/docs.github.com\/en\/rest\/reference\/search."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882957"},{"key":"e_1_2_1_20_1","volume-title":"Compressing text classification models. arXiv preprint arXiv:1612.03651","author":"Joulin Armand","year":"2016","unstructured":"Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, H\u00e9rve J\u00e9gou, and Tomas Mikolov. 2016b. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016)."},{"key":"e_1_2_1_21_1","volume-title":"Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759","author":"Joulin Armand","year":"2016","unstructured":"Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016a. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016)."},{"key":"e_1_2_1_22_1","unstructured":"jsqlparse. https:\/\/github.com\/JSQLParser\/JSqlParser"},{"key":"e_1_2_1_23_1","volume-title":"CHORUS: Foundation Models for Unified Data Discovery and Exploration. arXiv preprint arXiv:2306.09610","author":"Kayali Moe","year":"2023","unstructured":"Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2023. CHORUS: Foundation Models for Unified Data Discovery and Exploration. arXiv preprint arXiv:2306.09610 (2023)."},{"key":"e_1_2_1_24_1","volume-title":"Jia Li, Chenghao Mou, Carlos Mu\u00f1oz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries.","author":"Kocetkov Denis","year":"2022","unstructured":"Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Mu\u00f1oz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022a. The Stack: 3 TB of permissively licensed source code. arxiv: 2211.15533 [cs.CL]"},{"key":"e_1_2_1_25_1","volume-title":"Jia Li, Chenghao Mou, Carlos Mu\u00f1oz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries.","author":"Kocetkov Denis","year":"2022","unstructured":"Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Mu\u00f1oz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022b. The Stack: 3 TB of permissively licensed source code. Preprint (2022)."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00047"},{"key":"e_1_2_1_27_1","volume-title":"Language Model Guided Transpilation. arXiv preprint arXiv:2309.14396","author":"Lee Celine","year":"2023","unstructured":"Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni, David Brooks, Stephen Chong, Gu-Yeon Wei, and Alexander M Rush. 2023. Guess & Sketch: Language Model Guided Transpilation. arXiv preprint arXiv:2309.14396 (2023)."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2872518.2889386"},{"key":"e_1_2_1_29_1","volume-title":"2023 b. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. arXiv preprint arXiv:2305.03111","author":"Li Jinyang","year":"2023","unstructured":"Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, et al. 2023 b. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. arXiv preprint arXiv:2305.03111 (2023)."},{"key":"e_1_2_1_30_1","unstructured":"Raymond Li Loubna Ben Allal Yangtian Zi Niklas Muennighoff Denis Kocetkov Chenghao Mou Marc Marone Christopher Akiki Jia Li Jenny Chim Qian Liu Evgenii Zheltonozhskii Terry Yue Zhuo Thomas Wang Olivier Dehaene Mishig Davaadorj Joel Lamy-Poirier Jo\u00e3o Monteiro Oleh Shliazhko Nicolas Gontier Nicholas Meade Armel Zebaze Ming-Ho Yee Logesh Kumar Umapathi Jian Zhu Benjamin Lipkin Muhtasham Oblokulov Zhiruo Wang Rudra Murthy Jason Stillerman Siva Sankalp Patel Dmitry Abulkhanov Marco Zocca Manan Dey Zhihan Zhang Nour Fahmy Urvashi Bhattacharyya Wenhao Yu Swayam Singh Nadav Timor Jennifer Ding Claire Schlesinger Hailey Schoelkopf Jan Ebert Tri Dao Mayank Mishra Alex Gu Jennifer Robinson Carolyn Jane Anderson Brendan Dolan-Gavitt Danish Contractor Siva Reddy Daniel Fried Dzmitry Bahdanau Yacine Jernite Carlos Mu\u00f1oz Ferrandis Sean Hughes Thomas Wolf Arjun Guha Leandro von Werra and Harm de Vries. 2023 a. StarCoder: may the source be with you!arxiv: 2305.06161 [cs.CL]"},{"key":"e_1_2_1_31_1","unstructured":"Microsoft. 2023. Presidio Analyzer. https:\/\/microsoft.github.io\/presidio\/analyzer\/"},{"key":"e_1_2_1_32_1","volume-title":"The CTU Prague Relational Learning Repository. arXiv preprint arXiv:1511.03086","author":"Motl Jan","year":"2015","unstructured":"Jan Motl and Oliver Schulte. 2015. The CTU Prague Relational Learning Repository. arXiv preprint arXiv:1511.03086 (2015)."},{"key":"e_1_2_1_33_1","volume-title":"Can Foundation Models Wrangle Your Data? VLDB","author":"Narayan Avanika","year":"2022","unstructured":"Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, and Christopher R\u00e9. 2022. Can Foundation Models Wrangle Your Data? VLDB (2022)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-38919-2_35"},{"key":"e_1_2_1_36_1","unstructured":"pglast. https:\/\/github.com\/lelit\/pglast."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3320212"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_2_1_39_1","unstructured":"Alexandra Rostin Oliver Albrecht Jana Bauckmann Felix Naumann and Ulf Leser. 2009. A machine learning approach to foreign key discovery.. In WebDB."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457274"},{"key":"e_1_2_1_42_1","unstructured":"simple-ddl parser https:\/\/github.com\/xnuinside\/simple-ddl-parser"},{"key":"e_1_2_1_43_1","unstructured":"sqlparse https:\/\/github.com\/andialbrecht\/sqlparse."},{"key":"e_1_2_1_44_1","unstructured":"sqlparser rs https:\/\/github.com\/sqlparser-rs\/sqlparser-rs"},{"key":"e_1_2_1_45_1","first-page":"3","article-title":"Data Integration: The Current Status and the Way Forward","volume":"41","author":"Stonebraker Michael","year":"2018","unstructured":"Michael Stonebraker, Ihab F Ilyas, et al. 2018. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. , Vol. 41, 2 (2018), 3--9.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_46_1","unstructured":"tidb. https:\/\/github.com\/pingcap\/tidb"},{"key":"e_1_2_1_47_1","unstructured":"George Sittas Toby Mao. 2022. SQLGlot. https:\/\/sqlglot.com\/"},{"key":"e_1_2_1_48_1","volume-title":"Information technology and libraries","author":"Veen Theo Van","year":"2019","unstructured":"Theo Van Veen. 2019. Wikidata. Information technology and libraries , Vol. 38, 2 (2019), 72--81."},{"key":"e_1_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Petros Venetis Alon Y Halevy Jayant Madhavan Marius Pasca Warren Shen Fei Wu and Gengxin Miao. 2011. Recovering semantics of tables on the web. (2011).","DOI":"10.14778\/2002938.2002939"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.14778\/3594512.3594518"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3520152"},{"key":"e_1_2_1_52_1","volume-title":"Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB","author":"Vogel Liane","year":"2023","unstructured":"Liane Vogel and Carsten Binnig. 2023. WikiDBs: A Corpus of Relational Databases From Wikidata. In Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, August 28 - September 1, 2023 (CEUR Workshop Proceedings, Vol. 3462). CEUR-WS.org. https:\/\/ceur-ws.org\/Vol-3462\/TADA3.pdf"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209950.3209952"},{"key":"e_1_2_1_54_1","volume-title":"Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning. In NeurIPS 2022 First Table Representation Workshop.","author":"Vos David","year":"2022","unstructured":"David Vos, Till D\u00f6hmen, and Sebastian Schelter. 2022. Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning. In NeurIPS 2022 First Table Representation Workshop."},{"key":"e_1_2_1_55_1","volume-title":"Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong.","author":"Yu Tao","year":"2020","unstructured":"Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Grappa: Grammar-augmented pre-training for table semantic parsing. arXiv preprint arXiv:2009.13845 (2020)."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1425"},{"key":"e_1_2_1_57_1","unstructured":"Zenodo. 2023. Zenodo - Research Shared. https:\/\/zenodo.org. https:\/\/zenodo.org"},{"key":"e_1_2_1_58_1","unstructured":"Yiyun Zhao Jiarong Jiang Yiqun Hu Wuwei Lan Henry Zhu Anuj Chauhan Alexander Li Lin Pan Jun Wang Chung-Wei Hang et al. 2022. Importance of synthesizing high-quality data for text-to-sql parsing. arXiv preprint arXiv:2212.08785 (2022)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3654975","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,7]],"date-time":"2024-08-07T22:51:51Z","timestamp":1723071111000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654975"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,29]]},"references-count":58,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,29]]}},"alternative-id":["10.1145\/3654975"],"URL":"https:\/\/doi.org\/10.1145\/3654975","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2024,5,29]]}}}