{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,10,22]],"date-time":"2024-10-22T23:40:17Z","timestamp":1729640417911,"version":"3.28.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,5,26]]},"abstract":"The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. 
We make the corpus and code available at https:\/\/gittables.github.io.","DOI":"10.1145\/3588710","type":"journal-article","created":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T17:42:05Z","timestamp":1685468525000},"page":"1-17","source":"Crossref","is-referenced-by-count":12,"title":["GitTables: A Large-Scale Corpus of Relational Tables"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-0949-7290","authenticated-orcid":false,"given":"Madelon","family":"Hulsebos","sequence":"first","affiliation":[{"name":"University of Amsterdam, Amsterdam, Netherlands"}]},{"ORCID":"http:\/\/orcid.org\/0009-0003-2080-0443","authenticated-orcid":false,"given":"\u00c7agatay","family":"Demiralp","sequence":"additional","affiliation":[{"name":"Sigma Computing, San Francisco, CA, USA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-0183-6910","authenticated-orcid":false,"given":"Paul","family":"Groth","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2023,5,30]]},"reference":[{"volume-title":"DBpedia: A nucleus for a web of open data. ISWC","year":"2007","author":"Auer S\u00f6ren","key":"e_1_2_2_1_1","unstructured":"S\u00f6ren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. ISWC (2007), 722--735."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442188.3445922"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2501511.2501516"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00158"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_2_2_6_1","unstructured":"Tom B Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. 
Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3240492"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453916"},{"volume-title":"Eugene Wu, and Yang Zhang.","year":"2008","author":"Cafarella Michael J.","key":"e_1_2_2_9_1","unstructured":"Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008b. WebTables: Exploring the Power of Tables on the Web. PVLDB (2008), 538--549."},{"volume-title":"Daisy Zhe Wang, and Eugene Wu","year":"2008","author":"Cafarella Michael J","key":"e_1_2_2_10_1","unstructured":"Michael J Cafarella, Alon Y Halevy, Yang Zhang, Daisy Zhe Wang, and Eugene Wu. 2008c. Uncovering the Relational Web.. In WebDB. 1--6."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-62466-8_21"},{"volume-title":"Juan Sequeda, Kavitha Srinivas, Nora Abdelmageed, Madelon Hulsebos, Daniela Oliveira, and Catia Pesquita.","year":"2021","author":"Cutrona Vincenzo","key":"e_1_2_2_13_1","unstructured":"Vincenzo Cutrona, Jiaoyan Chen, Vasilis Efthymiou, Oktie Hassanzadeh, Ernesto Jim\u00e9nez-Ruiz, Juan Sequeda, Kavitha Srinivas, Nora Abdelmageed, Madelon Hulsebos, Daniela Oliveira, and Catia Pesquita. 2021. Results of SemTab 2021. In Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021 (CEUR Workshop Proceedings, Vol. 3103). CEUR-WS.org, 1--12. 
http:\/\/ceur-ws.org\/Vol-3103\/paper0.pdf"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430921"},{"key":"e_1_2_2_16_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. 4171--4186."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/BDC.2015.30"},{"key":"e_1_2_2_18_1","unstructured":"Daniele Faraglia and Other Contributors. 2014. Faker. https:\/\/github.com\/joke2k\/faker"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2844544"},{"key":"e_1_2_2_20_1","unstructured":"Kevin Hu Neil Gaikwad Michiel Bakker Madelon Hulsebos Emanuel Zgraggen C\u00e9sar Hidalgo Tim Kraska Guoliang Li Arvind Satyanarayan and \u00c7a\u011fatay Demiralp. 2019. VizNet: Towards a large-scale visualization learning and benchmarking repository. In CHI. ACM."},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330993"},{"key":"e_1_2_2_22_1","volume-title":"CEUR Workshop Proceedings","volume":"2775","author":"Jimenez-Ruiz Ernesto","year":"2020","unstructured":"Ernesto Jimenez-Ruiz, Oktie Hassanzadeh, Vasilis Efthymiou, Jiaoyan Chen, Kavitha Srinivas, and Vincenzo Cutrona. 2020. Results of SemTab 2020. In CEUR Workshop Proceedings, Vol. 2775. 1--8."},{"volume-title":"Dataset Reuse: Toward Translating Principles to Practice. 
Patterns","year":"2020","author":"Koesten Laura","key":"e_1_2_2_23_1","unstructured":"Laura Koesten, Pavlos Vougiouklis, Elena Simperl, and Paul Groth. 2020. Dataset Reuse: Toward Translating Principles to Practice. Patterns (2020), 100136."},{"volume-title":"Towards Learned Metadata Extraction for Data Lakes. BTW 2021","year":"2021","author":"Langenecker Sven","key":"e_1_2_2_24_1","unstructured":"Sven Langenecker, Christoph Sturm, Christian Schalles, and Carsten Binnig. 2021. Towards Learned Metadata Extraction for Data Lakes. BTW 2021 (2021)."},{"volume-title":"Deep learning. nature","year":"2015","author":"LeCun Yann","key":"e_1_2_2_25_1","unstructured":"Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature, Vol. 521, 7553 (2015), 436."},{"key":"e_1_2_2_26_1","doi-asserted-by":"crossref","unstructured":"Oliver Lehmberg Dominique Ritze Robert Meusel and Christian Bizer. 2016. A Large Public Corpus of Web Tables Containing Time and Context Metadata. In WWW Companion. 75--76.","DOI":"10.1145\/2872518.2889386"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098102"},{"key":"e_1_2_2_28_1","first-page":"1","article-title":"pandas: a foundational Python library for data analysis and statistics","volume":"14","author":"McKinney Wes","year":"2011","unstructured":"Wes McKinney et al. 2011. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, Vol. 14, 9 (2011), 1--9.","journal-title":"Python for High Performance and Scientific Computing"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/OBD.2016.18"},{"volume-title":"The CTU prague relational learning repository. arXiv preprint arXiv:1511.03086","year":"2015","author":"Motl Jan","key":"e_1_2_2_30_1","unstructured":"Jan Motl and Oliver Schulte. 2015. The CTU prague relational learning repository. 
arXiv preprint arXiv:1511.03086 (2015)."},{"key":"e_1_2_2_31_1","unstructured":"Hannes M\u00fchleisen and Christian Bizer. 2012. Web Data Commons - extracting structured data from two large web corpora. In LDOW."},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964909"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1038\/538127a"},{"key":"e_1_2_2_34_1","unstructured":"Plotly. 2018. Plotly Community Feed. https:\/\/chart-studio.plotly.com\/feed\/"},{"volume-title":"Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. 33rd Conference on Neural Information Processing Systems","year":"2019","author":"Rabanser Stephan","key":"e_1_2_2_35_1","unstructured":"Stephan Rabanser, Stephan G\u00fcnnemann, and Zachary C Lipton. 2019. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. 33rd Conference on Neural Information Processing Systems (2019)."},{"volume-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR","year":"2020","author":"Raffel Colin","key":"e_1_2_2_36_1","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR (2020)."},{"key":"e_1_2_2_37_1","first-page":"19","article-title":"Matching web tables to DBpedia -- a feature utility study","volume":"42","author":"Ritze Dominique","year":"2017","unstructured":"Dominique Ritze and Christian Bizer. 2017. Matching web tables to DBpedia -- a feature utility study. EDBT, Vol. 42, 41 (2017), 19.","journal-title":"EDBT"},{"key":"e_1_2_2_38_1","unstructured":"Dominique Ritze Oliver Lehmberg and Christian Bizer. 2021. T2Dv2 Gold Standard for Matching Web Tables to DBpedia. http:\/\/webdatacommons.org\/webtables\/goldstandardV2.html Accessed: 01-05--2021."},{"key":"e_1_2_2_39_1","unstructured":"Jeni Tennison. 2016. 
CSV on the web: A primer. http:\/\/www.w3.org\/TR\/2016\/NOTE-tabular-data-primer-20160225\/"},{"key":"e_1_2_2_40_1","unstructured":"Princeton University. 2010. About WordNet. https:\/\/wordnet.princeton.edu"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-019-00646-y"},{"volume-title":"Jesse Kriss, and Matt McKeon.","year":"2007","author":"Viegas Fernanda B","key":"e_1_2_2_42_1","unstructured":"Fernanda B Viegas, Martin Wattenberg, Frank Van Ham, Jesse Kriss, and Matt McKeon. 2007. Manyeyes: a site for visualization at internet scale. IEEE transactions on visualization and computer graphics, Vol. 13, 6 (2007), 1121--1128."},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58580-8_43"},{"volume-title":"Xin Luna Dong, and Meng Jiang","year":"2021","author":"Wang Daheng","key":"e_1_2_2_44_1","unstructured":"Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, and Meng Jiang. 2021. TCN: Table Convolutional Network for Web Table Interpretation. arXiv preprint arXiv:2102.09460 (2021)."},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3003665.3003669"},{"volume-title":"WDC Web Table Corpus","year":"2012","key":"e_1_2_2_46_1","unstructured":"WebDataCommons. 2021. WDC Web Table Corpus 2012. http:\/\/webdatacommons.org\/webtables\/2012\/relationalStatistics.html"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476393"},{"key":"e_1_2_2_48_1","unstructured":"Pengcheng Yin Graham Neubig Wen-tau Yih and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL."},{"key":"e_1_2_2_49_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3372117","article-title":"Web table extraction, retrieval, and augmentation: A survey","volume":"11","author":"Zhang Shuo","year":"2020","unstructured":"Shuo Zhang and Krisztian Balog. 2020. Web table extraction, retrieval, and augmentation: A survey. 
ACM Transactions on Intelligent Systems and Technology (TIST), Vol. 11, 2 (2020), 1--35.","journal-title":"ACM Transactions on Intelligent Systems and Technology (TIST)"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3588710","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,22]],"date-time":"2024-10-22T23:14:04Z","timestamp":1729638844000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3588710"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,26]]},"references-count":49,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,5,26]]}},"alternative-id":["10.1145\/3588710"],"URL":"https:\/\/doi.org\/10.1145\/3588710","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2023,5,26]]}}}