{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,6,24]],"date-time":"2024-06-24T18:10:10Z","timestamp":1719252610848},"reference-count":45,"publisher":"MIT Press","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computational Linguistics"],"published-print":{"date-parts":[[2017,9]]},"abstract":"Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: Some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. 
Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.","DOI":"10.1162\/coli_a_00293","type":"journal-article","created":{"date-parts":[[2017,6,9]],"date-time":"2017-06-09T17:06:51Z","timestamp":1497028011000},"page":"567-592","source":"Crossref","is-referenced-by-count":6,"title":["A Kernel Independence Test for Geographical Language Variation"],"prefix":"10.1162","volume":"43","author":[{"given":"Dong","family":"Nguyen","sequence":"first","affiliation":[{"name":"University of Twente"}]},{"given":"Jacob","family":"Eisenstein","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology"}]}],"member":"281","reference":[{"key":"bib1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1538-4632.1995.tb00338.x"},{"key":"bib2","doi-asserted-by":"publisher","DOI":"10.1162\/153244303768966085"},{"key":"bib5","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1995.tb02031.x"},{"key":"bib6","unstructured":"Bouchard-C\u00f4t\u00e9, Alexandre, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887\u2013896, Prague."},{"key":"bib9","doi-asserted-by":"crossref","unstructured":"Collins, Michael and Nigel Duffy. 2001. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14, pages 625\u2013632, Vancouver.","DOI":"10.7551\/mitpress\/1120.003.0085"},{"key":"bib10","doi-asserted-by":"publisher","DOI":"10.1007\/BF00892986"},{"key":"bib11","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/E14-1011"},{"key":"bib12","doi-asserted-by":"publisher","DOI":"10.2307\/1400508"},{"key":"bib13","unstructured":"Eisenstein, Jacob, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. 
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277\u20131287, Cambridge, MA."},{"key":"bib14","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0113114"},{"key":"bib16","unstructured":"Fukumizu, Kenji, Arthur Gretton, Xiaohai Sun, and Bernhard Sch\u00f6lkopf. 2007. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489\u2013496, Vancouver."},{"key":"bib18","doi-asserted-by":"publisher","DOI":"10.1111\/j.1538-4632.1992.tb00261.x"},{"key":"bib19","unstructured":"Goeman, Ton. 1999. T-deletie in Nederlandse dialecten; kwantitatieve analyse van structurele, ruimtelijke en temporele variatie. Ph.D. thesis, Vrije Universiteit Amsterdam."},{"key":"bib20","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fql038"},{"key":"bib22","unstructured":"Gretton, Arthur, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Sch\u00f6lkopf, and Alex J. Smola. 2008. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585\u2013592, Vancouver."},{"key":"bib23","unstructured":"Gretton, Arthur, Ralf Herbrich, Alexander Smola, Olivier Bousquet, and Bernhard Sch\u00f6lkopf. 2005b. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075\u20132129."},{"key":"bib24","unstructured":"Gretton, Arthur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. 2012. Optimal kernel choice for large-scale two-sample tests. 
In Advances in Neural Information Processing Systems 25, pages 1205\u20131213, Lake Tahoe, NV."},{"key":"bib27","doi-asserted-by":"publisher","DOI":"10.1017\/S095439451100007X"},{"key":"bib28","doi-asserted-by":"publisher","DOI":"10.1017\/jlg.2013.3"},{"key":"bib29","doi-asserted-by":"publisher","DOI":"10.1017\/S0954394501133041"},{"key":"bib30","doi-asserted-by":"publisher","DOI":"10.1145\/2187836.2187940"},{"key":"bib31","doi-asserted-by":"publisher","DOI":"10.1145\/2736277.2741141"},{"key":"bib32","doi-asserted-by":"publisher","DOI":"10.1016\/j.compenvurbsys.2015.12.003"},{"key":"bib33","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K15-1011"},{"key":"bib34","doi-asserted-by":"publisher","DOI":"10.1111\/j.1538-4632.1984.tb00797.x"},{"key":"bib35","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/17.4.401"},{"key":"bib36","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-015-0427-9"},{"key":"bib37","doi-asserted-by":"publisher","DOI":"10.1080\/02693799308901981"},{"key":"bib38","doi-asserted-by":"publisher","DOI":"10.1111\/2041-210X.12425"},{"key":"bib39","unstructured":"Lodhi, Huma, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2(Feb):419\u2013444."},{"key":"bib40","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/37.1-2.17"},{"key":"bib41","doi-asserted-by":"crossref","unstructured":"Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch\u00f6lkopf. 2016. Kernel mean embedding of distributions: A review and beyond. 
arXiv preprint arXiv:1605.09522.","DOI":"10.1561\/9781680832891"},{"key":"bib42","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fqs062"},{"key":"bib43","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1975.10480272"},{"key":"bib44","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2014.04.029"},{"key":"bib45","unstructured":"Roller, Stephen, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1500\u20131510, Jeju Island."},{"key":"bib46","unstructured":"Scherrer, Yves. 2012. Recovering dialect geography from an unaligned comparable corpus. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 63\u201371, Avignon."},{"key":"bib48","unstructured":"Song, Le, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. 2012. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393\u20131434."},{"key":"bib49","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/fql043"},{"key":"bib50","unstructured":"\u0160tajner, Sanja and Ruslan Mitkov. 2011. Diachronic stylistic changes in British and American varieties of 20th century written English language. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage at RANLP, pages 78\u201385, Hissar."},{"key":"bib52","unstructured":"Tjong Kim Sang, Erik. 2015. Discovering dialect regions in syntactic dialect data. In Workshop European Dialect Syntax VIII - Edisyn 2015, Zurich."},{"key":"bib53","unstructured":"Wing, Benjamin and Jason Baldridge. 2011. Simple supervised document geolocation with geodesic grids. 
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 955\u2013964, Portland, OR."},{"key":"bib54","doi-asserted-by":"publisher","DOI":"10.1098\/rspb.1996.0128"},{"key":"bib55","doi-asserted-by":"publisher","DOI":"10.1162\/NECO_a_00537"},{"key":"bib56","doi-asserted-by":"crossref","unstructured":"Zhang, Qinyi, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. 2016. Large-scale kernel methods for independence testing. arXiv preprint arXiv:1606.07892.","DOI":"10.1007\/s11222-016-9721-7"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/COLI_a_00293","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,24]],"date-time":"2024-06-24T17:56:39Z","timestamp":1719251799000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/43\/3\/567-592\/1577"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,9]]},"references-count":45,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2017,9]]}},"alternative-id":["10.1162\/COLI_a_00293"],"URL":"https:\/\/doi.org\/10.1162\/coli_a_00293","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,9]]}}}