Abstract
This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa comprises approx. 1.7 million tokens, making it one of the largest heritage language corpora at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource, as Danish is mixed with American English, Canadian English or Argentine Spanish, respectively, in every recording. The aim of this note is to document relevant aspects and specifications of the CoAmDa, viz. the audio data, the sociodemographic metadata of the speakers, the digitization of analog data, the transcription procedures, the format and tagging of the speech files, and the internal validation procedures. In so doing, we wish to share our experience and best practices for building a high-quality spoken language resource with the interested public, in particular other researchers working on and with multilingual speech corpora.
Notes
Many thanks to the reviewers who provided useful input and feedback on earlier versions of this paper, and to Inger Mees for valuable input to the present version.
The name is somewhat misleading, as this part of the corpus contains recordings from Argentina only. At one point during corpus creation, integrating data from Brazil was also considered, but this did not work out in the end. This is the origin of the name, which has by now become established through use and cannot be changed without risking confusion. Data from other South American countries may still be added at a later point in time, which would justify the name.
As a stand-alone corpus, the LANCHART corpus is, for the time being, not embedded in an overarching digital infrastructure. It was compiled very early (from 2005 onwards), when there were only few established standards to follow.
The name ‘language tier’ is misleading in the sense that the tier also contains tags that do not denote a specific language, such as tags for proper nouns, hesitation phenomena, etc. The naming follows from the established software and naming conventions that the CoAmDa has adopted from the LANCHART Center.
The conversion program 'trs2praat.exe' was written in C#. It is available upon request from Gert Foget Hansen, but it should be noted that it was designed to convert Transcriber files that adhere to the LANCHART transcription guidelines to Praat.
Assigning a language to each sentence on the basis of the language tags of its grammatical subject and finite verb is admittedly a very coarse and categorical measure of language use. With this tagging, we do not wish to make any claim as to whether a specific passage is perceived by the speaker as being Danish, another language, or a mixture of several languages.
For other studies, the interested reader is referred to the homepage of the Danish Voices project (https://danishvoices.ku.dk/) or to the homepages of the authors of this paper.
Acknowledgements
This paper was written within the framework of the research project ‘Danish Voices in the Americas’ (University of Copenhagen, 2014–2018), funded by the A.P. Møller og Hustru Chastine Mc-Kinney Møllers Fond til almene Formaal, the Carlsberg Foundation (as a Semper Ardens project) and the Faculty of Humanities at the University of Copenhagen. We wish to thank Professor emer. Inger Kjær for her generous donation of the recordings collected by Iver Kjær (1938–2002) and Mogens Baumann Larsen (1930–2001), Professor Christopher Hale (University of Alberta) for the donation of his recordings from New Denmark (Canada), Professor Tore Kristiansen for contributing his recordings from Solvang, California, and Anne Nesser, formerly of Aarhus University, for the donation of the recordings from the DANA project.
Cite this article
Kühl, K., Petersen, J.H. & Hansen, G.F. The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America. Lang Resources & Evaluation 54, 831–849 (2020). https://doi.org/10.1007/s10579-019-09473-5
DOI: https://doi.org/10.1007/s10579-019-09473-5