On the Impact of Cross-Domain Data on German Language Models

Dada, Amin; Chen, Aokun; Peng, Cheng; Smith, Kaleb E; Idrissi-Yaghir, Ahmad; Seibold, Constantin Marc; Li, Jianning; Heiliger, Lars; Yang, Xi; Friedrich, Christoph M.; Truhn, Daniel; Egger, Jan; Bian, Jiang; Kleesiek, Jens; Wu, Yonghui

arXiv:2310.07321 (cs)

[Submitted on 11 Oct 2023 (v1), last revised 13 Oct 2023 (this version, v2)]

Abstract: Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to $4.45\%$ over the previous state-of-the-art. The models are available at this https URL

Comments:	13 pages, 1 figure, accepted at Findings of the Association for Computational Linguistics: EMNLP 2023
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2310.07321 [cs.CL]
	(or arXiv:2310.07321v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.07321

Submission history

From: Amin Dada
[v1] Wed, 11 Oct 2023 09:09:55 UTC (8,242 KB)
[v2] Fri, 13 Oct 2023 14:24:31 UTC (8,242 KB)
