CatCoLA, Catalan Corpus of Linguistic Acceptability
Abstract
We introduce CatCoLA, the Catalan Corpus of Linguistic Acceptability, which will contribute to the Catalan Language Understanding Benchmark (CLUB) for assessing and comparing the capabilities of language models (LMs) trained on Catalan text. CatCoLA follows the design of the English CoLA and supports the task of classifying sentences as acceptable or unacceptable. Because the task depends heavily on the characteristics of each language, datasets cannot be translated from one language to another, and making such datasets available for different languages requires language-specific development. CatCoLA consists of 10,443 sentences and their acceptability judgements, as found in well-known Catalan reference grammars. Additionally, following previous practice, every sentence has been annotated with the class of linguistic phenomenon it exemplifies. As task baselines, we also provide the results of fine-tuning four different language models on this dataset and the results of a human annotation experiment. We analyze and discuss the results to guide future research. CatCoLA is released under a CC BY-SA 4.0 licence and is freely available at https://doi.org/10.34810/data1393.