Abstract
We introduce three resources to support research on political texts in Scandinavia. The encoder-decoder transformer models sp-t5 and sp-t5-keyword were trained on political texts. The nor-pvi data set (available at https://tinyurl.com/nor-pvi) comprises political viewpoints, stances, and summaries for Norwegian. Experiments on four distinct tasks show that large-scale models such as nort5 perform slightly better. Still, sp-t5 and sp-t5-keyword perform almost on par while requiring much less data and computation.
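As an illustration only (not taken from the paper), here is a minimal sketch of how a text-to-text checkpoint like sp-t5 might be queried for summarization with the Hugging Face transformers library. Both the model ID and the Norwegian task prefix are assumptions; the published checkpoint name and prompt format may differ.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical hub ID for the paper's sp-t5 checkpoint; replace with
# the actual published name if/when available.
model_id = "sp-t5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# "oppsummer:" (Norwegian for "summarize:") is an assumed task prefix,
# not documented in the abstract.
text = "oppsummer: <norsk parlamentstale her>"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```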
Notes
- 1.
https://www.stortinget.no/ (Accessed on 28 January 2024).
- 2.
For a more comprehensive treatment, see [4].
- 4.
For evaluation purposes, these texts are excluded from the training data.
- 5.
We use both the GPT-3.5 and GPT-4 versions at https://chat.openai.com/.
- 9.
Details at https://doi.org/10.7910/DVN/L4OAKN.
- 12.
The model was trained from an mT5 checkpoint for 500K steps, mainly on the NCC dataset. See https://huggingface.co/north/t5_base_NCC; a minimal loading sketch follows these notes.
- 13.
TPUs (Tensor Processing Units) are specialized computing accelerators offered through Google Cloud.
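As a companion to note 12, here is a minimal sketch of loading that public checkpoint with the Hugging Face transformers library. The checkpoint ID is taken from the URL in the note; everything else is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# north/t5_base_NCC: an mT5 checkpoint continued for 500K steps,
# mainly on the Norwegian Colossal Corpus (NCC); see note 12.
tokenizer = AutoTokenizer.from_pretrained("north/t5_base_NCC")
model = AutoModelForSeq2SeqLM.from_pretrained("north/t5_base_NCC")
```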
References
Borovikova, M., Ferré, A., Bossy, R., Roche, M., Nédellec, C.: Could Keyword Masking Strategy Improve Language Model? In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds.) NLDB 2023. LNCS, vol. 13913, pp. 271–284. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-35320-8_19
Lin, C.Y.: Looking for a Few Good Metrics: ROUGE and its Evaluation. In: Proceedings of the 4th NTCIR Workshops (2004)
Djemili, S., Longhi, J., Marinica, C., Kotzinos, D., Sarfati, G.E.: What does Twitter have to say about Ideology? In: NLP 4 CMC: Natural Language Processing for Computer-Mediated Communication/Social Media-Pre-conference Workshop at Konvens 2014. vol. 1. Universitätsverlag Hildesheim (2014)
Doan, T.M., Gulla, J.A.: A survey on political viewpoints identification. Online Soc. Networks Media 30 (2022). https://doi.org/10.1016/j.osnem.2022.100208
Doan, T.M., Kille, B., Gulla, J.A.: Using language models for classifying the party affiliation of political texts. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds.) NLDB. LNCS, pp. 382–393. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-08473-7_35
Doan, T.M., Kille, B., Gulla, J.A.: SP-BERT: a language model for political text in Scandinavian languages. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds.) NLDB 2023. LNCS, vol. 13913, pp. 467–477. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-35320-8_34
Golchin, S., Surdeanu, M., Tavabi, N., Kiapour, A.: Do not mask randomly: effective domain-adaptive pre-training by masking in-domain keywords. In: Can, B., et al. (eds.) RepL4NLP. ACL (2023). https://doi.org/10.18653/v1/2023.repl4nlp-1.2
Hardalov, M., Arora, A., Nakov, P., Augenstein, I.: Cross-domain label-adaptive stance detection. In: Moens, M.F., Huang, X., Specia, L., Yih, S.W.T. (eds.) EMNLP. ACL (2021). https://doi.org/10.18653/v1/2021.emnlp-main.710
Hardalov, M., Arora, A., Nakov, P., Augenstein, I.: Few-shot cross-lingual stance detection with sentiment-based pre-training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36 (2022)
Hu, Y., et al.: ConfliBERT: a pre-trained language model for political conflict and violence. In: NAACL. ACL (2022). https://doi.org/10.18653/v1/2022.naacl-main.400
Hvingelby, R., Pauli, A.B., Barrett, M., Rosted, C., Lidegaard, L.M., Søgaard, A.: DaNE: a named entity resource for Danish. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4597–4604 (2020)
Iyyer, M., Enns, P., Boyd-Graber, J., Resnik, P.: Political ideology detection using recursive neural networks. In: ACL (2014). https://doi.org/10.3115/v1/P14-1105
Kannangara, S.: Mining Twitter for fine-grained political opinion polarity classification, ideology detection and sarcasm detection. In: WSDM. ACM (2018). https://doi.org/10.1145/3159652.3170461
Kummervold, P.E., Wetjen, F., De la Rosa, J.: The Norwegian Colossal Corpus: a text corpus for training large Norwegian language models. In: LREC. European Language Resources Association (2022)
Kummervold, P.E., De la Rosa, J., Wetjen, F., Brygfjeld, S.A.: Operationalizing a national digital library: the case for a Norwegian transformer model. In: NoDaLiDa (2021)
Kutuzov, A., Barnes, J., Velldal, E., Øvrelid, L., Oepen, S.: Large-scale contextualised language modelling for Norwegian. In: NoDaLiDa. Linköping University Electronic Press, Sweden (2021)
Lapponi, E., Søyland, M.G., Velldal, E., Oepen, S.: The Talk of Norway: a richly annotated corpus of the Norwegian parliament, 1998–2016. Lang. Resour. Eval. (2018). https://doi.org/10.1007/s10579-018-9411-5
Lin, W.H., Wilson, T., Wiebe, J., Hauptmann, A.: Which side are you on? Identifying perspectives at the document and sentence levels. In: CoNLL-X. ACL (2006)
Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics 8, 726–742 (2020)
Liu, Y., Zhang, X.F., Wegsman, D., Beauchamp, N., Wang, L.: POLITICS: pretraining with same-story article comparison for ideology prediction and stance detection. In: Findings of the Association for Computational Linguistics: NAACL 2022. ACL (2022). https://doi.org/10.18653/v1/2022.findings-naacl.101
Maagerø, E., Simonsen, B.: Norway: Society and Culture. Cappelen Damm Akademisk, 3rd edn. (2022)
Malmsten, M., Börjeson, L., Haffenden, C.: Playing with Words at the National Library of Sweden - Making a Swedish BERT. CoRR abs/2007.01658 (2020). https://arxiv.org/abs/2007.01658
Menini, S., Tonelli, S.: Agreement and disagreement: comparison of points of view in the political domain. In: COLING 2016, the 26th International Conference on Computational Linguistics, pp. 2461–2470 (2016)
M’rabet, Y., Demner-Fushman, D.: HOLMS: alternative summary evaluation with large language models. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 5679–5688 (2020)
Paul, M., Girju, R.: A two-dimensional topic-aspect model for discovering multi-faceted topics. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pp. 545–550. AAAI 2010, AAAI Press (2010)
Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. ACL (2018). https://www.aclweb.org/anthology/W18-6319
Rauh, C., Schwalbach, J.: The ParlSpeech V2 data set: full-text corpora of 6.3 million parliamentary speeches in the key legislative chambers of nine representative democracies (2020). https://doi.org/10.7910/DVN/L4OAKN
Samuel, D., et al.: NorBench – a benchmark for Norwegian language models. In: NoDaLiDa. University of Tartu Library (2023)
Shazeer, N., Stern, M.: Adafactor: adaptive learning rates with sublinear memory cost. In: ICML, pp. 4596–4604. PMLR (2018)
Snæbjarnarson, V., et al.: A warm start and a clean crawled corpus - a recipe for good language models. In: LREC, pp. 4356–4366. ELRA, Marseille, France (2022)
Solberg, P.E., Ortiz, P.: The Norwegian Parliamentary Speech Corpus. arXiv preprint arXiv:2201.10881 (2022)
Steingrímsson, S., Barkarson, S., Örnólfsson, G.T.: IGC-parl: Icelandic corpus of parliamentary proceedings. In: Proceedings of the Second ParlaCLARIN Workshop, pp. 11–17. ELRA, Marseille, France (2020)
Thonet, T., Cabanac, G., Boughanem, M., Pinel-Sauvagnat, K.: VODUM: a topic model unifying viewpoint, topic and opinion discovery. In: Ferro, N., et al. (eds.) ECIR 2016. LNCS, vol. 9626, pp. 533–545. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30671-1_39
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). ELRA (2012)
Vamvas, J., Sennrich, R.: X-Stance: A Multilingual Multi-Target Dataset for Stance Detection. CoRR abs/2003.08385 (2020). https://arxiv.org/abs/2003.08385
Virtanen, A., et al.: Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076 (2019)
Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. In: NAACL. ACL (2021). https://doi.org/10.18653/v1/2021.naacl-main.41
Yang, D., Zhang, Z., Zhao, H.: Learning better masking for better language model pre-training. arXiv preprint arXiv:2208.10806 (2022)
Acknowledgements
This work was done as part of the Trondheim Analytica project and funded under the Digital Transformation program at the Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway. It has also been partly funded by SFI NorwAI (Center for Research-based Innovation, 309834). Model training was supported by Cloud TPUs from Google’s TPU Research Cloud program.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Doan, T.M., Baumgartner, D., Kille, B., Gulla, J.A. (2024). Automatically Detecting Political Viewpoints in Norwegian Text. In: Miliou, I., Piatkowski, N., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XXII. IDA 2024. Lecture Notes in Computer Science, vol 14641. Springer, Cham. https://doi.org/10.1007/978-3-031-58547-0_20
DOI: https://doi.org/10.1007/978-3-031-58547-0_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-58546-3
Online ISBN: 978-3-031-58547-0
eBook Packages: Computer Science, Computer Science (R0)