Abstract
Sentence encoders encode the semantics of their input, enabling key downstream applications such as classification, clustering, or retrieval. In this paper, we present Serafim, a family of open-source sentence encoders for Portuguese in a range of sizes, suited to different hardware and compute budgets. Each model exhibits state-of-the-art performance and is made openly available under a permissive license, allowing its use for both commercial and research purposes. Besides the sentence encoders, this paper contributes a systematic study and lessons learned concerning the selection criteria for the learning objectives and parameters that support top-performing encoders.
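To make the retrieval use case mentioned in the abstract concrete, the sketch below ranks candidate documents by cosine similarity over embedding vectors. The toy 4-dimensional vectors stand in for the outputs of a sentence encoder such as a Serafim model; in practice the embeddings would be produced by the encoder itself.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for sentence-encoder outputs.
query = np.array([0.9, 0.1, 0.0, 0.2])
corpus = {
    "doc_a": np.array([0.8, 0.2, 0.1, 0.1]),  # semantically close to the query
    "doc_b": np.array([0.0, 0.9, 0.1, 0.0]),  # unrelated
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(corpus, key=lambda d: cosine_sim(query, corpus[d]), reverse=True)
print(ranked[0])  # doc_a ranks first
```

The same pattern scales to real retrieval: encode the corpus once, encode each query at search time, and rank by similarity.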
Notes
- 1.
Confirming this remark, the top-performing model on the SuperGLUE benchmark (https://super.gluebenchmark.com/leaderboard) is, at the time of writing, an encoder, namely the Vega v2 model [30].
- 2.
Serafim models are available at https://huggingface.co/PORTULAN.
- 3.
- 4.
- 5.
- 6.
- 7.
The test set is found in the msmarco-test2019-queries.tsv file, downloaded from the webpage of the 2019 edition of the TREC Deep Learning Track (https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html).
- 8.
Their full identification is given in Table 5.
- 9.
From the stjiris encoder collection [12], a particular encoder was selected whenever it was the top performer among them all on at least one of the test datasets, according to their Hugging Face model cards. Note that Pearson correlation scores are reported there, while Table 3 here reports Spearman correlation scores.
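Since the note above contrasts the two correlation measures, the sketch below illustrates why they can diverge on the same data: Spearman correlation is simply Pearson correlation computed on ranks, so it reaches 1.0 for any monotonic relation, while Pearson rewards only linear ones. The toy scores are illustrative, not taken from the paper.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation coefficient of two sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y) -> float:
    """Spearman correlation: Pearson over ranks (no tie handling needed here)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

# Monotonic but non-linear relation: Spearman is 1.0, Pearson is not.
gold = [1, 2, 3, 4, 5]
pred = [1, 2, 4, 8, 16]
print(round(spearman(gold, pred), 2))  # 1.0
print(round(pearson(gold, pred), 2))   # below 1.0
```

This is why model rankings based on the two measures are not directly comparable.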
References
Bajaj, P., Campos, D., Craswell, N., et al.: MS MARCO: a human generated machine reading comprehension dataset. arXiv:1611.09268 (2018)
Bonifacio, L., Jeronymo, V., Queiroz Abonizio, H., et al.: mMARCO: a multilingual version of the MS MARCO passage ranking dataset. arXiv:2108.13897 (2022)
Carlsson, F., Gogoulou, E., Ylipää, E., Cuba Gyllensten, A., Sahlgren, M.: Semantic re-tuning with contrastive tension. In: ICLR (2021)
Cer, D., Diab, M., et al.: SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th SemEval (2017)
EUbookshop. https://bookshop.europa.eu/
Fonseca, E., Santos, L., Criscuolo, M., Aluísio, S.: ASSIN: avaliação de similaridade semântica e inferência textual. In: 12th PROPOR, pp. 13–15 (2016)
Gomes, J.R.S.: PLUE: Portuguese language understanding evaluation (2020). https://github.com/ju-resplande/PLUE
Henderson, M., et al.: Efficient natural language response suggestion for smart reply (2017)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X: Papers, pp. 79–86 (2005)
Li, X., Li, J.: AnglE-optimized text embeddings. arXiv:2309.12871 (2023)
Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., Zhang, M.: Towards general text embeddings with multi-stage contrastive learning. arXiv:2308.03281 (2023)
Melo, R., Santos, P.A., Dias, J.: A semantic search system for the Supremo Tribunal de Justiça. In: Progress in Artificial Intelligence, pp. 142–154 (2023)
Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: MTEB: massive text embedding benchmark. arXiv:2210.07316 (2022)
Osório, T., et al.: PORTULAN ExtraGLUE datasets and models: kick-starting a benchmark for the neural processing of Portuguese. In: BUCC Workshop (2024)
Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 shared task: a quick overview. In: Proceedings of the 14th PROPOR, pp. 406–412 (2020)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP-IJCNLP, pp. 3982–3992 (2019)
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of EMNLP, pp. 4512–4525 (2020)
Rodrigues, J., Gomes, L., Silva, J., et al.: Advancing neural encoding of Portuguese with Transformer Albertina PT-*. In: Proceedings of EPIA (2023)
Santos, R., Rodrigues, J., Gomes, L., et al.: Fostering the ecosystem of open neural encoders for Portuguese with Albertina PT* family. arXiv:2403.01897 (2024)
Santos, R., Silva, J., et al.: Advancing generative AI for Portuguese with open decoder Gervásio PT*. In: SIGUL workshop (2024)
Solatorio, A.V.: GISTEmbed: guided in-sample selection of training negatives for text embedding fine-tuning. arXiv:2402.16829 (2024)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Intelligent Systems, pp. 403–417 (2020)
STSb Multi MT. https://huggingface.co/datasets/PhilipMay/stsb_multi_mt
Su, J.: CoSENT: a more effective sentence vector scheme than Sentence BERT. https://kexue.fm/archives/8847
Tatoeba. https://tatoeba.org/
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of LREC, pp. 2214–2218 (2012)
Vaswani, A., et al.: Attention is all you need. NeurIPS 30 (2017)
Wang, A., Singh, A., Michael, J., et al.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the EMNLP Workshop BlackboxNLP (2018)
Wang, K., Reimers, N., Gurevych, I.: TSDAE: using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. In: Findings of the Association for Computational Linguistics: EMNLP, pp. 671–688 (2021)
Zhong, Q., et al.: Toward efficient language model pretraining and downstream adaptation via self-evolution: a case study on SuperGLUE. arXiv:2212.01853 (2022)
Acknowledgement
This work was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (PINFRA/22117/2016); ACCELERAT.AI—Multilingual Intelligent Contact Centers, funded by IAPMEI (C625734525-00462629); and IMPROMPT—Image Alteration with Language Prompts, funded by FCT (CPCA-IAC/AV/590897/2023).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gomes, L., Branco, A., Silva, J., Rodrigues, J., Santos, R. (2025). Open Sentence Embeddings for Portuguese with the Serafim PT* Encoders Family. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science(), vol 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_22
DOI: https://doi.org/10.1007/978-3-031-73503-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73502-8
Online ISBN: 978-3-031-73503-5
eBook Packages: Computer Science, Computer Science (R0)