Abstract
Sentence encoders encode the semantics of their input, enabling key downstream applications such as classification, clustering, or retrieval. In this paper, we present Serafim, a family of open-source sentence encoders for Portuguese in a range of sizes, suited to different hardware and compute budgets. Each model exhibits state-of-the-art performance and is made openly available under a permissive license, allowing its use for both commercial and research purposes. Besides the sentence encoders, this paper contributes a systematic study and lessons learned concerning the selection criteria for the learning objectives and parameters that support top-performing encoders.
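To make the retrieval use case mentioned in the abstract concrete, the sketch below ranks candidate documents by cosine similarity over embedding vectors. The toy 4-dimensional vectors stand in for the outputs of a sentence encoder such as a Serafim model; in practice the embeddings would be produced by the encoder itself.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for sentence-encoder outputs.
query = np.array([0.9, 0.1, 0.0, 0.2])
corpus = {
    "doc_a": np.array([0.8, 0.2, 0.1, 0.1]),  # semantically close to the query
    "doc_b": np.array([0.0, 0.9, 0.1, 0.0]),  # unrelated
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(corpus, key=lambda d: cosine_sim(query, corpus[d]), reverse=True)
print(ranked[0])  # doc_a ranks first
```

The same pattern scales to real retrieval: encode the corpus once, encode each query at search time, and rank by similarity.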
Notes
- 1.
Confirming this remark, the top-performing model on the SuperGLUE benchmark (https://super.gluebenchmark.com/leaderboard) is, at the time of writing, an encoder, namely the Vega v2 model [30].
- 2.
Serafim models are available at https://huggingface.co/PORTULAN.
- 3.
- 4.
- 5.
- 6.
- 7.
The test set is found in the msmarco-test2019-queries.tsv file, downloaded from the webpage of the 2019 edition of the TREC Deep Learning Track (https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html).
- 8.
Their full identification is given in Table 5.
- 9.
From the stjiris encoder collection [12], a particular encoder was selected whenever it was the top performer among them all on at least one of the test datasets, according to their Hugging Face model cards. Note that Pearson correlation scores are reported there, while Table 3 here reports Spearman correlation scores.
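Since the note above contrasts the two correlation measures, the sketch below illustrates why they can diverge on the same data: Spearman correlation is simply Pearson correlation computed on ranks, so it reaches 1.0 for any monotonic relation, while Pearson rewards only linear ones. The toy scores are illustrative, not taken from the paper.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation coefficient of two sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y) -> float:
    """Spearman correlation: Pearson over ranks (no tie handling needed here)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

# Monotonic but non-linear relation: Spearman is 1.0, Pearson is not.
gold = [1, 2, 3, 4, 5]
pred = [1, 2, 4, 8, 16]
print(round(spearman(gold, pred), 2))  # 1.0
print(round(pearson(gold, pred), 2))   # below 1.0
```

This is why model rankings based on the two measures are not directly comparable.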
References
Bajaj, P., Campos, D., Craswell, N., et al.: MS MARCO: a human generated machine reading comprehension dataset. arXiv:1611.09268 (2018)
Bonifacio, L., Jeronymo, V., Queiroz Abonizio, H., et al.: mMARCO: a multilingual version of the MS MARCO passage ranking dataset. arXiv:2108.13897 (2022)
Carlsson, F., Gogoulou, E., Ylipää, E., Cuba Gyllensten, A., Sahlgren, M.: Semantic re-tuning with contrastive tension. In: ICLR (2021)
Cer, D., Diab, M., et al.: SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th SemEval (2017)
EUbookshop. https://bookshop.europa.eu/
Fonseca, E., Santos, L., Criscuolo, M., Aluísio, S.: ASSIN: avaliação de similaridade semântica e inferência textual. In: 12th PROPOR, pp. 13–15 (2016)
Gomes, J.R.S.: PLUE: Portuguese language understanding evaluation (2020). https://github.com/ju-resplande/PLUE
Henderson, M., et al.: Efficient natural language response suggestion for smart reply (2017)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X: Papers, pp. 79–86 (2005)
Li, X., Li, J.: AnglE-optimized text embeddings. arXiv:2309.12871 (2023)
Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., Zhang, M.: Towards general text embeddings with multi-stage contrastive learning. arXiv:2308.03281 (2023)
Melo, R., Santos, P.A., Dias, J.: A semantic search system for the Supremo Tribunal de Justiça. In: Progress in Artificial Intelligence, pp. 142–154 (2023)
Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: MTEB: massive text embedding benchmark. arXiv:2210.07316 (2022)
Osório, T., et al.: PORTULAN ExtraGLUE datasets and models: kick-starting a benchmark for the neural processing of Portuguese. In: BUCC Workshop (2024)
Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 shared task: a quick overview. In: Proceedings of the 14th PROPOR, pp. 406–412 (2020)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP-IJCNLP, pp. 3982–3992 (2019)
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of EMNLP, pp. 4512–4525 (2020)
Rodrigues, J., Gomes, L., Silva, J., et al.: Advancing neural encoding of Portuguese with Transformer Albertina PT-*. In: Proceedings of EPIA (2023)
Santos, R., Rodrigues, J., Gomes, L., et al.: Fostering the ecosystem of open neural encoders for Portuguese with Albertina PT* family. arXiv:2403.01897 (2024)
Santos, R., Silva, J., et al.: Advancing generative AI for Portuguese with open decoder Gervásio PT*. In: SIGUL workshop (2024)
Solatorio, A.V.: GISTEmbed: guided in-sample selection of training negatives for text embedding fine-tuning. arXiv:2402.16829 (2024)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Intelligent Systems, pp. 403–417 (2020)
STSb Multi MT. https://huggingface.co/datasets/PhilipMay/stsb_multi_mt
Su, J.: CoSENT: a more effective sentence vector scheme than Sentence BERT. https://kexue.fm/archives/8847
Tatoeba. https://tatoeba.org/
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of LREC, pp. 2214–2218 (2012)
Vaswani, A., et al.: Attention is all you need. NeurIPS 30 (2017)
Wang, A., Singh, A., Michael, J., et al.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the EMNLP Workshop BlackboxNLP (2018)
Wang, K., Reimers, N., Gurevych, I.: TSDAE: using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. In: Findings of the Association for Computational Linguistics: EMNLP, pp. 671–688 (2021)
Zhong, Q., et al.: Toward efficient language model pretraining and downstream adaptation via self-evolution: a case study on SuperGLUE. arXiv:2212.01853 (2022)
Acknowledgement
This work was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (PINFRA/22117/2016); ACCELERAT.AI—Multilingual Intelligent Contact Centers, funded by IAPMEI (C625734525-00462629); and IMPROMPT—Image Alteration with Language Prompts, funded by FCT (CPCA-IAC/AV/590897/2023).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gomes, L., Branco, A., Silva, J., Rodrigues, J., Santos, R. (2025). Open Sentence Embeddings for Portuguese with the Serafim PT* Encoders Family. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science(), vol 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_22
DOI: https://doi.org/10.1007/978-3-031-73503-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73502-8
Online ISBN: 978-3-031-73503-5
eBook Packages: Computer Science, Computer Science (R0)