
Open Sentence Embeddings for Portuguese with the Serafim PT* Encoders Family

  • Conference paper
Progress in Artificial Intelligence (EPIA 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14969)


Abstract

Sentence encoders encode the semantics of their input, enabling key downstream applications such as classification, clustering, or retrieval. In this paper, we present Serafim, a family of open-source sentence encoders for Portuguese in a range of sizes suited to different hardware and compute budgets. Each model exhibits state-of-the-art performance and is made openly available under a permissive license, allowing its use for both commercial and research purposes. Beyond the sentence encoders themselves, this paper contributes a systematic study and lessons learned concerning the selection criteria for learning objectives and parameters that yield top-performing encoders.
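As a sketch of the retrieval use case named in the abstract, the snippet below ranks a toy corpus by cosine similarity between precomputed embedding vectors. The vectors here are invented for illustration; in practice they would be produced by one of the Serafim encoders (e.g. via the sentence-transformers library), not computed by hand.

```python
import numpy as np

def cosine_sim(a, b):
    # Normalize rows to unit length, then take dot products:
    # the result is the cosine of the angle between each pair of vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy corpus embeddings (made-up 3-dimensional vectors standing in for
# real sentence embeddings, which would have hundreds of dimensions).
corpus = np.array([
    [0.90, 0.10, 0.00],
    [0.00, 1.00, 0.10],
    [0.10, 0.00, 0.95],
])
query = np.array([[0.85, 0.15, 0.05]])

scores = cosine_sim(query, corpus)[0]   # one similarity score per corpus entry
best = int(np.argmax(scores))           # index of the most similar entry
```

The same ranking-by-cosine mechanic underlies semantic search over real encoder outputs; only the source of the vectors changes.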



Notes

  1.

     As confirmation of this remark, the top-performing model on the SuperGLUE benchmark (https://super.gluebenchmark.com/leaderboard) at the time of writing is an encoder, namely the Vega v2 model [30].

  2.

    Serafim models are available at https://huggingface.co/PORTULAN.

  3.

    https://www.sbert.net/docs/pretrained_models.html.

  4.

    https://huggingface.co/spaces/mteb/leaderboard.

  5.

    https://huggingface.co/jmbrito/ptbr-similarity-e5-small.

  6.

    https://huggingface.co/mteb-pt.

  7.

     The test set is found in the msmarco-test2019-queries.tsv file, downloaded from the webpage of the 2019 edition of the TREC Deep Learning Track (https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html).

  8.

     Their full identification is given in Table 5.

  9.

     From the stjiris encoder collection [12], an encoder was selected whenever it was the top performer among them all on at least one of the test datasets, according to their Hugging Face model cards. It is worth noting that Pearson correlation scores are reported there, whereas Table 3 here reports Spearman correlation scores.
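Since the note above contrasts Pearson and Spearman scores, the following minimal sketch (plain NumPy, with invented data) illustrates why the two are not interchangeable: Spearman is Pearson computed over ranks, so any strictly monotone relation scores a perfect 1.0 even when the linear (Pearson) correlation is lower.

```python
import numpy as np

def pearson(x, y):
    # Standard Pearson correlation: cosine of the centered vectors.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def spearman(x, y):
    # Spearman = Pearson over the ranks of the data (no ties in this toy case).
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

# A monotone but non-linear relation: ranks agree perfectly,
# so Spearman is 1.0 while Pearson stays below 1.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
```

This is why scores from a model card reporting Pearson cannot be compared directly against a table reporting Spearman.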

References

  1. Bajaj, P., Campos, D., Craswell, N., et al.: MS MARCO: a human generated machine reading comprehension dataset. arXiv:1611.09268 (2018)

  2. Bonifacio, L., Jeronymo, V., Queiroz Abonizio, H., et al.: mMARCO: a multilingual version of the MS MARCO passage ranking dataset. arXiv:2108.13897 (2022)

  3. Carlsson, F., Gogoulou, E., Ylipää, E., Cuba Gyllensten, A., Sahlgren, M.: Semantic re-tuning with contrastive tension. In: ICLR (2021)


  4. Cer, D., Diab, M., et al.: SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of the 11th SemEval (2017)


  5. EUbookshop. https://bookshop.europa.eu/

  6. Fonseca, E., Santos, L., Criscuolo, M., Aluísio, S.: ASSIN: avaliação de similaridade semântica e inferência textual. In: 12th PROPOR, pp. 13–15 (2016)


  7. Gomes, J.R.S.: PLUE: Portuguese language understanding evaluation (2020). https://github.com/ju-resplande/PLUE

  8. Henderson, M., et al.: Efficient natural language response suggestion for smart reply (2017)


  9. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X: Papers, pp. 79–86 (2005)


  10. Li, X., Li, J.: AnglE-optimized text embeddings. arXiv:2309.12871 (2023)

  11. Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., Zhang, M.: Towards general text embeddings with multi-stage contrastive learning. arXiv:2308.03281 (2023)

  12. Melo, R., Santos, P.A., Dias, J.: A semantic search system for the Supremo Tribunal de Justiça. In: Progress in Artificial Intelligence, pp. 142–154 (2023)


  13. Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: MTEB: massive text embedding benchmark. arXiv:2210.07316 (2022)

  14. Osório, T., et al.: PORTULAN ExtraGLUE datasets and models: kick-starting a benchmark for the neural processing of Portuguese. In: BUCC Workshop (2024)


  15. Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 shared task: a quick overview. In: Proceedings of the 14th PROPOR, pp. 406–412 (2020)


  16. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP-IJCNLP, pp. 3982–3992 (2019)


  17. Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of EMNLP, pp. 4512–4525 (2020)


  18. Rodrigues, J., Gomes, L., Silva, J., et al.: Advancing neural encoding of Portuguese with Transformer Albertina PT-*. In: Proceedings of EPIA (2023)


  19. Santos, R., Rodrigues, J., Gomes, L., et al.: Fostering the ecosystem of open neural encoders for Portuguese with Albertina PT* family. arXiv:2403.01897 (2024)

  20. Santos, R., Silva, J., et al.: Advancing generative AI for Portuguese with open decoder Gervásio PT*. In: SIGUL workshop (2024)


  21. Solatorio, A.V.: GISTEmbed: guided in-sample selection of training negatives for text embedding fine-tuning. arXiv preprint arXiv:2402.16829 (2024)

  22. Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Intelligent Systems, pp. 403–417 (2020)


  23. STSb Multi MT. https://huggingface.co/datasets/PhilipMay/stsb_multi_mt

  24. Su, J.: CoSENT: a more effective sentence vector scheme than Sentence BERT. https://kexue.fm/archives/8847

  25. Tatoeba. https://tatoeba.org/

  26. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of LREC, pp. 2214–2218 (2012)


  27. Vaswani, A., et al.: Attention is all you need. NeurIPS 30 (2017)


  28. Wang, A., Singh, A., Michael, J., et al.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the EMNLP Workshop BlackboxNLP (2018)


  29. Wang, K., Reimers, N., Gurevych, I.: TSDAE: using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. In: Findings of the Association for Computational Linguistics: EMNLP, pp. 671–688 (2021)


  30. Zhong, Q., et al.: Toward efficient language model pretraining and downstream adaptation via self-evolution: a case study on SuperGLUE. arXiv:2212.01853 (2022)


Acknowledgement

This work was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (PINFRA/22117/2016); ACCELERAT.AI—Multilingual Intelligent Contact Centers, funded by IAPMEI (C625734525-00462629); and IMPROMPT—Image Alteration with Language Prompts, funded by FCT (CPCA-IAC/AV/590897/2023).

Author information

Correspondence to Luís Gomes.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Gomes, L., Branco, A., Silva, J., Rodrigues, J., Santos, R. (2025). Open Sentence Embeddings for Portuguese with the Serafim PT* Encoders Family. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol. 14969. Springer, Cham. https://doi.org/10.1007/978-3-031-73503-5_22


  • DOI: https://doi.org/10.1007/978-3-031-73503-5_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73502-8

  • Online ISBN: 978-3-031-73503-5

  • eBook Packages: Computer Science (R0)
