The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses

Boillet, Mélodie; Tarride, Solène; Schneider, Yoann; Abadie, Bastien; Kesztenbaum, Lionel; Kermorvant, Christopher

doi:10.1007/978-3-031-70543-4_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14806)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

This paper presents a complete processing workflow for extracting information from French census lists dating from 1836 to 1936. These lists contain information about individuals living in France and their households. We aim to extract all the information contained in these tables using automatic handwritten table recognition. At the end of the Socface project, of which this work is a part, the extracted information will be redistributed to the departmental archives, and the nominative lists will be freely available to the public, allowing anyone to browse hundreds of millions of records. Demographers will use the extracted data to analyze social change over time, significantly improving our understanding of French economic and social structures. The workflow we developed for this project covers four stages: large-scale data collection from French departmental archives, collaborative annotation of documents, training of handwritten table text and structure recognition models, and mass processing of millions of images.

We present the tools we have developed to easily collect and process millions of pages. We also show that it is possible to process such a wide variety of tables with a single table recognition model that uses the image of the entire page to recognize information about individuals, categorize them and automatically group them into households. The entire process has been successfully used to process the documents of a departmental archive, representing more than 450,000 images.
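As a rough illustration of the household grouping step mentioned above, the sketch below splits an ordered list of recognized rows into households. It is not the authors' method: the abstract states that their single model groups individuals directly from the page image. The `Person` schema, the `role` field, and the rule that a row labelled "chef" (head of household) opens a new household are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Person:
    """One recognized table row (hypothetical schema, not taken from the paper)."""
    surname: str
    first_name: str
    role: str  # assumed label for position in the household, e.g. "chef", "femme", "fils"


@dataclass
class Household:
    members: list[Person] = field(default_factory=list)


def group_households(rows: list[Person], head_label: str = "chef") -> list[Household]:
    """Open a new household at every row whose role matches head_label.

    Rows that appear before any head (e.g. a household continued from the
    previous page) are collected into a first household of their own.
    """
    households: list[Household] = []
    for person in rows:
        is_head = person.role.strip().lower() == head_label
        if is_head or not households:
            households.append(Household())
        households[-1].members.append(person)
    return households


# Invented sample rows, in reading order down the page.
rows = [
    Person("Martin", "Jean", "chef"),
    Person("Martin", "Marie", "femme"),
    Person("Martin", "Paul", "fils"),
    Person("Durand", "Louis", " Chef"),
    Person("Durand", "Anne", "fille"),
]
households = group_households(rows)
sizes = [len(h.members) for h in households]
```

A rule of this kind only works when the head-of-household label is present and correctly recognized, which is one reason a model that predicts the grouping jointly with the text can be attractive.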

Acknowledgments

The Socface project is funded by the French National Research Agency (ANR) under the fund ANR-21-CE38-0013. This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011013446 made by GENCI and was partially funded by the ACADIIE project “Compréhension automatique des documents d’archives pour l’extraction d’informations individuelles” supported by a grant overseen by the French National Research Agency (ANR) as part of the France Relance program.

Author information

Authors and Affiliations

TEKLIA, Paris, France
Mélodie Boillet, Solène Tarride, Yoann Schneider, Bastien Abadie & Christopher Kermorvant
Institut National d’Etudes Démographiques (INED) and Paris School of Economics (PSE), Paris, France
Lionel Kesztenbaum

Corresponding author

Correspondence to Solène Tarride.

Editor information

Editors and Affiliations

Luleå Tekniska Universitet, Luleå, Sweden
Elisa H. Barney Smith
Luleå Tekniska Universitet, Luleå, Sweden
Marcus Liwicki
Tsinghua University, Beijing, China
Liangrui Peng

Cite this paper

Boillet, M., Tarride, S., Schneider, Y., Abadie, B., Kesztenbaum, L., Kermorvant, C. (2024). The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_4

DOI: https://doi.org/10.1007/978-3-031-70543-4_4
Published: 09 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70542-7
Online ISBN: 978-3-031-70543-4
eBook Packages: Computer Science; Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition
