The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses

  • Conference paper
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Abstract

This paper presents a complete processing workflow for extracting information from French census lists from 1836 to 1936. These lists contain information about individuals living in France and their households. We aim to extract all the information contained in these tables using automatic handwritten table recognition. At the end of the Socface project, of which this work is a part, the extracted information will be redistributed to the departmental archives, and the nominative lists will be freely available to the public, allowing anyone to browse hundreds of millions of records. Demographers will use the extracted data to analyze social change over time, significantly improving our understanding of French economic and social structures. For this project, we developed a complete processing workflow: large-scale data collection from French departmental archives, collaborative annotation of documents, training of handwritten table text and structure recognition models, and mass processing of millions of images.

We present the tools we have developed to easily collect and process millions of pages. We also show that it is possible to process such a wide variety of tables with a single table recognition model that uses the image of the entire page to recognize information about individuals, categorize them and automatically group them into households. The entire process has been successfully used to process the documents of a departmental archive, representing more than 450,000 images.
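
As a rough illustration of the household grouping step mentioned above, the following Python sketch splits the rows recognized on one page into consecutive households. It is not the authors' implementation: the `Person` and `Household` classes, their field names, and the `starts_household` flag are all hypothetical, and the sketch assumes the recognition model emits one record per individual in reading order together with a marker for the first member of each household.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """One recognized table row (field names are illustrative assumptions)."""
    surname: str
    first_name: str
    starts_household: bool  # hypothetical marker predicted by the model


@dataclass
class Household:
    members: list[Person] = field(default_factory=list)


def group_into_households(rows: list[Person]) -> list[Household]:
    """Split a page's rows, given in reading order, into consecutive households."""
    households: list[Household] = []
    for person in rows:
        # Open a new household on an explicit marker, or when none exists yet.
        if person.starts_household or not households:
            households.append(Household())
        households[-1].members.append(person)
    return households


rows = [
    Person("Martin", "Jean", True),
    Person("Martin", "Marie", False),
    Person("Durand", "Paul", True),
]
households = group_into_households(rows)
print([len(h.members) for h in households])  # prints [2, 1]
```

Grouping by a per-row marker keeps the step linear in the number of rows and independent of page layout. A real system would also need to handle households that continue across page boundaries, which this sketch ignores.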


Notes

  1. https://socface.site.ined.fr/.

  2. https://docs.ultralytics.com/tasks/classify/.

  3. https://pyslurm.github.io/.


Acknowledgments

The Socface project is funded by the French National Research Agency (ANR) under grant ANR-21-CE38-0013. This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011013446 made by GENCI and was partially funded by the ACADIIE project “Compréhension automatique des documents d’archives pour l’extraction d’informations individuelles” supported by a grant overseen by the French National Research Agency (ANR) as part of the France Relance program.

Author information

Corresponding author

Correspondence to Solène Tarride.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Boillet, M., Tarride, S., Schneider, Y., Abadie, B., Kesztenbaum, L., Kermorvant, C. (2024). The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_4

  • DOI: https://doi.org/10.1007/978-3-031-70543-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70542-7

  • Online ISBN: 978-3-031-70543-4

  • eBook Packages: Computer Science, Computer Science (R0)
