Abstract
This paper presents a complete processing workflow for extracting information from French census lists from 1836 to 1936. These lists contain information about individuals living in France and their households. We aim to extract all the information contained in these tables using automatic handwritten table recognition. At the end of the Socface project, within which this work takes place, the extracted information will be redistributed to the departmental archives, and the nominative lists will be freely available to the public, allowing anyone to browse hundreds of millions of records. The extracted data will be used by demographers to analyze social change over time, significantly improving our understanding of French economic and social structures. For this project, we developed a workflow in four stages: large-scale data collection from French departmental archives, collaborative annotation of documents, training of handwritten table text and structure recognition models, and mass processing of millions of images.
We present the tools we have developed to easily collect and process millions of pages. We also show that it is possible to process this wide variety of tables with a single table recognition model that uses the image of the entire page to recognize information about individuals, categorize them and automatically group them into households. The entire workflow has been successfully applied to the documents of a departmental archive, representing more than 450,000 images.
Acknowledgments
The Socface project is funded by the French National Research Agency (ANR) under the fund ANR-21-CE38-0013. This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011013446 made by GENCI and was partially funded by the ACADIIE project “Compréhension automatique des documents d’archives pour l’extraction d’informations individuelles” supported by a grant overseen by the French National Research Agency (ANR) as part of the France Relance program.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Boillet, M., Tarride, S., Schneider, Y., Abadie, B., Kesztenbaum, L., Kermorvant, C. (2024). The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_4
DOI: https://doi.org/10.1007/978-3-031-70543-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70542-7
Online ISBN: 978-3-031-70543-4
eBook Packages: Computer Science, Computer Science (R0)