Page Classification for Meta-data Extraction from Digital Collections

  • Conference paper
Database and Expert Systems Applications (DEXA 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2113)

Abstract

Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful way to increase the accessibility of these digital collections. Classifying the page layout into a set of pre-defined classes can improve this extraction. In this paper we describe a method for classifying document images on the basis of their physical layout, which is described by means of a hierarchical representation: the Modified X-Y tree. The Modified X-Y tree describes a document by a recursive segmentation that alternates horizontal and vertical cuts along either spaces or lines. Each internal node of the tree represents a separator (a space or a line), whereas leaves represent regions in the page or separating lines. The Modified X-Y tree is built from a symbolic description of the document rather than directly from the image. The tree is then encoded into a fixed-size representation that takes into account occurrences of tree-patterns in the tree representing the page. Lastly, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to Digital Libraries; examples of classes considered for a journal are “title page”, “index”, and “regular page”. The system is tested on a data-set of more than 600 pages belonging to a journal of the 19th century.
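
The encoding step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which tree-patterns are counted or how nodes are labelled, so everything below is an assumption made for illustration: the label names (`HS`, `VS`, `HL`, `VL`, `text`, `image`, `line`), the `Node` class, and the choice of parent-child label pairs as the patterns are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical label set, assumed for illustration only.
INTERNAL = ("HS", "VS", "HL", "VL")  # cut along a horizontal/vertical space or line
LEAVES = ("text", "image", "line")   # a page region or a separating line


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def parent_child_patterns(node):
    """Yield a (parent label, child label) pair for every edge of the tree."""
    for child in node.children:
        yield (node.label, child.label)
        yield from parent_child_patterns(child)


# Fixed pattern vocabulary: a parent is always a separator, a child may be any label.
# Its length does not depend on the page, which gives a fixed-size vector.
VOCAB = [(p, c) for p in INTERNAL for c in INTERNAL + LEAVES]


def encode(root):
    """Count each vocabulary pattern in the tree and return the counts in VOCAB order."""
    counts = Counter(parent_child_patterns(root))
    return [counts[pattern] for pattern in VOCAB]


# Toy page: a heading, then a block split by a horizontal line,
# whose lower part is cut into two text columns.
page = Node("HS", [
    Node("text"),
    Node("HL", [
        Node("text"),
        Node("line"),
        Node("VS", [Node("text"), Node("text")]),
    ]),
])

vector = encode(page)
```

Here `vector` has one entry per pattern in `VOCAB`, so every page maps to a vector of the same length regardless of how deep or wide its tree is. Such a vector could then be passed to any classifier; the paper uses an artificial neural network, whose architecture is not given in the abstract.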

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cesarini, F., Lastri, M., Marinai, S., Soda, G. (2001). Page Classification for Meta-data Extraction from Digital Collections. In: Mayr, H.C., Lazansky, J., Quirchmayr, G., Vogel, P. (eds) Database and Expert Systems Applications. DEXA 2001. Lecture Notes in Computer Science, vol 2113. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44759-8_10

  • DOI: https://doi.org/10.1007/3-540-44759-8_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42527-4

  • Online ISBN: 978-3-540-44759-7

  • eBook Packages: Springer Book Archive
