Searching for Ground Truth: A Stepping Stone in Automating Genre Classification

  • Conference paper
Digital Libraries: Research and Development (DELOS 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4877)

Abstract

This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward the automated extraction of descriptive metadata for digital material. Here, we present results from experiments with human labellers. These experiments were conducted to assist in genre characterisation, to predict the obstacles an automated system will need to overcome, and to contribute to a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers based on image and stylistic modelling features in labelling the data on which three human labellers agreed across fifteen genre classes.
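As a rough illustration of how a ground-truth set might be derived from the agreement of three labellers, the sketch below keeps a document only when enough labellers chose the same genre. The abstract does not say whether agreement means unanimity or a majority, so the `min_votes` threshold, the function name `agreed_label`, the document identifiers, and the genre names are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def agreed_label(labels: Sequence[str], min_votes: int = 3) -> Optional[str]:
    """Return the genre chosen by at least `min_votes` labellers, else None.

    Assumption: `min_votes=3` models unanimous agreement among three
    labellers; `min_votes=2` would model a simple majority instead.
    """
    if not labels:
        return None
    genre, votes = Counter(labels).most_common(1)[0]
    return genre if votes >= min_votes else None


# Hypothetical annotations: three labels per document (illustrative only).
annotations: Dict[str, List[str]] = {
    "doc1": ["Academic Paper", "Academic Paper", "Academic Paper"],
    "doc2": ["Manual", "Manual", "Technical Report"],
    "doc3": ["Poster", "Slides", "Abstract"],
}

# Keep only the documents on which the labellers agreed.
ground_truth = {
    doc: label
    for doc, labels in annotations.items()
    if (label := agreed_label(labels)) is not None
}
print(ground_truth)
```

Under the unanimous setting only `doc1` is retained; lowering `min_votes` to 2 would also admit `doc2`. Which threshold is appropriate depends on how strict the testbed corpus needs to be.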

Editor information

Costantino Thanos, Francesca Borri, Leonardo Candela

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, Y., Ross, S. (2007). Searching for Ground Truth: A Stepping Stone in Automating Genre Classification. In: Thanos, C., Borri, F., Candela, L. (eds) Digital Libraries: Research and Development. DELOS 2007. Lecture Notes in Computer Science, vol 4877. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77088-6_24

  • DOI: https://doi.org/10.1007/978-3-540-77088-6_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77087-9

  • Online ISBN: 978-3-540-77088-6

  • eBook Packages: Computer Science, Computer Science (R0)
