Abstract
This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward automated extraction of descriptive metadata for digital material. Here, we present results from experiments with human labellers. These experiments were conducted to assist in genre characterisation, to predict the obstacles that an automated system must overcome, and to contribute to building a solid testbed corpus for extending automated genre classification and for testing metadata extraction tools across genres. We also describe the performance of two classifiers, based on image features and stylistic modelling features, in labelling the data on which three human labellers agreed across fifteen genre classes.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, Y., Ross, S. (2007). Searching for Ground Truth: A Stepping Stone in Automating Genre Classification. In: Thanos, C., Borri, F., Candela, L. (eds) Digital Libraries: Research and Development. DELOS 2007. Lecture Notes in Computer Science, vol 4877. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77088-6_24
DOI: https://doi.org/10.1007/978-3-540-77088-6_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77087-9
Online ISBN: 978-3-540-77088-6
eBook Packages: Computer Science (R0)