Abstract
We consider the task of content-based analysis and categorization in large-scale historical book scanning projects. Mixed content, outdated language, noise and unexpected distortions suggest an image-based approach. Keypoint extractors combined with the bag of features approach are applied to scanned text documents. To incorporate spatial information into the bag of features approach, we consider three methods of spatial verification. An approach based on comparing statistics of local keypoint properties such as size, orientation and scale showed comparable quality in content comparison while being computationally much more efficient. Cluster analysis delivers groups of pages characterized by common properties; in particular, duplicated page content is detected with high reliability.
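As a rough illustration of the bag of features idea mentioned in the abstract, the following minimal sketch assigns each local descriptor of a page to its nearest visual word, builds a normalized word histogram, and compares two pages by cosine similarity. It is not the authors' implementation: the function names, the tiny two-dimensional "descriptors" and the three-word vocabulary are assumptions chosen only to keep the example small, and real keypoint descriptors (for example SIFT) are much higher dimensional.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor
    (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_features(descriptors, vocabulary):
    """Count visual-word occurrences for one page and L2-normalize the counts."""
    hist = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1.0
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

def cosine_similarity(hist_a, hist_b):
    """Dot product of two L2-normalized histograms."""
    return sum(a * b for a, b in zip(hist_a, hist_b))

# Hypothetical toy data: a 3-word vocabulary and three "pages".
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
page_a = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.9)]
page_b = [(0.0, 0.1), (1.0, 0.0), (0.9, 0.0), (0.1, 1.0)]   # near-duplicate of page_a
page_c = [(0.0, 1.1), (0.1, 0.9), (0.0, 1.0), (-0.1, 1.0)]  # different content

hist_a = bag_of_features(page_a, vocabulary)
hist_b = bag_of_features(page_b, vocabulary)
hist_c = bag_of_features(page_c, vocabulary)

sim_ab = cosine_similarity(hist_a, hist_b)
sim_ac = cosine_similarity(hist_a, hist_c)
print(f"similarity a-b: {sim_ab:.3f}, a-c: {sim_ac:.3f}")
```

The histogram discards where on the page each keypoint was found, which is why the abstract considers additional spatial verification; this sketch covers only the orderless comparison step.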
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huber-Mörk, R., Schindler, A. (2013). An Image Based Approach for Content Analysis in Document Collections. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2013. Lecture Notes in Computer Science, vol 8034. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41939-3_27
DOI: https://doi.org/10.1007/978-3-642-41939-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41938-6
Online ISBN: 978-3-642-41939-3
eBook Packages: Computer Science, Computer Science (R0)