Abstract
With the rapid growth of digital libraries, e-governance, and Internet applications, huge volumes of documents are being generated, communicated, and archived in compressed form for better storage and transfer efficiency. In such large repositories of compressed documents, frequently used operations such as keyword searching and document retrieval currently have to be carried out after decompression and, subsequently, with the help of OCR. Developing keyword spotting techniques that work directly on compressed documents is therefore a promising and challenging research problem. Against this backdrop, the paper presents a novel approach for searching keywords directly in run-length compressed documents without going through decompression and OCR. The proposed method extracts simple, font-size-invariant features, such as the number of run transitions and the correlation of runs over selected regions of the test words, and matches them against those of the user's query word. Keywords are then spotted in the compressed document based on the matching score. The idea of decompression-free and OCR-free word spotting directly in compressed documents is the major contribution of this paper. The method is evaluated on a data set of compressed documents, and the preliminary results support the proposed idea.
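As a rough illustration of how a run-transition feature can be read from run-length data without reconstructing the pixels, the following minimal Python sketch may help. It is not the authors' implementation: the function names, the fixed-size resampling used to approximate font-size invariance, and the scoring formula are all assumptions made for this example, and the correlation-of-runs feature mentioned in the abstract is not shown.

```python
from typing import List

Row = List[int]  # one scan line of a binary word image: 0 = background, 1 = ink


def run_lengths(row: Row) -> List[int]:
    """Alternating run lengths, starting with a (possibly empty) background run."""
    runs: List[int] = []
    current, count = 0, 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current, count = pixel, 1
    runs.append(count)
    return runs


def transition_count(runs: List[int]) -> int:
    """Number of colour changes along a line, computed from its runs alone."""
    nonzero = [r for r in runs if r > 0]
    return max(len(nonzero) - 1, 0)


def transition_profile(word_runs: List[List[int]], bins: int = 8) -> List[int]:
    """Per-row transition counts, sampled at a fixed number of rows.

    Assumption for this sketch: sampling to `bins` rows stands in for
    font-size normalisation. `word_runs` must contain at least one row.
    """
    counts = [transition_count(r) for r in word_runs]
    return [counts[(i * len(counts)) // bins] for i in range(bins)]


def match_score(a: List[int], b: List[int]) -> float:
    """Hypothetical score: 1.0 for identical profiles, lower as they diverge."""
    diff = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 1.0 - diff / total
```

Horizontal scaling changes the run lengths but not the number of runs in a row, so the per-row transition count is unaffected by it; vertical scaling only repeats rows, which the fixed-size sampling absorbs. A query word and a candidate word would each be reduced to a profile with `transition_profile` and compared with `match_score`.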
Copyright information
© 2017 Springer Science+Business Media Singapore
About this paper
Cite this paper
Javed, M., Nagabhushan, P., Chaudhuri, B.B. (2017). Spotting of Keyword Directly in Run-Length Compressed Documents. In: Raman, B., Kumar, S., Roy, P., Sen, D. (eds) Proceedings of International Conference on Computer Vision and Image Processing. Advances in Intelligent Systems and Computing, vol 459. Springer, Singapore. https://doi.org/10.1007/978-981-10-2104-6_33
DOI: https://doi.org/10.1007/978-981-10-2104-6_33
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2103-9
Online ISBN: 978-981-10-2104-6
eBook Packages: Engineering, Engineering (R0)