Abstract
Document representation is a key problem in document analysis and processing tasks such as document classification, clustering and information retrieval. For unstructured text data in particular, the choice of document representation method directly affects the performance of the subsequent algorithms in applications and research. In this paper, we propose a novel document representation method based on word co-occurrence, called the conditional co-occurrence degree matrix document representation method (CCODM). CCODM considers not only the co-occurrence of terms but also their conditional dependencies in a specific context, so that more of the useful structural and semantic information in the original documents is retained. Extensive classification experiments with different supervised and unsupervised feature selection methods show that the proposed method, CCODM, achieves better performance than the vector space model, latent Dirichlet allocation, the general co-occurrence matrix representation method and the document embedding method.





References
Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39(5):4760–4768. doi:10.1016/j.eswa.2011.09.160
Benabdeslem K, Elghazel H, Hindawi M (2016) Ensemble constrained laplacian score for efficient and robust semi-supervised feature selection. Knowl Inf Syst 49(3):1161–1185. doi:10.1007/s10115-015-0901-0
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. doi:10.1109/TPAMI.2013.50
Bengio Y, Schwenk H, Senécal J, Morin F, Gauvain J (2003) Neural probabilistic language models. J Mach Learn Res 3(6):1137–1155. doi:10.1162/153244303322533223, http://dl.acm.org/citation.cfm?id=944919.944966
Bernotas M, Laurutis R (2007) The peculiarities of the text document representation, using ontology and tagging-based clustering technique. J Inf Technol Control 36(2):217–220
Grün B, Hornik K (2017) topicmodels: an R package for fitting topic models. Version 0.2-6. doi:10.18637/jss.v040.i13
Bhushan S, Danti A (2017) Classification of text documents based on score level fusion approach. Pattern Recognit Lett 94:118–126. doi:10.1016/j.patrec.2017.05.003
Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022, http://dl.acm.org/citation.cfm?id=944919.944937
Boulares M, Jemni M (2016) Learning sign language machine translation based on elastic net regularization and latent semantic analysis. Artif Intell Rev 46(2):145–166. doi:10.1007/s10462-016-9460-3
Bullinaria J, Levy J (2012) Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behav Res Methods 44(3):890–907. doi:10.3758/s13428-011-0183-8
Cambria E, Gastaldo P, Bisio F, Zunino R (2015) An ELM-based model for affective analogical reasoning. Neurocomputing 149:443–455. doi:10.1016/j.neucom.2014.01.064
Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941. doi:10.1109/TKDE.2014.2313872
Du Y, Liu W, Lv X, Peng G (2015) An improved focused crawler based on semantic similarity vector space model. Appl Soft Comput 36:392–407. doi:10.1016/j.asoc.2015.07.026
Farahat A, Kamel M (2011) Statistical semantics for enhancing document clustering. Knowl Inf Syst 28(2):365–393. doi:10.1007/s10115-010-0367-z
Franco-Salvador M, Gupta P, Rosso P, Banchs R (2016) Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowl Based Syst 111:87–99. doi:10.1016/j.knosys.2016.08.004
Hsu C, Huang W (2016) Integrated dimensionality reduction technique for mixed-type data involving categorical values. Appl Soft Comput 43:199–209. doi:10.1016/j.asoc.2016.02.015
Huang H, Kuo Y (2010) Cross-lingual document representation and semantic similarity measure: a fuzzy set and rough set based approach. IEEE Trans Fuzzy Syst 18(6):1098–1111. doi:10.1142/S0218001411008890
Ibrahim O, Landa-Silva D (2016) Term frequency with average term occurrences for textual information retrieval. Soft Comput 20(8):3045–3061. doi:10.1007/s00500-015-1935-7
Jin L, Gong W, Fu W, Wu H (2015) A text classifier of English movie reviews based on information gain. In: The 3rd international conference on applied computing and information technology/2nd international conference on computational science and intelligence, pp 454–457. doi:10.1109/ACIT-CSI.2015.86
Johnson-Laird P, Oatley K (1989) The language of emotions: an analysis of a semantic field. Cogn Emot 3(3):81–123. doi:10.1080/02699938908408075
Keikha M, Khonsari A, Oroumchian F (2009) Rich document representation and classification: an analysis. Knowl Based Syst 22(1):67–71. doi:10.1016/j.knosys.2008.06.002
Lau R, Xia Y, Ye Y (2014) A probabilistic generative model for mining cybercriminal networks from online social media. IEEE Comput Intell Mag 9(1):31–43. doi:10.1109/MCI.2013.2291689
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196
Li J, Li J, Fu X, Masud M, Huang J (2016) Learning distributed word representation with multi-contextual mixed embedding. Knowl Based Syst 106:220–230. doi:10.1016/j.knosys.2016.05.045
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/
Liaw A, Wiener M (2015) Package 'randomForest'. Breiman and Cutler's random forests for classification and regression. Version 4.6-12. https://www.stat.berkeley.edu/~breiman/RandomForests/
Liu Q, Zhang H, Yu H, Cheng X (2004) Chinese lexical analysis using cascaded hidden Markov model. J Comput Res Dev 41(8):1421–1429
Liu Z, Yu W, Deng Y, Bian Z (2010) A feature selection method for document clustering based on part-of-speech and word co-occurrence. In: 2010 Seventh international conference on fuzzy systems and knowledge discovery, vol 5, pp 2331–2334. doi:10.1109/FSKD.2010.5569827
Lopez-Gazpio I, Maritxalar M, Gonzalez-Agirre A, Rigau G, Uria L, Agirre E (2017) Interpretable semantic textual similarity: finding and explaining differences between sentences. Knowl Based Syst 119:186–199. doi:10.1016/j.knosys.2016.12.013
Lu Y, Mei Q, Zhai C (2011) Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf Retr J 14(2):178–203. doi:10.1007/s10791-010-9141-9
Lu M, Zhao X, Zhang L, Li F (2016) Semi-supervised concept factorization for document clustering. Inf Sci 331:86–98. doi:10.1016/j.ins.2015.10.038
Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space, pp 1–12. arXiv preprint arXiv:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119
Neubig G, Watanabe T (2016) Optimization for statistical machine translation: a survey. Comput Linguist 42(1):1–54. doi:10.1162/COLI_a_00241
Nguyen A, Yosinski J, Clune J (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 427–436, http://arxiv.org/abs/1412.1897
Pessiot J, Kim Y, Amini M, Gallinari P (2010) Improving document clustering in a learned concept space. Inf Process Manag 46(2):180–192. doi:10.1016/j.ipm.2009.09.007
Phan X, Nguyen C, Le D, Nguyen L, Horiguchi S, Ha Q (2011) A hidden topic-based framework toward building applications with short web documents. IEEE Trans Knowl Data Eng 23(7):961–976. doi:10.1109/TKDE.2010.27
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, pp 45–50
Ravi D, Bober M, Farinella G, Guarnera M, Battiato S (2016) Semantic segmentation of images exploiting DCT based features and random forest. Pattern Recognit 52:260–273. doi:10.1016/j.patcog.2015.10.021
Ren F, Sohrab M (2013) Class-indexing-based term weighting for automatic text classification. Inf Sci 236:109–125. doi:10.1016/j.ins.2013.02.029
Rule A, Cointet J, Bearman P (2015) Lexical shifts, substantive changes, and continuity in State of the Union discourse. Proc Natl Acad Sci USA 112(35):10837–10844. doi:10.1073/pnas.1512221112
Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. doi:10.1145/361219.361220
Tang G, Xia Y, Sun J, Zhang M, Zheng TF (2015) Statistical word sense aware topic models. Soft Comput 19(1):13–27
Trovati M, Bessis N (2016) An influence assessment method based on co-occurrence for topologically reduced big data sets. Soft Comput 20(5):2021–2030. doi:10.1007/s00500-015-1621-9
Vila M, Bardera A, Feixas M, Sbert M (2011) Tsallis mutual information for document classification. Entropy 13(9):1694–1707. doi:10.3390/e13091694
Wang H (2015) Study on the application of feature selection for big text data using expected cross entropy. J Inf Comput Sci 12(18):6835–6843. doi:10.12733/jics20150077
Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-Test feature selection approach based on term frequency for text categorization. Pattern Recognit Lett 45(11):1–10. doi:10.1016/j.patrec.2014.02.013
Wang D, Shen H, Truong Y (2016a) Efficient dimension reduction for high-dimensional matrix-valued data. Neurocomputing 190:25–34. doi:10.1016/j.neucom.2015.12.096
Wang D, Zhang H, Liu R, Liu X, Wang J (2016b) Unsupervised feature selection through Gram–Schmidt orthogonalization—a word co-occurrence perspective. Neurocomputing 173(P3):845–854. doi:10.1016/j.neucom.2015.08.038
Wu Z, Zhu H, Li G, Cui Z, Huang H, Li J, Chen E, Xu G (2017) An efficient Wikipedia semantic matching approach to text document classification. Inf Sci 393:15–28. doi:10.1016/j.ins.2017.02.009
Xiao Q, Song R (2017) Motion retrieval based on motion semantic dictionary and HMM inference. Soft Comput 21(1):255–265. doi:10.1007/s00500-016-2059-4
Xu H, Zhang F, Wang W (2015) Implicit feature identification in Chinese reviews using explicit topic mining model. Knowl Based Syst 76:166–175. doi:10.1016/j.knosys.2014.12.012
Yan H, Yang J (2014) Joint laplacian feature weights learning. Pattern Recognit 47(3):1425–1432. doi:10.1016/j.patcog.2013.09.038
Yang Y, Pedersen J (1997) A comparative study on feature selection in text categorization. In: Proceedings of fourteenth international conference on machine learning (ICML), vol 4, pp 412–420. http://dl.acm.org/citation.cfm?id=645526.657137
Zheng Y, Han W, Zhu C (2014) A novel feature selection method based on category distribution and phrase attributes. In: International conference on trustworthy computing and services (ISCTCS), Berlin, Heidelberg, pp 25–32. doi:10.1007/978-3-662-47401-3_4
Zhou Q, Zhou H, Li T (2016) Cost-sensitive feature selection using random forest: selecting low-cost subsets of informative features. Knowl Based Syst 95:1–11. doi:10.1016/j.knosys.2015.11.010
Acknowledgements
This work was supported in part by the Natural Science Foundation of China [Grant Numbers 71771034, 71501023, 71421001] and the Open Program of the State Key Laboratory of Software Architecture [Item Number SKLSAOP1703]. We are also grateful to Dr. Deqing Wang (Wang et al. 2016b) for providing the complete code of RP-GSO and to Dr. Xiangzhu Meng for guidance on the doc2vec experiments. Finally, we would like to thank the anonymous reviewers for their constructive comments on this paper.
Ethics declarations
Conflict of interest
Wei Wei, Chonghui Guo and Lin Tang have received research grants from Neusoft Corporation (Shenyang, PR China). Jingfeng Chen and Leilei Sun declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Communicated by V. Loia.
Appendix
Term frequency (TF) The term frequency of term t in the corpus is computed by
$$TF(t)=\sum _{k=1}^{N} freq(t|d^k),$$
where \(freq(t|d^k)\) represents the number of occurrences of term t in document \(d^k\) and N is the number of documents in the corpus.
Document frequency (DF) The document frequency of term t in the corpus is designed as
$$DF(t)=\frac{\left| \{d^k \mid t\in T^k,\ 1\le k\le N\}\right| }{N},$$
where the numerator represents the number of documents containing term t in the corpus and \(T^k\) is the feature set of document \(d^k\).
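To make these two statistics concrete, here is a minimal Python sketch of TF and DF over a toy tokenized corpus; the corpus contents and the names term_freq and doc_freq are our own illustrative choices, not code from the paper.

```python
from collections import Counter

# Toy corpus: each document d^k is a list of tokens after preprocessing.
corpus = [["word", "cooccurrence", "word"],
          ["document", "representation", "word"],
          ["topic", "model", "document"]]

def term_freq(t, corpus):
    """TF(t): occurrences of term t summed over all documents."""
    return sum(Counter(doc)[t] for doc in corpus)

def doc_freq(t, corpus):
    """DF(t): fraction of documents whose feature set contains term t."""
    return sum(1 for doc in corpus if t in set(doc)) / len(corpus)

print(term_freq("word", corpus))  # 3
print(doc_freq("word", corpus))   # 0.666...: 2 of 3 documents contain "word"
```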
Information gain (IG) The information gain of term t in the corpus is defined as
$$IG(t)=-\sum _{i=1}^{C} p(c_i)\log p(c_i)+p(t)\sum _{i=1}^{C} p(c_i|t)\log p(c_i|t)+p(\bar{t})\sum _{i=1}^{C} p(c_i|\bar{t})\log p(c_i|\bar{t}),$$
where \(c_i\) represents the \(i\)th document category in the corpus, C is the total number of category labels in the corpus and \(\bar{t}\) means that term t does not occur. \(p(c_i)\) is the probability of category \(c_i\) in the corpus, p(t) is the probability that a document contains term t, and \(p(\bar{t})\) is the probability that a document does not contain term t. \(p(c_i | t)\) is the conditional probability of the \(i\)th category given that term t occurs, and \(p(c_i | \bar{t})\) is the conditional probability of the \(i\)th category given that term t does not occur. It is worth emphasizing that these symbols keep the same definitions throughout the remainder of this appendix.
For convenience of calculation, we define \(A_i(t)\) as the number of documents containing term t and belonging to category \(c_i\), \(B_i(t)\) as the number of documents belonging to category \(c_i\) but not containing term t, and \(C_i(t)\) as the number of documents containing term t but not belonging to category \(c_i\). Note that \(A_i(t)+C_i(t)\) equals the total number of documents containing term t and is therefore the same for every i. The information gain of term t in the corpus is then calculated by
$$IG(t)=-\sum _{i=1}^{C}\frac{A_i(t)+B_i(t)}{N}\log \frac{A_i(t)+B_i(t)}{N}+\frac{A_i(t)+C_i(t)}{N}\sum _{i=1}^{C}\frac{A_i(t)}{A_i(t)+C_i(t)}\log \frac{A_i(t)}{A_i(t)+C_i(t)}+\frac{N-A_i(t)-C_i(t)}{N}\sum _{i=1}^{C}\frac{B_i(t)}{N-A_i(t)-C_i(t)}\log \frac{B_i(t)}{N-A_i(t)-C_i(t)}.$$
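As an illustration only (not the authors' code), the count-based IG above can be implemented as follows; the inputs A and B hold the per-category counts \(A_i(t)\) and \(B_i(t)\), and all names are our own.

```python
import math

def information_gain(A, B, N):
    """IG(t) from per-category document counts.

    A[i] = A_i(t): documents in category c_i that contain t.
    B[i] = B_i(t): documents in category c_i that do not contain t.
    sum(A) equals A_i(t) + C_i(t), the number of documents containing t.
    """
    n_t = sum(A)  # documents containing t
    ig = 0.0
    for Ai, Bi in zip(A, B):
        p_ci = (Ai + Bi) / N
        if p_ci > 0:            # -sum_i p(c_i) log p(c_i)
            ig -= p_ci * math.log(p_ci)
        if Ai > 0:              # p(t) sum_i p(c_i|t) log p(c_i|t)
            ig += (n_t / N) * (Ai / n_t) * math.log(Ai / n_t)
        if Bi > 0 and N > n_t:  # p(not t) sum_i p(c_i|not t) log p(c_i|not t)
            ig += ((N - n_t) / N) * (Bi / (N - n_t)) * math.log(Bi / (N - n_t))
    return ig

# Example: 3 categories, N = 10, term t concentrated in the first category.
print(information_gain(A=[4, 1, 0], B=[1, 2, 2], N=10))
```

Zero counts are skipped under the usual convention that 0·log 0 = 0.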
Mutual information (MI) The mutual information between term t and category \(c_i\) is formulated by
$$MI(t,c_i)=\log \frac{p(t,c_i)}{p(t)p(c_i)},$$
where \(p(t, c_i)\) is the joint probability that a document contains term t and belongs to category \(c_i\). Moreover, the MI of term t with respect to the whole corpus can be expressed as the average of the MI of the term with each category in the corpus, which is formulated by
$$MI(t)=\sum _{i=1}^{C} p(c_i)\,MI(t,c_i).$$
For convenient calculation, we adopt the same definitions of \(A_i(t)\), \(B_i(t)\) and \(C_i(t)\) as in the calculation of IG above. Therefore, the MI of term t in the corpus is formulated by
$$MI(t)=\sum _{i=1}^{C}\frac{A_i(t)+B_i(t)}{N}\log \frac{A_i(t)\,N}{\left( A_i(t)+C_i(t)\right) \left( A_i(t)+B_i(t)\right) }.$$
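A corresponding sketch for the averaged MI, under the same count conventions as the IG sketch above; the zero-count handling is our own choice (in practice a smoothing constant is often added).

```python
import math

def mutual_information(A, B, N):
    """Average MI(t): sum_i p(c_i) * log(A_i*N / ((A_i+C_i)(A_i+B_i)))."""
    n_t = sum(A)  # A_i(t) + C_i(t): documents containing t, same for all i
    mi = 0.0
    for Ai, Bi in zip(A, B):
        if Ai > 0:  # skip categories with A_i(t) = 0 to avoid log(0)
            mi += ((Ai + Bi) / N) * math.log(Ai * N / (n_t * (Ai + Bi)))
    return mi

print(mutual_information(A=[4, 1, 0], B=[1, 2, 2], N=10))
```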
Expected cross-entropy (ECE) The expected cross-entropy of term t is defined as
$$ECE(t)=p(t)\sum _{i=1}^{C} p(c_i|t)\log \frac{p(c_i|t)}{p(c_i)}.$$
With the definitions of \(A_i(t)\), \(B_i(t)\) and \(C_i(t)\) used in the calculation of IG and MI above, the ECE of term t in the corpus can be conveniently rewritten as
$$ECE(t)=\frac{A_i(t)+C_i(t)}{N}\sum _{i=1}^{C}\frac{A_i(t)}{A_i(t)+C_i(t)}\log \frac{A_i(t)\,N}{\left( A_i(t)+C_i(t)\right) \left( A_i(t)+B_i(t)\right) }.$$
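Finally, a sketch for ECE under the same conventions; as before, the function name and the zero-count handling are our own illustrative choices.

```python
import math

def expected_cross_entropy(A, B, N):
    """ECE(t) = p(t) * sum_i p(c_i|t) * log(p(c_i|t) / p(c_i))."""
    n_t = sum(A)  # number of documents containing t
    if n_t == 0:
        return 0.0
    ece = 0.0
    for Ai, Bi in zip(A, B):
        if Ai > 0:  # terms with p(c_i|t) = 0 contribute nothing (0 * log 0 = 0)
            ece += (Ai / n_t) * math.log((Ai / n_t) / ((Ai + Bi) / N))
    return (n_t / N) * ece

print(expected_cross_entropy(A=[4, 1, 0], B=[1, 2, 2], N=10))
```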
Random projection and Gram–Schmidt orthogonalization (RP-GSO) The original paper (Wang et al. 2016b) gives a detailed description of the unsupervised feature selection model RP-GSO, so we do not reproduce it here.
Cite this article
Wei, W., Guo, C., Chen, J. et al. CCODM: conditional co-occurrence degree matrix document representation method. Soft Comput 23, 1239–1255 (2019). https://doi.org/10.1007/s00500-017-2844-8