Abstract
Crime is a major threat to society, affecting the normal life of human beings all over the world, and freeing the world from criminal activity in all its aspects is of great importance. The main motivation of this work is to understand crime patterns so that future crime events can be predicted and prevented. Although research on these problems is ongoing, to the best of our knowledge no existing work handles the roughness or ambiguity that exists in crime reports. The present work extracts all possible crime features from crime reports and selects only the features essential for crime pattern analysis. For this purpose, it develops a purely supervised feature selection model that integrates rough set theory and graph theory (a spanning tree of a directed weighted graph). The crime reports are preprocessed, and crime features are extracted to represent each report as a feature vector (i.e., a set of distinct crime features). For crime pattern analysis, the main objective of our work, not all extracted features are essential; a minimal subset of relevant features is sufficient. Thus, feature selection is the main contribution of the paper: it not only enhances the efficiency of the subsequent mining process but also increases its correctness. A rough set theory-based relative indiscernibility relation is defined to measure the similarity between two features relative to the crime type. Based on the similarity scores, a weighted directed graph is constructed whose nodes are the features and whose edge from feature u to feature v is weighted by the inverse of the similarity score of v to u.
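The graph-construction step can be illustrated with a minimal sketch. Note that the concrete similarity measure below is an assumption standing in for the paper's relative indiscernibility relation: here two features are scored by how often they agree on the indiscernibility of object pairs that share a decision (crime-type) value, and edge weights are the inverse of that score. The example records and feature names are hypothetical.

```python
from itertools import combinations

def relative_similarity(objects, u, v, d):
    """Hypothetical similarity of features u and v relative to decision d:
    fraction of object pairs with equal decision value on which u and v
    agree about whether the pair is indiscernible."""
    pairs = [(x, y) for x, y in combinations(objects, 2) if x[d] == y[d]]
    if not pairs:
        return 0.0
    agree = sum(1 for x, y in pairs
                if (x[u] == y[u]) == (x[v] == y[v]))
    return agree / len(pairs)

def build_graph(objects, features, d):
    """Directed weighted graph over the features: edge (u, v) is weighted
    by the inverse of the similarity score (edges with zero similarity
    are omitted)."""
    graph = {}
    for u in features:
        for v in features:
            if u != v:
                s = relative_similarity(objects, u, v, d)
                if s > 0:
                    graph[(u, v)] = 1.0 / s
    return graph

# Toy crime reports as feature vectors with a decision feature "type".
reports = [
    {"weapon": "knife", "place": "street", "time": "night", "type": "assault"},
    {"weapon": "gun",   "place": "bank",   "time": "day",   "type": "robbery"},
    {"weapon": "knife", "place": "street", "time": "day",   "type": "assault"},
    {"weapon": "gun",   "place": "shop",   "time": "night", "type": "robbery"},
]
g = build_graph(reports, ["weapon", "place", "time"], "type")
```

With these toy records, "weapon" and "place" agree on one of the two same-decision pairs, giving similarity 0.5 and hence edge weight 2.0, while "weapon" and "time" never agree, so that edge is absent.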
Then, a minimum spanning tree (termed the rough-spanning tree) is generated from the constructed directed graph using the Edmonds/Chu–Liu algorithm, and the importance of each node in the spanning tree is measured using the weights of its incident edges and its in-degree and out-degree. Finally, a feature selection algorithm is proposed that iteratively selects the most important node and removes it from the spanning tree until the modified graph (not necessarily a tree) becomes a null graph. The selected nodes are taken as the feature subset sufficient for crime pattern analysis. The method is evaluated using various statistical measures and compared with related state-of-the-art methods to demonstrate its effectiveness in crime pattern analysis. The Wilcoxon rank-sum test, a popular nonparametric counterpart of the two-sample t test, is performed to show that the proposed supervised model is statistically significant.
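The iterative selection step can be sketched as follows. The importance measure used here, the total inverse edge weight over a node's incident edges, is an assumption standing in for the paper's actual degree- and weight-based measure; the toy tree is hypothetical.

```python
def select_features(tree_edges):
    """Iteratively pick the most 'important' node of a directed weighted
    tree and delete it (with its incident edges) until no edges remain,
    i.e., the graph becomes a null graph. Importance (hypothetical):
    sum of inverse weights over all incident edges, so nodes touching
    many low-weight (high-similarity) edges score highest."""
    edges = dict(tree_edges)            # {(u, v): weight}
    selected = []
    while edges:
        importance = {}
        for (u, v), w in edges.items():
            importance[u] = importance.get(u, 0.0) + 1.0 / w
            importance[v] = importance.get(v, 0.0) + 1.0 / w
        # Highest importance wins; ties broken lexicographically.
        best = max(sorted(importance), key=lambda n: importance[n])
        selected.append(best)
        edges = {e: w for e, w in edges.items() if best not in e}
    return selected

# Toy rough-spanning tree: two edges rooted at "weapon".
tree = {("weapon", "place"): 2.0, ("weapon", "time"): 4.0}
picked = select_features(tree)
```

In this toy tree, "weapon" has importance 0.75 (0.5 + 0.25) versus 0.5 for "place" and 0.25 for "time", so it is selected first; removing it deletes both edges and the loop terminates.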
Notes
Throughout the paper, the term feature is used instead of conditional feature.
As the relative indiscernibility relation is defined on conditional features relative to the decision feature, conditional features more correlated to the decision feature are selected. Thus, the conditional features that are independent of each other and correlated to the decision feature are selected, which is the main objective of our feature selection algorithm.
References
Abdi H, Williams LJ (2010) Principal component analysis. WIREs Comput Stat 2(4):433–459
Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537–550
Bazlamaçcı CF, Hindi KS (2001) Minimum-weight spanning tree algorithms: a survey and empirical study. Comput Oper Res 28(8):767–785
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):1–12
Broutin N, Devroye L, McLeish E (2008) Note on the structure of Kruskal’s algorithm. Algorithmica 56(2):141
Chu YJ, Liu TH (1965) On the shortest arborescence of a directed graph. Sci Sin 14:1396–1400
Csardi G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Syst 1695:1–9. http://igraph.org
Das AK, Sengupta S, Bhattacharyya S (2018) A group incremental feature selection for classification using rough set theory based genetic algorithm. Appl Soft Comput 65(C):400–411
Das P, Das AK (2017) In: 8th international conference on computing, communication and networking technologies (ICCCNT), pp 1–6
Deo N (1974) Graph theory with applications to engineering and computer science. Prentice-Hall Inc, Upper Saddle River
Fazayeli F, Wang L, Mandziuk J (2008) Feature selection based on the rough set theory and expectation-maximization clustering algorithm. In: Chan C-C, Grzymala-Busse JW, Ziarko WP (eds) Rough sets and current trends in computing. Springer, Berlin, pp 272–282
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18
Hu XT, Lin TY, Han J (2003) A new rough sets model based on database systems. In: Wang G, Liu Q, Yao Y, Skowron A (eds) Rough sets, fuzzy sets, data mining, and granular computing. Springer, Berlin, pp 114–121
Huda RK, Banka H (2018) Efficient feature selection and classification algorithm based on PSO and rough sets. Neural Comput Appl. https://doi.org/10.1007/s00521-017-3317-9
Jalil MMA, Ling CP, Noor NMM, Mohd F (2017a) Knowledge representation model for crime analysis. Procedia Comput Sci 116:484–491
Jalil MMA, Mohd F, Noor NMM (2017b) A comparative study to evaluate filtering methods for crime data feature selection. Procedia Comput Sci 116:113–120
Janeela Theresa MM, Joseph Raj V (2016) A maximum spanning tree-based dynamic fuzzy supervised neural network architecture for classification of murder cases. Soft Comput 20(6):2353–2365
Edmonds J (1967) Optimum branchings. J Res Natl Bur Stand 71B:233–240
Keerthika T, Premalatha K (2016) Rough set reduct algorithm based feature selection for medical domain. J Chem Pharm Sci 9(2):896–902
Lehrmann A, Huber M, Polatkan AC, Pritzkau A, Nieselt K (2013) Visualizing dimensionality reduction of systems biology data. Data Min Knowl Discov 27(1):146–165
Loper E, Bird S (2002) NLTK: the natural language toolkit. In: Proceedings of the ACL-02 workshop on effective tools and methodologies for teaching natural language processing and computational linguistics, vol 1, pp 63–70
Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99(12):7821–7826
Maji P, Paul S (2011) Rough set based maximum relevance-maximum significance criterion and gene selection from microarray data. Int J Approx Reason 52(3):408–426
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. CoRR. arXiv:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed representations of words and phrases and their compositionality. CoRR. arXiv:1310.4546
Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582
Pawlak Z (1982) Rough sets. Int J Comput Inf Sci 11(5):341–356
Pawlak Z (1998) Rough set theory and its applications to data analysis. Cybern Syst 29(7):661–688
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Rosvall M, Axelsson D, Bergstrom CT (2009) The map equation. Eur Phys J Spec Top 178(1):13–23
Sabu MK (2018) A rough set based feature selection approach for the prediction of learning disabilities. Int J Adv Comput Eng Netw 2(12):43–48
Sengupta S, Das AK (2012) Single reduct generation based on relative indiscernibility of rough set theory. Int J Soft Comput 3(1):107–119
Shalabi LA (2017) Perceptions of crime behavior and relationships: rough set based approach. Int J Comput Sci Inf Secur 15(3):413–420
Singh B, Sankhwar JS, Vyas OP (2014) Optimization of feature selection method for high dimensional data using Fisher score and minimum spanning tree. In: 2014 annual IEEE India conference (INDICON), pp 1–6
Song Q, Ni J, Wang G (2013) A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans Knowl Data Eng 25(1):1–14
Bird S, Klein E, Loper E (2009) Natural language processing with Python. O’Reilly Media, Sebastopol
Jerald Beno TR, Karnan M (2012) Dimensionality reduction: rough set based feature reduction. Int J Sci Res Publ 2(9):1–6
Taha K, Yoo PD (2017) Using the spanning tree of a criminal network for identifying its leaders. IEEE Trans Inf Forensics Secur 12(2):445–453
Weng J, Young DS (2017) Some dimension reduction strategies for the analysis of survey data. J Big Data 4(1):43
Yager RR, Alajlan N (2015) Dempster–Shafer belief structures for decision making under uncertainty. Knowl Based Syst 80(C):58–66
Yang HH, Moody J (1999) Feature selection based on joint mutual information. In: Proceedings of international ICSC symposium on advances in intelligent data analysis, pp 22–25
Alapati YK, Sindhu SSK (2015) Relevant feature selection from high-dimensional data using MST based clustering. Int J Emerg Trends Sci Technol 2(3):1997–2001
Zadeh L (1965) Fuzzy sets. Inf Control 8(3):338–353
Zaher AA, Berretta R, Arefin AS, Moscato P (2015) FSMEC: a feature selection method based on the minimum spanning tree and evolutionary computation. In: Proceedings of the 13th Australasian data mining conference (AusDM 2015), pp 129–139
Zhang M, Yao JT (2004) A rough sets based approach to feature selection. In: IEEE annual meeting of the fuzzy information processing society (NAFIPS 2004), vol 1, pp 434–439
Ethics declarations
Conflict of interest
The authors declare that this manuscript has no conflict of interest with any other published source and has not been published previously (partly or in full). No data have been fabricated or manipulated to support our conclusions.
About this article
Cite this article
Das, P., Das, A.K. & Nayak, J. Feature selection generating directed rough-spanning tree for crime pattern analysis. Neural Comput & Applic 32, 7623–7639 (2020). https://doi.org/10.1007/s00521-018-3880-8