Abstract
Many crime reports are available online in blogs and newswire sources. Although manually annotating this large volume of reports is tedious, the reports together give an overall picture of crime worldwide. This motivates us to propose a framework for crime data analysis based on online reports. The method first extracts the crime reports and identifies the named entities in them. The intermediate sequence of context words between every consecutive pair of named entities is termed a crime vector, and it captures the relationship between the two entities. Feature vectors for each entity pair are generated from these crime vectors using the Word2Vec model. The paper considers three different types of named entity pairs to support the major crime data analysis tasks, and for each type, the similarity between every two entity pairs is measured using their respective feature vectors. For each type of named entity pair, a separate weighted graph is generated, with entity pairs as vertices and the similarity score between them as the weight of the corresponding edge. Infomap, a graph-based clustering algorithm, is then applied to obtain an optimal set of clusters of entity pairs and a representative entity pair for each cluster. Each cluster is labelled with the relationship, represented by the crime vector, of its representative entity pair. In practice, not all entity pairs in a cluster are contextually similar to their representative entity pair. The clusters are therefore further partitioned into subclusters using a WordNet-based path similarity measure, which makes the entity pairs in each subcluster more contextually similar to one another than they were in the original cluster. These subclusters provide various statistical crime information over the time period covered. The method is evaluated only on crime reports concerning crime against women in India. The experimental results demonstrate the effectiveness of the method and its superiority over other methods for analysing crime data.
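The first stages described in the abstract (crime vectors between consecutive named entities, a feature vector per entity pair, and a similarity-weighted graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token-span entity format, the averaging of word vectors into a feature vector, and the tiny hand-written `toy_embeddings` table (standing in for a trained Word2Vec model) are all assumptions made for illustration. The Infomap clustering and WordNet subclustering steps are not shown.

```python
from itertools import combinations
from math import sqrt


def crime_vectors(tokens, entities):
    """Return ((entity1, entity2), context_words) for consecutive entities.

    `entities` is a list of (start, end, label) token spans, sorted by
    position, with `end` exclusive. This span format is an assumption.
    """
    result = []
    for (s1, e1, _), (s2, e2, _) in zip(entities, entities[1:]):
        pair = (" ".join(tokens[s1:e1]), " ".join(tokens[s2:e2]))
        result.append((pair, tokens[e1:s2]))  # words between the two entities
    return result


def mean_vector(words, embeddings):
    """Average the embeddings of the known words (an assumed aggregation)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(features, threshold=0.0):
    """Build weighted edges between entity pairs whose similarity exceeds threshold."""
    edges = {}
    for (a, u), (b, v) in combinations(features.items(), 2):
        weight = cosine(u, v)
        if weight > threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical toy data; a real run would use a trained Word2Vec model.
toy_embeddings = {
    "arrested": [1.0, 0.0],
    "detained": [0.9, 0.1],
    "in": [0.0, 1.0],
}
tokens = "Police arrested Ramesh Kumar in Delhi".split()
entities = [(0, 1, "ORG"), (1 + 1, 4, "PERSON"), (5, 6, "LOC")]

features = {}
for pair, context in crime_vectors(tokens, entities):
    vec = mean_vector(context, toy_embeddings)
    if vec is not None:
        features[pair] = vec

graph = similarity_graph(features)
```

The resulting `graph` dictionary maps two entity pairs to their similarity weight, which is the kind of weighted graph a community-detection algorithm such as Infomap would then take as input.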
Ethics declarations
Conflict of interest
The authors declare that this manuscript has no conflict of interest with any other published source and has not been published previously (partly or in full). No data have been fabricated or manipulated to support our conclusion.
Cite this article
Das, P., Das, A.K., Nayak, J. et al. A framework for crime data analysis using relationship among named entities. Neural Comput & Applic 32, 7671–7689 (2020). https://doi.org/10.1007/s00521-019-04150-8
DOI: https://doi.org/10.1007/s00521-019-04150-8