1. Introduction
Crime is among the most common social problems today, affecting the economic growth and quality of life of any country. It damages a country's international reputation and places a financial burden on the government, which must hire additional police forces. To eradicate crime, the government needs to adopt optimized strategies [1] and sustainable e-governance information systems. Algorithms that predict the occurrence of crimes based on time and location can help the government deploy law enforcement in high-risk areas [2].
Internet-based news resources, such as online newspapers and news channel archives, have increased tremendously in number, volume, and coverage, and they contain useful as well as authentic data [3]. Nevertheless, the archive data are not well arranged or categorized, so it can be quite challenging to extract useful information about specific or interesting crime events [4,5,6]. According to the Pakistan Bureau of Statistics, the crime rate of Pakistan is increasing constantly, and among all crimes, the rates of murder, kidnapping, robbery, accidents, and blasts are high. News archives provide a valuable source of information. They contain rich and purposeful content that is recorded carefully by specialists and portrays the principal aspects of each article [7]. The most popular and authentic newspaper archives of Pakistan are Dawn News, Dunya News, Ary News, The News, Daily Times, Pakistan Press Foundation, The Nations, and Journalism Pakistan [8]. The purpose of this research is to utilize the free-of-cost data available in news archives and to perform spatiotemporal analysis for crime prediction. Natural Language Processing (NLP) is an efficient mechanism for extracting keywords that represent the whole text of a news body, and researchers have used different NLP techniques for mining the data of news web archives [7].
Similarly, geostatistical approaches have been used by different researchers to identify high-risk regions [1,2,9]. Developments in Geographical Information Systems (GIS) have enabled the analysis of spatial data in different domains. GIS-based approaches support the visualization and exploration of incidents by creating map layers of spatial data, which can help detect the patterns and trends of criminal networks. Hence, applying data mining and machine learning to a spatial crime dataset can provide an accurate distribution of crimes for the prediction of future crime events [10]. Such novel crime mapping methods can benefit society in many ways, for example by decreasing the probability of accidents, the crime ratio, and the number of murder cases. Moreover, they can help secure the nation from blasts, kidnappers, and murders [11]. This study focuses on information retrieval from news archives, extraction of attributes from news headlines, and the application of spatial analysis as well as machine learning to predict future crimes.
Crime-solving is a complex task that requires human effort and intelligence for the processing of criminal data. Data mining can therefore assist researchers in crime identification problems [12,13,14]. Extensive research has been done on the use of data mining and machine learning techniques in the identification and prediction of crime events and criminal networks [15]. Many data mining and machine learning tools are available to researchers. Weka is one such tool, which can assist researchers in mining data and applying machine learning algorithms [16]. It is capable of performing preprocessing, feature selection, clustering, and classification on data [17]. In this study, the KNN algorithm predicted the crime type with 92% accuracy.
In today’s world, security is one of the most important services that a government should provide to its citizens. The principal objective of crime mapping is to estimate the probability and ratio of any mishap happening in the country. The objectives of this study include:
To predict crime patterns from news archive data and extract crime information from the news text using freely available tools, for developing and under-developed countries that have a paucity of resources, considering Pakistan as an example.
To help law enforcement agencies anticipate the crime rate by promptly analyzing spatial distribution trends.
To predict the behavior of criminal networks by estimating their next move using machine learning algorithms.
In a nutshell, this study demonstrates the feasibility of applying geospatial methods and machine learning approaches to predict crimes and criminal activities using the eight years of data available in web archives.
The rest of the paper is structured as follows. Section 2 presents the state-of-the-art methods proposed in the literature for the targeted problem. Section 3 presents the overall methodology of our contribution to crime prediction. Spatial crime analysis is performed in Section 4, while crime prediction is performed in Section 5. We perform evaluations in Section 6. The results are discussed in Section 7, and Section 8 concludes the study.
2. Related Work
With the advancement of technology, criminal behavior is becoming increasingly channeled and complex [4]. To control crime, the nature of crime must be understood [18]. Spatial analysis helps to decode the spatial behavior of criminal activities [18] and assists law enforcement in making predictions about future crimes that may occur [1].
Many crime prediction approaches have been proposed by different researchers. Agarwal et al. proposed a crime prediction framework in which crime analysis is performed on crime datasets by k-means clustering using the RapidMiner tool [4]. However, there is still a need to apply machine learning. Kiani et al. proposed a new framework for clustering and crime prediction in which they used the Genetic Algorithm (GA) for the detection of outliers. Their main focus was to classify crime cases based on the frequency of crime occurrence during different years [19]. Reddya et al. used R tools such as RgoogleMaps [1], googleVis [1], ggplot2 [1], and ggmap for the visualization of criminal data. They used the K-Nearest Neighbor (KNN) and Naïve Bayes algorithms to support the prediction of crimes [1]; however, more advanced machine learning methods and spatial analysis could also be applied. I. Matijosaitiene et al. proposed a method of crime prediction using land-use data with the help of machine learning algorithms. They identified the hours at which crimes occur using hotspot analysis and used logistic regression to determine the precise time of the next crime [2], but the prediction results could be enhanced using more advanced machine learning methods.
Malathi et al. proposed a crime prediction model using data mining techniques. The model consisted of data cleaning, data clustering, classification, and outlier detection [12]. In [20], Ivan et al. used GIS to visualize the spatial distribution of accidents along road networks. They identified the spatial patterns of road-side accidents together with their occurrence across hours, days, seasons, and years. Thakali et al. used kernel density estimation and kriging for identifying crash hotspots and estimating the collision frequency, respectively [9]. Haan used kernel density to estimate the concentration at a given point in space [21]. Xue et al. proposed a method of spatial analysis with a latent decision for crime prediction. They designed two different spatial models for crime prediction, namely uniform and distinct. Both models helped in understanding the spatial patterns of crimes and criminal behaviors [11]. In [22], the authors designed a deep neural network for crime prediction using the New York crime dataset, while in [23] Duan et al. predicted the location of crime suspects using spatiotemporal analysis. Hu et al. designed a Bayesian model for urban crime prediction based on regional statistics [24].
Pflueger used a random forest algorithm to predict criminal activities by offenders having a mental illness. This approach can be helpful not only for the judiciary but also for designing new strategies for risk management [25]. Almanie et al. found crime patterns using decision trees and Naïve Bayesian classifiers. They predicted future crime events at a particular location (latitude, longitude) within a specific time interval. They combined demographic information with the findings from the crime datasets of cities and then estimated which factor affects the neighbors the most [26]. Crime hotspot and spatial analysis can help to identify spatial crime patterns. Jangra et al. compared the prediction rate and accuracy of KNN with those of Naïve Bayes over a crime dataset. They compared earlier KNN-based crime prediction scenarios with their proposed Naïve Bayes scenario and found that the two techniques showed different accuracy rates, with Naïve Bayes achieving higher accuracy than KNN. Moreover, they emphasized that such techniques, in combination with spatial datasets, can predict crime-related data efficiently [10].
Table 1 gives a summary of the related work.
The literature shows that several GIS-based approaches have been proposed to identify crime patterns and trends. However, there is a lack of research that predicts location-based crimes in adjoining areas of Pakistan using the free-of-cost data available in news archives. This freely available data can be transformed into useful information using natural language processing algorithms, and prediction can be performed using supervised and unsupervised learning. This kind of research can help identify future crime cases in developing and under-developed countries that have a paucity of financial resources.
4. Spatial Crime Analysis
Crime analysis is defined as the analytical process that identifies patterns and trends in crime data, which assists in deploying strategies and planning for future crime prediction [4]. We performed spatial crime analysis on the spatial data extracted from the web news archives in order to investigate the geographic trends of crime. Spatial crime analysis studies the spatial distribution of crime, that is, whether the crime features are clustered, random, or dispersed. It shows the spatial correlation between the feature points of crimes and identifies the trends among crime patterns. It involves a collection of statistical techniques to discover spatial patterns, spatial clusters, and spatial trends in criminal data. Researchers have reported that crime is not a random activity; instead, it is spatially concentrated in most cases [12]. One objective of spatial analysis is to identify the relocation patterns of criminals. The next move of criminals can be estimated with the help of various geospatial methods such as hotspot analysis.
It is necessary to know how crime data is spatially distributed. To investigate this, we identified the relationship between crime features by applying the average nearest neighbor method to the spatial crime dataset that we extracted from the news archives. Cluster analysis is also used to study the distribution of crimes. We used k-means [36,37] clustering for cluster analysis over the spatial crime data. Clusters form in regions where the crime rate tends to be higher. Pattern analysis also reveals the spatial interaction between locations, which is used to estimate the heterogeneity of crime and its dependence on other factors [12].
4.1. Analysis Using Average Nearest Neighbor
According to the Routine Activity Theory, the behavior patterns of people and their environment have a significant impact on criminal activities. Therefore, identifying and explaining the relationship between neighborhood and crime characteristics is a key aspect [12]. It is necessary to know how the crime data is spatially distributed, i.e., whether it is clustered, random, or dispersed. We used the average nearest neighbor, a statistical tool in ArcGIS, to measure the autocorrelation between crime features in our spatial crime dataset. The average nearest neighbor tool measures the distance from the center of each point to the center of its nearest neighbor and then calculates the average of all these nearest distances. This observed average distance is compared with the expected average distance for a hypothetical random distribution using the formula given in Equation (1). The average nearest neighbor ratio is calculated as the ratio of the observed average distance to the expected average distance.
In the average nearest neighbor method, a nearest neighbor ratio of less than one indicates that the pattern is clustered, as shown in Figure 4. Figure 4 was produced using the average nearest neighbor tool in ArcMap. A ratio greater than one indicates that the pattern is dispersed. Hence, Figure 4 shows that the feature points in our data are spatially clustered.
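The ratio described above can be illustrated with a minimal Python sketch. It is not the ArcGIS implementation; it assumes the commonly documented form of the statistic, in which the expected mean distance for a random pattern of n points in a study area A is 0.5 / sqrt(n / A). The function name, the example coordinates, and the requirement that points are given in projected (planar) units rather than raw latitude and longitude are assumptions made for illustration.

```python
import math


def average_nearest_neighbor_ratio(points, area):
    """Return observed mean nearest-neighbor distance / expected mean distance.

    points: list of (x, y) tuples in planar units (assumption: projected
            coordinates, not raw latitude/longitude degrees).
    area:   size of the study area in the same units squared.
    """
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")

    # Observed mean distance: average over all points of the distance
    # to the closest other point (brute force, fine for small datasets).
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    observed = total / n

    # Expected mean distance under complete spatial randomness.
    expected = 0.5 / math.sqrt(n / area)
    return observed / expected


# Hypothetical example: four incidents packed into one corner of a
# 100 x 100 study area.
tight_cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
ratio = average_nearest_neighbor_ratio(tight_cluster, area=100.0 * 100.0)
print("nearest neighbor ratio:", ratio)
```

A ratio below one from such a computation corresponds to the clustered case discussed above, and a ratio above one to the dispersed case. The result depends strongly on the study area supplied, so the area should be chosen consistently across comparisons.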
4.2. Clustering Using K-Means Clustering
Clustering is a data mining technique that groups objects into sets with similar features or properties, where each set differs from the others in its behavior [4]. Analyzing the clusters can help predict crimes based on their spatial distribution [12]. In this study, we used the k-Means algorithm to cluster the crime dataset because it is applicable to large datasets and is less complex than other clustering algorithms [19]. The Weka tool was used to perform k-Means clustering. In k-Means clustering, k clusters are formed from n observations based on the nearest mean. The process of k-Means clustering involves:
Declare the number of clusters, k.
Choose the initial center of each cluster.
Assign each instance to the nearest cluster.
Recalculate the centroids of the clusters.
Repeat the assignment and recalculation steps until the clusters no longer change.
Table 4 shows the centroids of each cluster formed by the k-Means algorithm. The data is divided into eight clusters, numbered 0 to 7. Table 5 shows the distribution of clusters based on the crime type. Cluster names are assigned based on the centroid. Figure 5 was built using the Weka tool and illustrates the crime clusters with respect to their latitude. We included only the centroid of the central cluster in Table 4, which was obtained from k-Means clustering in Weka and helps to identify the ratio of different crimes among the cities of Pakistan.
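The k-Means steps listed above can be sketched in plain Python as follows. This is an illustrative sketch, not Weka's SimpleKMeans: the function name, the random initialization, the Euclidean distance on raw coordinates, and the sample points are all assumptions. In this study k was eight; the example uses k = 2 only to stay small.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Steps 1-2: declare k and choose k distinct instances as initial centers.
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iter):
        # Step 3: assign each instance to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 4: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))

    return centroids, assignments


# Hypothetical incident coordinates forming two well-separated groups.
incidents = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(incidents, k=2)
print(centers, labels)
```

The centroids returned by such a procedure play the role of the cluster centers reported in Table 4. Because the result depends on the initial centers, a fixed seed is used here so that repeated runs are comparable.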
7. Results and Discussion
Crime prediction is one of the most challenging tasks, especially when the availability of criminal report data is not up to the mark [45,46]. Electronic media is one of the most powerful tools that can provide accurate data and is useful for conducting such research. Data mining tools helped organize the data into an understandable format, which led to meaningful information about crime patterns and their relationships.
In our prediction model, we used two machine learning algorithms for the prediction of crime events on the archive dataset. The results of these two algorithms were compared in terms of accuracy and prediction. The average accuracy of the KNN and Random Forest was observed as
and
, respectively. This indicates that KNN predicted more accurately and efficiently than Random Forest. Table 6 and Table 7 show the results of both algorithms along with their parameters. Table 6 shows the values of accuracy, precision, recall, and F-measure for different parameters of the KNN algorithm. The results show that the metric values increase as K increases, and the maximum values were obtained when K was equal to 9. Table 7 lists the number of trees as the parameter and the corresponding values of accuracy, precision, recall, and F-measure. We achieved higher metric values with a higher number of trees in the random forest algorithm.
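For readers who want to reproduce the four measures reported in Table 6 and Table 7, the following sketch computes accuracy and per-class precision, recall, and F-measure from their standard definitions. It is not the Weka evaluation routine; the function name and the toy labels are hypothetical, and how the per-class values are averaged into a single figure (for example, weighted by class size) is left open because the averaging used in the tables is not stated here.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, {label: (precision, recall, f_measure)})."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_class[label] = (precision, recall, f_measure)

    return accuracy, per_class


# Hypothetical true and predicted crime types for four test records.
actual = ["robbery", "robbery", "murder", "murder"]
predicted = ["robbery", "murder", "murder", "murder"]
acc, scores = classification_metrics(actual, predicted)
print(acc, scores)
```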
KNN gives the most accurate predictions because it can reduce the adverse effects caused by improper classification of features and reduce classification errors [5]. In this method, the surrounding samples are used to classify each sample; the class of an unknown sample is predicted from the classes of its nearest neighbour samples. Distances were computed between the unknown samples of the test data and the samples of the training data. Each unknown test sample was then assigned the class of the training samples at the smallest distance [6]. The high accuracy of KNN may be explained by the fact that the algorithm classifies on the basis of the distance between points; considering points with nearby crimes in the archived dataset may lead to higher accuracy.
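The nearest-neighbour classification described above can be sketched as follows. This is a minimal illustration rather than the Weka classifier used in the study: the function name, the use of two coordinate features, the Euclidean distance, the majority vote, and the sample records are assumptions. The default k = 9 mirrors the best-performing value reported in Table 6.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=9):
    """Predict the label of `query` by majority vote among its k nearest
    training samples (Euclidean distance)."""
    if len(train_points) != len(train_labels) or not train_points:
        raise ValueError("training data must be non-empty and aligned")

    # Distance from the unknown sample to every training sample.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the k closest samples (all of them if fewer than k exist).
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent class among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training records: coordinates and the crime type reported there.
locations = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
crime_types = ["robbery", "robbery", "robbery", "murder", "murder", "murder"]
print(knn_predict(locations, crime_types, query=(0.5, 0.5), k=3))
```

With k = 1 the sketch reduces to assigning the class of the single closest training sample, which matches the smallest-distance rule described in the text; larger values of k smooth the decision over more neighbours.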
An automated duplicate-removal process can improve the data extraction process. Similarly, the use of advanced machine learning methods, such as reinforcement learning and deep learning algorithms, may give better results. Moreover, automatic geo-coding methods for the extraction of precise locations can identify the exact location of a crime. Such an integrated model will help decision-makers and law enforcement agencies predict crime locations more precisely. As mentioned earlier, the challenge in this research was to extract data efficiently without any cost or ground survey. Such an automated process is quite useful for developing and under-developed countries where geospatial data is not maintained or shared.
8. Conclusions
The use of digital information archives is a cost-effective way of predicting crime events occurring in a country. Crime-related data extracted with automated tools such as Python can be converted into useful information for prediction. The location-based geo-coded data, processed through GIS-based software, i.e., ArcMap, provided location-based statistical information, which helped identify the patterns, trends, and relationships between crime features. Furthermore, the hotspot analysis assisted in identifying the areas and regions of high susceptibility. Such research can help law enforcement agencies monitor highly sensitive areas and remain on high alert in terms of security. The results of the KNN and Random Forest algorithms indicated that robbery is the most severe problem in Pakistan compared with other crimes. Such a robust method can be an effective way to keep an eye on risk-prone areas. In conclusion, automated processes of this kind can open new ways of eradicating crime and maintaining a peaceful and sustainable society in developing and under-developed countries that have a paucity of financial resources.
Due to limitations of time, data availability, and resources, we were only able to extract a limited dataset, i.e., 900 crime records at the city level. There is some uncertainty in the number of crime cases because the data was extracted from particular news archives. Adding other electronic resources, mainly those in the local language, can increase the accuracy of the dataset. Moreover, the unpredictability and uncertainty of the crime rate is still a challenge for researchers and decision-makers. This is because various other factors affect the crime rate simultaneously, such as criminal mental state, poverty, low income, unemployment, illiteracy, family pressure, and bad company [47,48]. By adding socioeconomic data, precise crime locations, and data from other electronic resources, a useful prediction model can be developed. In addition, the demographic data (population density) of Pakistan can help improve crime prediction by showing how the population distribution is associated with the crime rate. Similarly, accounting for other potential biases, such as information bias, can produce more fruitful results for crime prediction.