1. Introduction
Global terrorism is the use of intentional violence against civilians and neutral militants for political, cultural, and social gains around the world [1]. As a result, individuals, business owners, governments, and private and public organizations are at risk [2]. The administrative expense of combating global terrorism is extremely high [3,4]. For example, the 9/11 terrorist attacks caused an economic loss of approximately US$123 billion, while the London bombings cost the UK £2 billion [5]. In the era of big data, the advent of the latest technologies, including social media, mobile technologies, and cloud drives, has led to the generation of voluminous data of many different types. Criminals also use all of these platforms to exploit information, harm society, and conduct criminal activities globally, leading to financial loss, cyber war, and cybercrime. Fighting terrorism around the world is a major challenge for governments, social justice organizations, and security agencies. This work is, to the best of our knowledge, the first attempt to retrieve sensitive data from big data through the use of Hadoop, annotations, lemmatization, TF-IDF, and SVD. As we identify criminal data within a huge mass of unstructured data sets using a distributed computing environment, this work constitutes a milestone for national and international agencies in combating global terrorism.
Terrorism has gained global attention due to its negative impact on the economy [6] and its effect on developing countries. It also violates international law and human rights; therefore, it cannot be tolerated or justified, and it must be prevented and fought at national and international levels [7,8,9]. Terrorists try to achieve political or social objectives through the use of violence by individuals or groups, while also attempting to attract the attention of an audience far larger than the immediate target. This creates unwanted pressure on governments for political or personal benefit. In [10], sensitive data is defined as the data related to murder, criminals, burglaries, explosions, domestic violence, drug dealers, thieves, and cybercriminals that is present in big data. Terrorists typically use techniques such as bombing, kidnapping, assassination, suicide attacks, and plane hijacking [11]. Terrorist organizations form alliances with other terrorist groups, resulting in a strong and active network of terrorist organizations. This type of relationship increases a group's capacity and its ability to coordinate deadly attacks, and it is even more harmful when groups establish relationships with the strongest organizations to which they can gain access [12]. Although the National Consortium for the Study of Terrorism and Responses to Terrorism (START) [13] maintains a network of scholars studying the causes and consequences of terrorism, monitoring terrorist groups to identify criminal information in big data and to estimate the potential risk to countries, cities, and societies remains a global challenge.
Big data is the collection of large data sets generated from many domains [14], including social media, cloud drives, sensors, geospatial locations, and mobile technologies. These data come in unstructured, semi-structured, and structured formats [15]. Big data is characterized by the seven Vs: volume, velocity, variety, veracity, variability, visualization, and value [16]. Volume represents the size of the data sets, velocity the rate at which data is transmitted, variety the different types of data sets, veracity the anomalies and uncertainties in the data, variability the data whose meaning is constantly changing, visualization the illustration of relationships among data, and, finally, value the context and importance of the data [17]. Among these characteristics, data size is the most distinctive feature of big data analytics [18]. Big data analytical tools and techniques [19] are very useful for identifying sensitive data, because they use parallel and distributed computing to process voluminous data [20]. To prevent unexpected events that may occur in the future, security agencies must stay one step ahead of terrorists [21]. Criminals make plans and sometimes change their moves very quickly; in these scenarios, big data analytics can help identify criminal activities even when plans and moves change [22]. To predict future terrorist incidents, Toure and Gangopadhyay [23] developed a risk model for computing terrorism risk at different locations, while Nie and Sun [24] carried out systematic research on counter-terrorism using quantitative analysis, demonstrating the impact of big data in fighting terrorism through data collection, processing, monitoring, and warning. However, these approaches are inefficient for processing the huge and exponentially growing amount of data, and it is also difficult for traditional statistical methods of computation to capture, store, process, and analyze such data within a given time frame [25,26].
To process massive unstructured data sets, parallel and distributed computing frameworks based on Hadoop clusters are widely used together with Yet Another Resource Negotiator (YARN), which was introduced by the Apache Software Foundation in Hadoop 2.0 [27]. YARN uses container-based resource partitioning to assign resources to the subdivided units of computation. Using Hadoop, Hu et al. [28] designed and implemented a job scheduler that, without prior knowledge of job sizes, effectively mimicked shortest-job-first scheduling. This scheduler relied on multiple-level priority queues, where jobs were moved to lower-priority queues once the amount of service they had received reached a certain threshold. In [29], Kim et al. demonstrated the benefits of input splits in decreasing container overhead without modifying the existing YARN framework, presented a logical representation of HDFS blocks, and increased the input size of the container by combining multiple HDFS blocks. Meanwhile, Zhao and Wu [30] implemented elastic resource management for YARN to increase cluster efficiency through better use of hardware resources, employing fluctuating container sizes to match the actual resource requirements of the tasks. Although several studies have been conducted with Hadoop and YARN and have demonstrated their importance in the field, scant literature can be found on the retrieval of criminal data using YARN. Moreover, Hadoop and YARN alone cannot solve the problem of mining criminal data unless algorithmic methods are systematically implemented on Hadoop clusters.
Term Frequency–Inverse Document Frequency (TF-IDF) is a numerical method for determining the importance of a word in a collection of documents [31]. The TF-IDF weight of a term increases with the number of times it appears in a document, but decreases with the number of documents in the corpus that contain it. In [32], Yan and Zhou formulated a binary classification problem for modern aircraft maintenance systems in which faults occurring in past flights were identified from raw data, and TF-IDF was applied for feature extraction. To classify news articles in Bahasa Indonesia, Hakim et al. [33] implemented a TF-IDF classifier along with tokenization, duplicate removal, stop word removal, and supervised word removal, and achieved improved accuracy. Meanwhile, Inyaem et al. [34] employed TF-IDF to classify terrorism events using similarity measures, implementing a finite-state algorithm to find feature weights and event extraction algorithms to enhance performance. In [35], Alghamdi and Selamat applied TF-IDF and the Vector Space Model (VSM) to find topics in Arabic Dark Web sites, used the K-means clustering algorithm to group the terms, and also implemented filtering, tokenization, and stemming. To mine criminal information from online newspaper articles, Arulanandam et al. [36] employed Named Entity Recognition algorithms to identify theft-related information in sentences based on locations and achieved 80–90% accuracy, but they still relied on traditional methods of computation. Most of this research adopted the numerical TF-IDF method to determine the importance of words or to classify terrorism events, but a means of combining TF-IDF with data reduction techniques in order to reduce the size of the data and expose criminal information has yet to be reported.
Dimensionality reduction is an important data processing technique [37] for handling huge amounts of data, and singular value decomposition (SVD) [38] is one of its key methods. The complex representation of higher-dimensional data makes the decision space more complex and time-consuming in big text classification [39]. SVD is therefore an important tool for generating reduced data sets while maintaining the original information. In [40], Radhoush et al. utilized SVD and the Principal Component Analysis (PCA) algorithm to feed compressed big data into the state estimation step, showing that compressed data save time and speed up processing in distributed systems. To measure the difference between the original and distorted data sets as a degree of privacy protection, Xu et al. [38] proposed a sparsified SVD for data distortion. In [41], Feng et al. proposed an orthogonal tensor SVD method in the field of cybersecurity and cyber forensics, achieving promising results for higher-order big data reduction. In [10], Adhikari et al. implemented two different machine learning algorithms, KNN and NN, to identify sensitive data from big data with Hadoop clusters, but they only considered performance measures for mining criminal information. Cho and Yoon [42] presented an approach for identifying important words for sentiment classification using set operations, implementing truncated singular value decomposition and low-rank matrix decomposition techniques to retrieve documents and resolve issues related to noise removal and synonymy. From this research, we can see that SVD has been implemented to reduce the size of voluminous data in order to feed compressed big data into state estimation, to measure the difference between original and distorted data sets, and to resolve issues related to noise removal, but a significant literature gap remains with respect to accurately identifying criminal data in big data using Hadoop clusters to combat global terrorism.
Based on the above studies, we argue that the existing approaches are insufficient to account for the exponential growth of data, and that traditional computational methods are ineffective for retrieving criminal data from voluminous data. We further argue that an algorithmic approach should be implemented on a Hadoop cluster to achieve higher performance in the retrieval of sensitive data. Accordingly, this research aims to accurately identify criminal data within the huge mass of varied data collected from diversified fields through the use of Hadoop clusters, in order to support the criminal justice system and social justice organizations in fighting terrorist activities at a global scale. The results of this experiment were rigorously compared using the same set of hardware, software, and system configurations. Moreover, the efficacy of the experiment was tested using criminal data with respect to concepts and matching scores.
2. Research Methodology
To identify sensitive information within big data, several methods were implemented in sequence in a distributed computing environment. A systematic representation of the methodology is depicted in Figure 1. Firstly, documents containing criminal information were stored in the Hadoop Distributed File System (HDFS). To achieve parallelism and work distribution in Spark, Resilient Distributed Datasets (RDDs) were used, and the Spark API was used to read the files in a folder. To realize this parallelism, the SparkContext driver, Cluster Manager, and Worker Nodes communicated with each other. To find patterns and inferences in the data sets, we applied annotation, in which raw data were passed to an Annotation Object and, after processing with the relevant functions, annotated text was generated. In the next step, to acquire the base or dictionary form of each word from the plain text, lemmatization was implemented with the help of the Stanford CoreNLP project. At this stage, there were still several undesired terms that did not carry much meaningful information for the analysis of criminal information; therefore, to reduce the dimensional space, we applied the StopWordsRemover function available in the Spark API. Subsequently, we calculated term frequency and inverse document frequency to extract features from the raw data. For this, we employed the CountVectorizer API to find term frequencies and generate a sparse representation of each document. For quicker analysis of the data, we used the RowMatrix API to calculate the SVD and the computeSVD function to decompose the matrix into its three constituent matrices, i.e., the term matrix, the diagonal matrix, and the document matrix. Finally, we executed commands to submit the Spark jobs and extracted the criminal information.
2.1. Parallelization
A parallel processing system was devised to speed up the processing of programs by breaking single task units into multiple tasks and executing them simultaneously. In this regard, Spark is very practical for scaling up data science workloads and tasks. Using Spark libraries and data frames, the massive data sets were distributed across a cluster for parallel processing. The parallelization of algorithms is highly significant for resolving large-scale data optimization and data analytics problems. A Spark application on a cluster is graphically presented in Figure 2.
For processing, a job is submitted to the SparkContext driver program, which transfers the application to the cluster manager. The cluster manager communicates with the worker nodes to create containers that host Executors, and the Executors in turn interact with the SparkContext driver program to carry out parallel processing. The work distribution in Spark is accomplished by means of Resilient Distributed Datasets (RDDs), which can run a wide range of parallel applications and support many programming models proposed for interactive computation.
To speed up the work, we stored our data in the Hadoop Distributed File System and used the Spark API to read files and folders with sc.wholeTextFiles(). In the resulting DataFrame, the filename was adopted as the label column, while the text content was stored in the sentence column. As the DataFrame is an optimized abstraction over the RDD, it was selected as a suitable API for various machine learning algorithms.
The pseudocodes for implementing parallelization, reading the data, and converting it to a DataFrame are presented in Appendix A. The DataFrame is then distributed to Executors using the mapPartitions API, which distributes the partitions equally among the Executors for processing. Spark provides parallelism through all of its components, but to achieve our goal, we implemented annotators to find patterns and inferences in the data.
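As an illustration of this step, the following minimal sketch (not the code from Appendix A; the HDFS path and column names are assumptions) reads the corpus with wholeTextFiles() and converts it to a DataFrame with label and sentence columns. It assumes a spark-shell session in which spark and sc are already available.

// Minimal sketch: load the corpus from HDFS and build a DataFrame (path is illustrative)
import spark.implicits._

// wholeTextFiles returns an RDD of (fileName, fileContent) pairs, one record per file
val docsRDD = sc.wholeTextFiles("hdfs:///user/hadoop/crime_corpus/*.txt")

// The filename becomes the label column and the full text the sentence column,
// matching the DataFrame structure described above
val docsDF = docsRDD.toDF("label", "sentence")
docsDF.show(4)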
2.2. Annotators and Annotations
Annotation is the process of preparing data for finding patterns and inferences by adding metadata tags to web documents, social media, image data, video data, and text documents. Yang and Lee [43] proposed an effective method for recommending tags to annotate a web page and developed relationships between web pages and tags based on clustering results. In other research, Liu et al. [44] also worked on the semantic enhancement of web documents for recommendation and information retrieval systems, but they used a semi-automatic annotation system to annotate metaphors through the use of high-abstraction terms. Ballan et al. [45] reviewed state-of-the-art approaches to tag refinement for social images and also presented the localization of web video sequences and tag suggestion through automatic annotation. Beltagy et al. [46] presented a well-organized annotation system that breaks documents down into segments and maps the segment headings to concepts in order to extend the ontology. Motivated by these works, we implemented annotation to extract concepts from the huge mass of text documents. The working mechanism of the annotators is illustrated in Figure 3.
In our work, functions such as tokenize, ssplit, pos, ner, and parse were employed for tokenization, sentence splitting, part-of-speech tagging, morphological analysis, named entity recognition, and syntactic parsing of the text data. The annotated texts were generated by passing the raw text to the Annotation Object and applying all of the defined functions. The pseudocodes for tokenizing the words for lemmatization and breaking the annotated text into sentences are presented in Appendix B.
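A minimal sketch of this step is shown below (not the Appendix B code; it assumes Stanford CoreNLP is on the classpath and uses a sample sentence for illustration). It builds a pipeline with the annotators named above, passes raw text to an Annotation object, and breaks the annotated result into sentences and tokens.

// Minimal sketch: annotate raw text with Stanford CoreNLP and inspect the tokens
import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TokensAnnotation}
import scala.collection.JavaConverters._

val props = new Properties()
// ner requires pos and lemma, so lemma is included alongside the annotators listed above
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
val pipeline = new StanfordCoreNLP(props)

val doc = new Annotation("The suspects planned the bombing in the city.")  // sample text
pipeline.annotate(doc)

// Break the annotated text into sentences and print each token with its POS and NER tag
for (sentence <- doc.get(classOf[SentencesAnnotation]).asScala;
     token <- sentence.get(classOf[TokensAnnotation]).asScala) {
  println(s"${token.word()}  ${token.tag()}  ${token.ner()}")
}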
2.3. Lemmatization
With the annotated text in hand, we next turned it into a bag of terms in their base forms through morphological analysis and the elimination of inflectional endings. Words with the same meaning can take different forms; for example, "criminal" and "criminals" do not justify separate terms. Merging these different inflectional forms into a single term is called stemming or lemmatization [47]. Stemming implements a heuristic-based technique [48] that removes characters at the end of a word, while lemmatization employs principled approaches to reduce inflectional forms to a common base form [49] through recursive processing in different layers. Miller [50] extracted words from the WordNet dictionary, but this was limited to the convenience of human readers and a combination of traditional computing with lexicographic information. As our work involves extracting base terms from annotated text, the Stanford CoreNLP project [51] provided the best solution: it has an excellent lemmatizer whose Java API can be used from Scala. The pseudocodes presented in Appendix C employ the RDD of plain text documents for lemmatization; the lemmatized tokens are then converted to DataFrames for further processing.
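The following minimal sketch (not the Appendix C code) illustrates this idea under the assumptions of the earlier sketches: docsRDD holds (filename, text) pairs, spark.implicits._ is imported, and Stanford CoreNLP is available. The pipeline is created once per partition so that it is not serialized across the cluster.

// Minimal sketch: lemmatize plain text inside an RDD and convert the result to a DataFrame
import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

def lemmatize(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  for (sentence <- doc.get(classOf[SentencesAnnotation]).asScala;
       token <- sentence.get(classOf[TokensAnnotation]).asScala) {
    val lemma = token.get(classOf[LemmaAnnotation]).toLowerCase
    // keep only alphabetic lemmas longer than two characters (illustrative filter)
    if (lemma.length > 2 && lemma.forall(_.isLetter)) lemmas += lemma
  }
  lemmas
}

val lemmatized = docsRDD.mapPartitions { docs =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)   // one pipeline per partition
  docs.map { case (name, text) => (name, lemmatize(text, pipeline)) }
}
val lemmasDF = lemmatized.toDF("label", "rawTerms")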
2.4. StopWords Remover
StopWords do not carry much meaning in the text and can be removed from the text analysis process. For example, words such as "the", "is", and "a" take up space but do not carry valuable information for the model. The dimensionality of the text can be reduced by removing all StopWords using appropriate algorithms or functions. Magdy and Jones [52] applied the StopWords removal technique in information retrieval text pre-processing, but this was limited to a machine translation corpus. As our work operates in the RDD environment, we found an efficient method for removing StopWords from the text using the Spark API. The snippet used for the removal of StopWords is presented in Appendix D.
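A minimal sketch of this step (not the Appendix D snippet) is given below, assuming the lemmasDF DataFrame with a rawTerms column from the previous sketch; the extra stop words added to the default English list are purely illustrative.

// Minimal sketch: remove English stop words with the Spark ML StopWordsRemover
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("rawTerms")   // lemmatized tokens from Section 2.3
  .setOutputCol("terms")     // tokens with stop words removed

// Optionally extend the default English stop word list with corpus-specific noise words
remover.setStopWords(StopWordsRemover.loadDefaultStopWords("english") ++ Array("chapter", "page"))

val filteredDF = remover.transform(lemmasDF)
filteredDF.select("label", "terms").show(4)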
2.5. Term Frequency–Inverse Document Frequency (TF-IDF)
After the successful implementation of lemmatization and StopWords removal, we had an array of terms corresponding to each document. The next challenge was to compute the frequencies of the terms within each document, as well as within the entire corpus, so that the importance of the words could be determined. For this, we implemented Term Frequency–Inverse Document Frequency. Term frequency counts the occurrences of a term and defines its weight within a document, and it is used to extract features from the raw data. It is calculated as follows:
$$tf(t,d) = \sum_{i \in d} f(i,t),$$
where $f(i,t)$ is a function defined as
$$f(i,t) = \begin{cases} 1, & \text{if } i = t \\ 0, & \text{otherwise.} \end{cases}$$
The mathematical representation of inverse document frequency is as follows:
$$idf(t,D) = \log \frac{|D| + 1}{DF(t,D) + 1},$$
where $|D|$ is the number of documents in the corpus and $DF(t,D)$ is the number of documents $d$ in which the term $t$ is present, i.e., for which $tf(t,d) \neq 0$; 1 is added to the formula to avoid zero division.
The tf-idf measure combines term frequency and inverse document frequency, and its mathematical derivation is as follows:
$$tfidf(t,d,D) = tf(t,d) \times idf(t,D).$$
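As a brief illustration with hypothetical numbers (not drawn from the experimental corpus), suppose the term "explosion" appears three times in a document $d$ and occurs in two of the $|D| = 4$ documents in the corpus. Then, using the natural logarithm,
$$tf = 3, \qquad idf = \log\frac{4+1}{2+1} \approx 0.51, \qquad tfidf \approx 3 \times 0.51 = 1.53,$$
so a term that is frequent in one document but rare across the corpus receives a high weight.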
The CountVectorizer algorithm selects the top-frequency words to form the vocabulary and generates a sparse representation of the documents over that vocabulary [53]. CountVectorizer was chosen over HashingTF because HashingTF depends on the hashing of terms, and collisions may occur between terms while hashing. The vocabulary size is limited by the numTerms variable. In our work, we implemented the CountVectorizer API to find the term frequencies; the pseudocodes describing these steps, together with the snippets for generating a sparse representation of the documents and calculating TF-IDF, are presented in Appendix E. IDF is an estimator that takes the vectorized counts produced by CountVectorizer (the TF step) and down-weights terms that appear frequently across the corpus.
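The following minimal sketch (not the Appendix E code) shows this TF-IDF pipeline, assuming the filteredDF DataFrame with a terms column from the previous sketch; the numTerms value is illustrative.

// Minimal sketch: term frequencies with CountVectorizer, then IDF weighting
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

val numTerms = 20000

val countVectorizer = new CountVectorizer()
  .setInputCol("terms")
  .setOutputCol("termFreqs")
  .setVocabSize(numTerms)                       // keep only the top-frequency terms
val vocabModel = countVectorizer.fit(filteredDF)
val tfDF = vocabModel.transform(filteredDF)     // sparse term-frequency vectors

val idf = new IDF()
  .setInputCol("termFreqs")
  .setOutputCol("tfidfVec")
val idfModel = idf.fit(tfDF)
val tfidfDF = idfModel.transform(tfDF)          // TF-IDF weighted document vectors

val termIds: Array[String] = vocabModel.vocabulary   // index -> term lookup for later analysis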
2.6. Singular Value Decomposition (SVD)
At this point, we have constructed the term-document matrix M, which can be processed further for factorization and dimensionality reduction. The SVD takes the $m \times n$ matrix as input and returns three different matrices, i.e., the term matrix $U$, the diagonal matrix $S$, and the document matrix $V$, such that $M \approx U S V^{\top}$.
$U$ is a matrix in which the terms are represented by rows, while the columns correspond to concepts.
$S$ is a diagonal matrix whose entries represent the strength of the individual concepts. The magnitude of each singular value signifies the importance of the corresponding concept, and this importance is reflected in the variance in the data. A key insight of Latent Semantic Indexing (LSI) is that a small number of concepts is sufficient to describe the data, and the entries in the diagonal matrix $S$ are directly related to the importance of each concept.
$V$ is a matrix in which the documents are represented by rows, while the concepts are characterized by columns.
In Latent Semantic Analysis (LSA), $m$ represents the number of terms, while $n$ represents the number of documents. The decomposition is parameterized with a number $k \le n$, which indicates how many concepts should be considered during the analysis. If $k = n$, the product of the matrices reconstitutes the original matrix; however, if $k < n$, the product of the matrices generates a low-rank approximation of the original matrix. Normally, $k$ is selected to be much lower than $n$ in order to explore the concepts signified in the documents.
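In compact form (a standard LSA formulation, not taken from the paper's appendices), the rank-$k$ truncation can be written as
$$M_{m \times n} \approx U_{m \times k}\, S_{k \times k}\, V_{n \times k}^{\top}, \qquad k \le n,$$
where the $k$ retained columns of $U$ and $V$ span the strongest concepts and the diagonal of $S$ holds their weights.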
As we had RDD data to analyze, we implemented the RowMatrix API to calculate the singular value decomposition. The snippet of code is presented in Appendix F. A RowMatrix is backed by an RDD of its rows, where each row is a local vector. It is a row-oriented distributed matrix without meaningful row indices, and it was used to calculate the column summary statistics and the decomposition. The computeSVD function is a dimension reduction function used to decompose a matrix into its three constituent matrices.
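The following minimal sketch (not the Appendix F snippet) illustrates the step, assuming the tfidfDF DataFrame from the previous sketch; the value of k is illustrative. Note that because the RowMatrix rows in this sketch are document vectors, U indexes documents and V indexes terms.

// Minimal sketch: build a RowMatrix from the TF-IDF vectors and compute a truncated SVD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.ml.linalg.{Vector => MLVector}

val k = 100   // number of concepts to retain

// Convert the spark.ml vectors produced by IDF into the spark.mllib vectors RowMatrix expects
val vecRDD = tfidfDF.select("tfidfVec").rdd
  .map(row => Vectors.fromML(row.getAs[MLVector](0)))
vecRDD.cache()

val mat = new RowMatrix(vecRDD)
val svd = mat.computeSVD(k, computeU = true)

val U = svd.U   // distributed RowMatrix of document-concept weights
val s = svd.s   // vector of singular values (concept strengths)
val V = svd.V   // local matrix of term-concept weights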
2.7. Command to Execute Spark Job
To prove the efficacy of our work, we extensively evaluated the experiment against criminal data by executing the Spark job using commands. The commands for running the code are presented in Appendix G.
2.8. Running Environment
Experiments were conducted on a five-node cluster, with each node configured with an Intel® quad-core CPU @ 2.70 GHz, 8 GB RAM, and 500 GB of storage, running the Ubuntu 16.04 LTS operating system. The master and slave environments were also set up for the experiment. Four text files were collected from http://www.gutenberg.org/ and employed for processing. These files contained a variety of sensitive text data in an unstructured form. The files were "mob_rule_in_new_orleans.txt", "murder_in_the_gunroom.txt", "the_Leavenworth_case.txt", and "the_way_we_live.txt", with data sizes of 1.41 MB, 4.02 MB, 6.435 MB, and 3.88 MB, respectively. SVD is an algorithm that transforms a corpus of documents into matrix format so that searching the texts is easy and fast. Therefore, the training and test data depend on the number of terms used and the decomposition factor (k): the higher these values, the higher the volume of training and test data, and vice versa. In this experiment, validation was performed manually by inspecting the matching scores returned when certain words were searched.
4. Conclusions
Because terrorism has a negative impact on the economy, involves relationships between terrorist groups, and creates potential risks for societies at national and international levels, this paper studied methods of identifying sensitive data within big data using a distributed computing environment, annotation, StopWords removal, lemmatization, Term Frequency–Inverse Document Frequency, and Singular Value Decomposition. From the experimental results, it was found that the identification of sensitive terms was carried out with 100% accuracy, while the matching of different crime-related terms with documents was performed with 80% accuracy. Furthermore, this study was carried out using a huge volume of different types of unstructured data collected from diversified fields through the successful implementation of a Hadoop cluster. It therefore goes beyond the ability of traditional statistical methods of computation to store, capture, process, and analyze such data within a fixed time frame, and can support the criminal justice system in supervising criminal activities. Moreover, to identify criminal data within the huge mass of data, several different techniques, including parallelization, annotators and annotation, StopWords removal, lemmatization, TF-IDF, and SVD, were implemented in combination to achieve higher accuracy and performance, so that this research can help social justice organizations and the criminal justice system to minimize administrative expenses and combat terrorism at a global scale.
To identify criminal data within the huge mass of different types of data, this research could be further extended through the implementation of classification algorithms such as the support vector machine and the K-nearest neighbor algorithm in distributed computing environments in order to improve efficacy. Furthermore, the existing approach could be applied to cloud computing platforms so that performance and accuracy could also be evaluated online to detect criminal data and fight global terrorism.