1. Introduction
Over the past few decades, forensic linguistics has developed forms of language analysis that provide reliable means of identifying plagiarism. Research in forensic linguistics, the field that studies the interface between language and the law, has shown that it is possible to estimate the likelihood that two or more texts were produced independently. Such analysis can therefore serve both as an investigative tool and as evidence, not only in legal contexts but also in ethical ones [1,2,3,4,5]. Today, cases of plagiarism are reported with increasing frequency. Contributing factors include easy access to information; intense pressure to publish for academic career advancement; a lack of confidence and writing skill; and manuscripts written quickly or under stress to meet a deadline. In addition, some authors simply do not understand what plagiarism is and are unaware that copying text word for word is improper even when the original source is cited. Plagiarism detection (PD) methods look for text that is similar or identical across two or more documents [6]. Because most plagiarists reuse text from source papers and disguise it by replacing terms with synonyms, paraphrasing, and sometimes rearranging sentences, detecting plagiarism can be very difficult. This difficulty, in turn, has motivated the development of automated detection methods, and publishing houses have recently shown a strong interest in combating plagiarism [7].
Current PD approaches have shortcomings that reduce their effectiveness in detecting plagiarized text [8]: (1) Most algorithms can only identify word-for-word plagiarism, while others detect only random alterations; online PD tools fail or lose efficiency at greater degrees of obfuscation [9]. (2) Automatic translators, summarizers, and similar tools make the plagiarist's task easier. (3) Tools for detecting idea plagiarism remain ineffective [10]. (4) Most PD methods cannot detect structural alterations [11]. (5) Passage-level detection often lacks linguistic, semantic, and soft-computing support; syntactic, semantic, structural, and linguistic features must all be evaluated to reveal hidden obfuscation. (6) Finally, there are not enough benchmark data for evaluating plagiarism detection techniques [12]. Plagiarism itself takes two broad forms: (1) literal plagiarism, in which the plagiarist copies all or part of another person’s work into their own; and (2) semantic (intelligent) plagiarism, in which the content of another person’s work is taken but expressed in different words.
Plagiarism can be as simple as copying and pasting or as complex as rewording the original text; see [8] for more details. Based on the languages involved, detection tasks can be divided into two basic types: monolingual and cross-lingual (CL) [13,14]. There are few methods for detecting CL plagiarism because it is hard to measure the closeness of two text segments written in different languages [14]. Monolingual plagiarism detection, in contrast, deals with pairs of documents written in the same language, such as English–English, and this type of approach constitutes the vast majority [14]. Detection may be further subdivided into intrinsic and extrinsic types, depending on whether external references are used. Intrinsic detection analyzes a document in isolation, identifying potentially suspicious passages based only on linguistic features such as authorial style, paragraph structure, and section formulation [8]. In extrinsic detection, the suspect document is compared against a database or collection of source documents.
Optimization is an active area of research. In general, there are two families of optimization methods, deterministic and stochastic, each with its own advantages and drawbacks [15]. In deterministic methods, the initial parameter values and conditions completely determine the model’s output, whereas stochastic methods incorporate some randomness [16]. Although various stochastic approaches have been developed, such as swarm intelligence, genetic algorithms are increasingly popular for solving complex, large-scale optimization problems [17]. The quantum genetic algorithm (QGA) is an evolutionary algorithm that combines quantum computing concepts with conventional genetic algorithm techniques. It can solve the same classes of problems as the traditional genetic algorithm but converges considerably faster, because quantum parallelism and the entanglement of quantum states accelerate the evolutionary process. By combining the probabilistic mechanism of quantum computing with an evolutionary algorithm, a global search can be performed with rapid convergence and a small population size. These methods have proven effective in a broad range of combinatorial and functional optimization problems [18,19,20].
1.1. Problem Statement
Placing plagiarism in a legal context is nonetheless difficult, because strong proof is needed that a suspicious text has in fact been copied. When text is copied and pasted word for word, comparing the suspect text with the possible source and identifying the overlap is usually sufficient. Most cases, however, are far more complicated: new detection methods prompt new evasion strategies, which in turn demand new detection methods. Plagiarism means passing off someone else’s work as one’s own without giving credit, and it covers a wide range of behavior, from copying another author’s words to appropriating their ideas. Recently, many PD approaches based on semantic similarity and sentence-based concept extraction have been proposed that can help uncover paraphrasing. To detect instances of plagiarism, several algorithms analyze the document’s semantic content together with factors such as the author’s writing style, paragraph structure, and section arrangement. Even so, these techniques cannot reliably catch obfuscated plagiarism.
1.2. Contribution and Methodology
In this paper, a modified PD algorithm is utilized to detect plagiarism using the semantic concept and the QGA. Adopting the QGA inside the PD model can facilitate the optimization of a similarity search. Furthermore, the QGA is employed to find sentences that briefly show the concept of the source document. On the other hand, semantic-level concepts are captured by applying semantic similarity metrics, which depend on the WordNet database for extracting semantic information. How successfully individuals are mapped to fitness metrics is what gives the QGA its usefulness in our context. Since all quantum individuals are reduced to a single solution during the measurement of the fitness function, the benefits disappear if the mapping is one-to-one. More individual-to-fitness mappings mean a higher potential diversity benefit for the QGA.
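For illustration only, the snippet below shows the kind of WordNet-based word-to-word scoring that such semantic similarity metrics build on. It is a minimal sketch using NLTK's WordNet interface and the Wu–Palmer measure; the naive sentence-level aggregation is an assumption made for demonstration and is not the exact metric used in the proposed model.

# Minimal sketch of WordNet-based semantic similarity (illustrative only).
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    # Best Wu-Palmer similarity over all synset pairs of two words (0 if none).
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_similarity(sent_a, sent_b):
    # Average, for each word in sent_a, its best match in sent_b (naive aggregate).
    tokens_a, tokens_b = sent_a.lower().split(), sent_b.lower().split()
    best = [max(word_similarity(a, b) for b in tokens_b) for a in tokens_a]
    return sum(best) / len(best)

print(sentence_similarity("students copy published papers",
                          "pupils duplicate printed articles"))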
The remainder of this paper is organized as follows: Section 2 briefly discusses some background on quantum genetic algorithms. Section 3 provides a literature review of publications relevant to the PD framework. Section 4 presents the suggested approach. Section 5 reports the assessment of the suggested technique, including results and discussion. Section 6 concludes the study and discusses possible future directions.
2. Preliminaries
In this section, we will go through the fundamental concepts of quantum genetic algorithms that will be used in the proposed framework. Primarily, evolutionary algorithms (EAs) are stochastic search and optimization techniques inspired by the concepts of natural biological evolution. EAs have many advantages over more conventional optimization techniques, including their scalability, versatility, and independence from domain-specific heuristics. However, it is challenging to incorporate the characteristics of population diversity and selection pressure concurrently into EAs such as the genetic algorithm (GA). As selection pressure rises, the search narrows in on the best individuals in the population, but the resulting exploitation reduces genetic variety. The reason for this is that deterministic values are used in the representations of EAs [20,21].
QGAs are a hybrid of conventional GAs and quantum algorithms. The superposition of quantum mechanical states, or “qubits”, is their primary foundation. Here, instead of being represented as a binary string, for example, chromosomes are vectors of qubits (quantum registers). This means that a chromosome may stand in for a superposition of all possible states. The QGA is distinguished by its simultaneous capacity for quick convergence and global search. Quantum computing concepts and principles like qubits and a linear superposition of states form the basis of the QGA [22,23]. One way to express the state of a qubit is as follows:

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1$$

The probabilities of the qubit being in the ‘0’ and ‘1’ states are specified by $|\alpha|^2$ and $|\beta|^2$, respectively, where $\alpha$ and $\beta$ are complex numbers describing the probability amplitudes of the two states. A system of m qubits may store information on $2^m$ states simultaneously. However, a quantum state collapses to a classical one upon observation [24]. For m qubits, the representation is:

$$\begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m \\ \beta_1 & \beta_2 & \cdots & \beta_m \end{bmatrix}, \qquad |\alpha_i|^2 + |\beta_i|^2 = 1, \;\; i = 1, 2, \ldots, m.$$

Consider a three-qubit system with three pairs of amplitudes $(\alpha_1, \beta_1)$, $(\alpha_2, \beta_2)$, and $(\alpha_3, \beta_3)$:

$$\begin{bmatrix} \alpha_1 & \alpha_2 & \alpha_3 \\ \beta_1 & \beta_2 & \beta_3 \end{bmatrix}$$

The current system state may be represented by:

$$|\psi\rangle = \alpha_1\alpha_2\alpha_3|000\rangle + \alpha_1\alpha_2\beta_3|001\rangle + \alpha_1\beta_2\alpha_3|010\rangle + \alpha_1\beta_2\beta_3|011\rangle + \beta_1\alpha_2\alpha_3|100\rangle + \beta_1\alpha_2\beta_3|101\rangle + \beta_1\beta_2\alpha_3|110\rangle + \beta_1\beta_2\beta_3|111\rangle$$
This allows for eight possible states of information storage inside the three-qubit machine. Evolutionary computing with a qubit representation offers greater diversity than conventional approaches, since it can express a superposition of states. While a classical representation needs at least eight binary chromosomes to hold the same information, a single qubit chromosome is sufficient to represent these eight states. Convergence may also be attained with the qubit format: the qubit chromosome converges to a single state and loses its distinctive diversity as $|\alpha_i|^2$ or $|\beta_i|^2$ approaches 1 or 0. Therefore, the qubit representation offers both exploration and exploitation properties [24]. The structure of the QGA is described in Algorithm 1 [21,24].
Algorithm 1: QGA Procedure
Begin
  t ← 0; initialize the qubit population Q(t)
  make the binary population P(t) by observing the states of Q(t)
  evaluate P(t) and store the best solution b
  While (the termination condition is not met) do
    t ← t + 1
    make P(t) by observing the states of Q(t − 1)
    evaluate P(t)
    update Q(t) by applying the quantum rotation gates U(t)
    store the best solution b among P(t) and the previous b
  End
End
The QGA maintains a population of qubit chromosomes, $Q(t) = \{q_1^t, q_2^t, \ldots, q_n^t\}$, at generation t, where n is the population size, m denotes the total number of qubits and indicates the string length of the qubit chromosome, and $q_j^t$ ($j = 1, 2, \ldots, n$) is a qubit chromosome defined as:

$$q_j^t = \begin{bmatrix} \alpha_1^t & \alpha_2^t & \cdots & \alpha_m^t \\ \beta_1^t & \beta_2^t & \cdots & \beta_m^t \end{bmatrix}$$

Observing the states of $Q(t)$ produces a set of binary solutions $P(t) = \{x_1^t, x_2^t, \ldots, x_n^t\}$, where $x_k^t$ is the k-th state represented by the binary string $x_1 x_2 \cdots x_m$, each $x_i$ ($i = 1, 2, \ldots, m$) is either 0 or 1, and $\theta$ is the rotation angle used by the update operator. The effectiveness (fitness) of each solution is ranked. Then, among the available binary solutions, the best one, b, is chosen as the best possible starting point and saved. The update operator $U(t)$ uses the binary solutions and the best-stored solution to construct an updated population, which is then processed via the relevant quantum rotation gates $U(\theta)$. To solve real-world issues, we may tailor the design of quantum gates to meet specific needs.
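To make Algorithm 1 and the notation above concrete, the following minimal Python sketch runs a QGA on a toy one-max problem. The fitness function, population size, rotation angle, and the simplified rotation rule are illustrative assumptions rather than the settings of the proposed PD model.

# Bare-bones QGA sketch on a toy one-max fitness (illustrative values only).
import math, random

N, M, GENERATIONS = 10, 16, 50            # population size, qubits per chromosome, iterations
DELTA = 0.05 * math.pi                    # rotation angle step (illustrative value)

def observe(q):
    # Collapse a qubit chromosome [(alpha, beta), ...] into a binary solution.
    return [1 if random.random() < beta ** 2 else 0 for _, beta in q]

def fitness(x):
    return sum(x)                         # toy one-max objective: count the 1s

def rotate(q, x, best):
    # Simplified rotation-gate update: when a bit disagrees with the best solution,
    # rotate its amplitudes toward the best bit (the full QGA uses a lookup table
    # that also compares the fitness of x with that of b).
    new_q = []
    for (alpha, beta), xi, bi in zip(q, x, best):
        theta = 0.0 if xi == bi else (DELTA if bi == 1 else -DELTA)
        new_q.append((alpha * math.cos(theta) - beta * math.sin(theta),
                      alpha * math.sin(theta) + beta * math.cos(theta)))
    return new_q

# Q(0): every qubit starts in an equal superposition, alpha = beta = 1/sqrt(2).
Q = [[(1 / math.sqrt(2), 1 / math.sqrt(2)) for _ in range(M)] for _ in range(N)]
best = max((observe(q) for q in Q), key=fitness)

for t in range(GENERATIONS):
    P = [observe(q) for q in Q]                       # P(t): observed binary solutions
    best = max(P + [best], key=fitness)               # store the best solution b
    Q = [rotate(q, x, best) for q, x in zip(Q, P)]    # update Q(t) with rotation gates

print("best solution:", best, "fitness:", fitness(best))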
3. The State of the Art
Plagiarism often falls into one of three categories: (1) If the original texts are available, the study centers on comparing the suspect text(s) to the potential originals to uncover linguistic evidence to infer that the suspect text is truly a derivative or original; (2) if the source texts are unknown but plagiarism is suspected, the analysis focuses on determining whether the material in question is plagiarized or not based on its inherent stylistic evidence; or (3) if two or more texts are suspected of joint rather than individual composition, the linguistic study will center on determining whether any probable overlap between the texts is coincidental or the consequence of collaboration. Therefore, linguistic studies seek to determine whether instances of textual overlap across various papers are suggestive of plagiarism and if such overlap constitutes fraudulent behavior [1,2,3,4,5].
To aid in the building of the suggested model, this section discusses a few related PD models and plagiarism prevention efforts from the cited literature.
Figure 1 shows the taxonomy of the existing PD models. In Ref. [25], the authors developed an approach based on Semantic Role Labeling (SRL) to determine the semantic similarity between texts. All of WordNet’s concepts were combined into one node, called the “topic signature node”, which instantly captures suspicious elements from documents. This method identifies copy–paste and semantic plagiarism, synonym substitution, phrase restructuring, and passive-to-active voice changes. However, since not all arguments affect the PD process, a fuzzy inference system should be used to weight the arguments and thereby improve the similarity score.
In Ref. [7], the authors studied sentence ranking for PD together with SRL. Vectorizing the material generates suspicious and original sentence pairings. Pre-processing, candidate retrieval, sentence ranking, SRL, and similarity detection are the five stages of the approach. The proposed technique leverages SRL to determine the semantic function of each word in a sentence based on its verb, which depends on the word’s semantic meaning. The algorithm recognizes copy–paste, close copy, synonym substitution, phrase reordering, and active/passive voice conversion faster and more accurately. It remains unclear what degree of syntax is required for a thorough analysis of semantic roles and how the state of the art constrains SRL tagging and parsing performance.
In Ref. [26], the semantic and syntactic relationships between words are integrated. This strategy improves PD because it avoids selecting source sentences that have high surface similarity to suspect sentences but a dissimilar meaning. It can identify copied text, paraphrases, sentence translations, and word-structure changes; however, this approach cannot discriminate between active and passive sentences. In Ref. [27], the authors suggested a fuzzy semantic-based similarity approach for detecting obfuscated plagiarism. After feature extraction, the text characteristics are fed into a fuzzy inference system, which models semantic similarity as a membership function. Once the rules have been evaluated, the results are averaged to obtain a single score that indicates how similar two texts are. The technique detected both literal and disguised plagiarism, but the system cannot generalize and is not resilient to topological changes: such modifications require rule-based adjustments and an expert to develop the inference rules.
Another approach, suggested in [28], treated document-level text PD as a binary classification problem. The original source of a document was identified, and that information was used to determine whether the document in question contained plagiarized content. The main parts are feature extraction, feature selection, and classification using machine learning. After pre-processing and filtering, part-of-speech (POS) tags and chunks were used to remove extraneous data. The method investigated the influence of plagiarism categories and complexity on attributes and behavioral variances. The lack of a large database of manually created plagiarism instances is a concern; creating one is therefore necessary for testing detection techniques.
The work in [8] presented another effort to identify plagiarism. The described study explores GA-based syntax–semantic concept extraction to detect idea plagiarism. Pre-processing, GA source-sentence extraction, document-level detection, and passage-level detection are the four major components. Natural language processing (NLP) approaches are utilized for word-level extraction within documents. Sentence-based comparisons employing integrated semantic similarity metrics are used in the passage-level identification step, and passage boundary conditions determine the detected passages. The offered technique emphasizes idea plagiarism committed through summarization. The results demonstrated substantial performance in catching plagiarized texts; however, plagiarism may also occur via elaboration, paraphrasing, and similar strategies, which the system cannot detect.
In order to find instances of plagiarism, the study in [29] constructed a cutting-edge system that relies on semantic properties. For each possible suspect and source phrase combination, the system generates a relation matrix that uses semantic characteristics to calculate the level of similarity. This study presents two algorithms, weighted inverse distance and gloss Dice, that capture different text qualities (e.g., synonyms) and develops a novel similarity metric for plagiarism detection, which overcomes the limits of the existing features. In addition, this study examines the efficacy of individual characteristics in identifying copied works, combining the most effective ones by giving varying weights to their individual contributions to further improve the system’s performance. The inverse weighted distance functions have a drawback in that the function must have a maximum or minimum at the data points (or on a boundary of the study region).
The study given in [30] outlines a three-stage process that, taken together, provides a hybrid model for intelligent plagiarism detection: first, the data are clustered; then, vectors are created inside each cluster according to semantic roles, the data are normalized, and a similarity index is computed; and lastly, an encoder–decoder provides a summary. For choosing the words used to produce the vectors, K-means clustering computed over the synonym set is proposed. The next semantic argument is evaluated only if the value estimated in the previous stage exceeds a threshold. A brief description is generated for plagiarized documents whose similarity score is high enough. The experimental results demonstrated the effectiveness of the strategy in identifying not only literal but also connotative and concealed forms of concept copying. However, long sequences take a long time to process because of the slowness of the neural network and the difficulty of training it when activation functions are used; it also suffers from problems such as vanishing and exploding gradients.
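The clustering stage described above can be sketched roughly as follows; plain TF-IDF vectors stand in here for the synonym-set-based vectors of the cited work, so this is only an assumed simplification of the first stage.

# Skeletal K-means clustering of sentences (TF-IDF stands in for the
# synonym-set vectors used in the cited work; data are toy examples).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat", "a cat rested on a rug",
             "stock prices rose sharply", "the market rallied today"]
X = TfidfVectorizer().fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # sentences grouped before the per-cluster similarity step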
In Ref. [31], the authors introduced an efficient method for determining the structural and semantic similarity between two publications by analyzing only a subset of the material of each document instead of the whole thing. To improve plagiarism detection regardless of word-order changes, a collection of remarkable keywords and different combinations of them are used to compute similarity. The importance of a word varies depending on where in the article it appears. As a final step, a weighted similarity is determined using an AHP (Analytical Hierarchy Process) model. The suggested method was shown to outperform its competitors in terms of runtime and accuracy when detecting semantic academic plagiarism. One potential drawback of the AHP is the high number of pairwise comparisons it requires, since each criterion, and then each option with regard to a given criterion, must be compared.
In Ref. [32], the authors offered an approach to detecting two common forms of paraphrased text: those involving synonym substitution and those involving the reordering of words in plagiarized sentence pairs. They introduced a three-stage technique that uses context matching and pretrained word embeddings to detect instances of synonym replacement and word reordering. Their experiments revealed that the Smith–Waterman method for plagiarism detection combined with ConceptNet Numberbatch pretrained word embeddings yields the highest scores. Methods from this study for determining paraphrase styles may be used to supplement the similarity reports of existing plagiarism detection systems. Even though it is the most sensitive technique for detecting sequence similarity, the Smith–Waterman approach does not come without a price: time is a major restriction, as conducting a Smith–Waterman search requires a lot of processing power.
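For reference, Smith–Waterman is a dynamic-programming local alignment. The sketch below is a minimal word-level version with illustrative scoring values; a real PD system would also score near-matches, for example via embeddings, rather than exact word equality.

# Minimal word-level Smith-Waterman local alignment (illustrative scoring values).
def smith_waterman(words_a, words_b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(words_a) + 1, len(words_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if words_a[i - 1] == words_b[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment: never below zero
                          H[i - 1][j - 1] + s,    # align the two words
                          H[i - 1][j] + gap,      # gap in words_b
                          H[i][j - 1] + gap)      # gap in words_a
            best = max(best, H[i][j])
    return best                                   # score of the best local alignment

src = "the quick brown fox jumps over the lazy dog".split()
sus = "a quick brown fox leaps over a lazy dog".split()
print(smith_waterman(src, sus))                   # a high score signals a reused passage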
Two methods for identifying external plagiarism are provided in [33]. Both methods use a bag-of-words, two-stage filtering procedure, first at the document level and then at the sentence level, to reduce the search space; only the outputs of both filters are then evaluated for plagiarism. One method uses the WordNet ontology and the term frequency–inverse document frequency (TF-IDF) weighting technique to create structural and semantic matrices; the other uses a pre-trained fastText word-embedding network together with TF-IDF weighting to create the same outcome. After forming these matrices, the structural similarity of the weighted composition and the Dice similarity are used to determine the degree of similarity between the pairs of matrices representing each phrase. The similarity between the suspect text and a minimum criterion is used to classify documents as plagiarized or non-plagiarized. Using the PAN-PC-11 database, the authors conducted experiments to determine whether a word-embedding network would be more successful than the WordNet ontology in detecting instances of extrinsic plagiarism. However, TF-IDF weighting does have certain restrictions: it may be time-consuming for large vocabularies, since it calculates document similarity directly in the word-count space, and it assumes that evidence of similarity can be found in the counts of shared terms. A further potential problem with the layout described above is that WordNet’s meaning and scope may quickly diverge from the intended ones; there is no guarantee that the same relationships are encoded or that the same conceptual ground is covered [34,35].
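As a rough illustration of the two-stage filtering idea (not the exact matrices of [33]), the sketch below uses scikit-learn TF-IDF weighting with cosine similarity for the document-level filter and a simple Dice coefficient at the sentence level; the example texts are made up.

# Illustrative two-stage filter: TF-IDF cosine at the document level,
# Dice overlap at the sentence level (toy texts).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dice(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

sources = ["plagiarism detection compares a suspicious document with source documents",
           "quantum genetic algorithms combine qubits with evolutionary search"]
suspect = "the suspicious document is compared against every source document"

vec = TfidfVectorizer()
X = vec.fit_transform(sources + [suspect])
doc_scores = cosine_similarity(X[-1], X[:-1]).ravel()   # stage 1: document-level filter
candidate = sources[doc_scores.argmax()]

# stage 2: sentence-level Dice similarity on the surviving candidate
print(doc_scores, dice(candidate.split(), suspect.split()))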
In Ref. [36], the authors created a new database that contains all the characteristics that indicate various linguistic similarities. As a solution to textual plagiarism issues, the developed database is offered for use in intelligent learning. The produced database is then used to propose a deep-learning-based plagiarism detection system. During development, many deep learning techniques, including convolutional and recurrent neural network topologies, were taken into account. To assess the efficacy of the presented intelligent system, comparative research was conducted using the PAN 2013 and PAN 2014 benchmark datasets. In comparison with state-of-the-art systems, the test findings demonstrated that the suggested system based on long short-term memory (LSTM) ranked first. However, LSTMs are easy to overfit and are sensitive to different random weight initializations.
Using the fuzzy MCDM (multi-criteria decision-making) technique, the research in [37] compared and contrasted many academic plagiarism detection strategies and offered guidelines for creating effective plagiarism detection tools. The authors described a framework for ranking evaluations and analyzed cutting-edge methods for detecting plagiarism that may be able to overcome the limitations of the software currently available. In this way, the research can be seen as a “blueprint” for developing improved plagiarism detection systems. An innovative technique known as compressive-sensing-based Rabin–Karp is offered for use in the system presented in [38]. This technique calculates both syntactic and semantic similarities between documents, using a sampling module to shrink the dataset and a cost function to identify document repetition. Yet, simply applying the hash function based on the generated table may result in cases where the hash codes for the pattern and the text are the same even though the pattern’s characters do not match those in the text. For current surveys that include the most up-to-date research in the plagiarism detection area, please refer to [39,40].
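The collision caveat can be seen in a minimal Rabin–Karp matcher such as the sketch below (character-level, with an illustrative base and modulus); the explicit comparison in the loop is exactly the verification step that rules out spurious hash matches.

# Minimal Rabin-Karp substring search (illustrative base/modulus); the explicit
# character comparison guards against the hash collisions mentioned in the text.
def rabin_karp(pattern, text, base=256, mod=1_000_003):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)                  # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):                            # hashes of pattern and first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:   # verify to reject collisions
            hits.append(i)
        if i < n - m:                             # roll the window one character forward
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("copy", "students copy, then copy again"))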
A novel plagiarism detection approach is presented in [41] to extract the most useful sentence-similarity features and build a hyperplane equation over the chosen features to accurately identify similarity scenarios. The first phase, which contains three steps, pre-processes the documents. The second phase relies on two different strategies: the first is a standard paragraph-level comparison, while the second uses the hyperplane equation calculated with the Support Vector Machine (SVM) and Chi-square methods. The best plagiarized segment is extracted in the third phase. On the whole test corpus of the PAN 2013 and PAN 2014 datasets, the recommended approach attained the best values of 89.12% and 92.91% for the Plagdet scores and 89.34% and 92.95% for the F-measure scores, respectively.
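The hyperplane-plus-feature-selection idea can be outlined as in the following sketch; the similarity features, labels, and scikit-learn components are placeholders, not the actual feature set or training data of [41].

# Illustrative chi-square feature selection feeding a linear SVM (placeholder data).
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Each row: non-negative similarity features for a sentence pair
# (e.g., word overlap, synonym overlap, order similarity); label 1 = plagiarized.
X = np.array([[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.8, 0.6, 0.9], [0.1, 0.2, 0.2]])
y = np.array([1, 0, 1, 0])

model = make_pipeline(SelectKBest(chi2, k=2), LinearSVC())
model.fit(X, y)
print(model.predict([[0.85, 0.7, 0.8]]))   # classify a new sentence pair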
The plagiarism detection solutions currently on the market compare documents only when the input contains text, even though a number of tools address plagiarism using various methodologies and features. When the input document is an image, the techniques currently in use do not check for plagiarism. The authors in [42] suggested a tool that searches both the text and the text hidden in images using an exhaustive searching approach; the suggested tool compares the input document’s content to that of websites and returns findings on how similar they are. When the source and suspect papers are in two different languages, it is difficult to identify cross-lingual plagiarism (CLP), and a number of solutions to the problem of CLP detection in text documents have been proposed. To obtain comparability metrics, the authors in [43] employed the one-gram and tri-gram representations of the pre-processed text. The models are constructed using five ML classifiers: KNN, Naive Bayes, SVM, Decision Tree, and Random Forest. The experiments demonstrate that the KNN and Random Forest models offer superior outcomes compared with the other classifiers.
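A rough outline of such an n-gram-plus-classifier setup is sketched below with two of the listed classifiers; the toy sentence pairs and labels are invented for illustration, and a real cross-lingual experiment would require a properly aligned corpus.

# Illustrative outline: word n-gram features with two of the classifiers mentioned
# in the text (toy data; a real CLP experiment needs an aligned corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pairs = ["suspicious and source sentence concatenated ...",   # toy stand-ins
         "two unrelated sentences placed side by side ...",
         "another reused passage and its translation ...",
         "independent sentences with no overlap ..."]
labels = [1, 0, 1, 0]                                          # 1 = plagiarized pair

for clf in (KNeighborsClassifier(n_neighbors=1), RandomForestClassifier(n_estimators=50)):
    model = make_pipeline(CountVectorizer(ngram_range=(1, 3)), clf)
    model.fit(pairs, labels)
    print(type(clf).__name__, model.predict(["a reused passage and its translation ..."]))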
Commercial plagiarism detection tools are accessible online for purchase or subscription; EVE2, PlagAware, WriteCheck, Turnitin, and iThenticate are some of the most well known [44]. Turnitin is an online similarity detection service that compares submitted papers against various databases using a proprietary algorithm to check for possibly plagiarized material. In addition to scanning its own databases, it has licensing arrangements with significant private academic databases. Turnitin does not deal with the causes of academic integrity problems, and so it does not fix them; instead, it may give students the impression that they are being held accountable for cheating from the very first day of class or that their work is being used against them and others without their permission. iThenticate is a plagiarism prevention tool that checks written material (such as journal article manuscripts, proposals, research reports, theses, and dissertations) against millions of published works accessible online and via paid databases. Among its benefits, iThenticate is widely regarded as a strong tool for detecting plagiarism in academic writing, employing advanced algorithms to evaluate submitted text against a huge library of scholarly publications.
Despite decades of study, PD might be strengthened to better prevent intellectual property theft. Still, PD should account for things like running time and computational complexity. The available PD approaches are not all suitable to be employed in all applications. To address these issues and outperform competing methods, a model combining semantic idea extraction and the QGA for optimizing similarity search has been proposed. The QGA is structurally similar to classical genetic algorithms, with the exception that quantum gates and quantum superposition are used to construct the initial and updated populations, with consideration given to the adaptation of such operators to meet GA-based PD issues. One clear benefit of a QGA is that its population tends to be more diverse than that of a non-QGA. To put it another way, a quantum population may be exponentially greater than its “size” in the classical world. Only one possible solution may be represented by each individual in a classical population. Each “individual” in a quantum population is a superposition of many different possible solutions. In this sense, the population of a quantum system is far greater than that of a classical system.
6. Conclusions
From the standpoint of a forensic linguist, it is critical to determine with absolute certainty whether a text is an original or the consequence of plagiarism. Expert evidence from a forensic linguist is often required in court cases, but the field is not concerned only with the law; forensic linguists also study public-facing topics. Incorrect judgments must therefore be avoided at all costs to prevent miscarriages of justice, whether in the classroom or the courtroom. In this paper, a new approach based on the semantic similarity concept and the QGA for PD is proposed. The proposed model includes four main steps: the pre-processing and document representation module, sentence-level concept extraction using the QGA, the document-level detection phase, and the passage-level detection phase.
The semantic similarity concept, which depends on intelligent techniques, is employed to extract the concepts from documents effectively and thereby enhance the model’s performance. The QGA is employed to find relatedness between sentences that briefly convey the concept of the source document, improving the model’s processing time. The proposed PD solution has the advantage of detecting plagiarized ideas in documents presented via summarization.
The proposed model was evaluated using samples from benchmark datasets. Based on the obtained results, the proposed plagiarism detection model shows excellent performance in terms of accuracy. It has been compared with the HGA-based and GA-based PD models and achieved better results than both. The QGA has been shown to provide better accuracy without adding any complexity to the model. The solution’s shortcomings, such as WordNet’s inability to measure all possible semantic relationships between words, reduce its efficiency. Despite the method’s general effectiveness, there are other ways of realizing idea plagiarism, such as paraphrasing and elaborating on concepts, that it does not yet cover. Possible future work includes using a different database to determine how closely related terms are semantically, as well as comparing different QGA strategies to study the effect of choosing the rotation-gate angles. Another perspective of this work is to study parallel QGAs, since QGAs are highly parallelizable.