This section discusses the outcomes of the systematic analysis in four parts: “Article Selection”, “Research Questions Findings”, “Quality Assessment”, and “Research Trends and Key Findings”. “Article Selection” describes the step-by-step process used to select articles, including the academic databases consulted, the filtering process, and the methodology employed. “Research Questions Findings” analyzes the selected articles in light of the defined research questions. “Quality Assessment” evaluates the quality of the included studies using established criteria and discusses both the qualitative and the quantitative analysis of the selected articles. Finally, “Research Trends and Key Findings” highlights emerging trends and significant insights drawn from the examination of the selected studies.
4.2. Research Questions’ Findings
RQ1: What innovative patent retrieval techniques have been developed and utilized for effective patent analysis?
A wide range of techniques have been proposed in the literature, focusing on either general patent retrieval or retrieval specifically aiding prior art searches. These patent retrieval techniques can be divided into five (5) main categories:
Bibliometric-based methods
Bibliometric-based methods, in the context of patent retrieval, leverage patent bibliographic data and citation data to enhance the effectiveness of the patent retrieval and analysis process. Bibliographic data include the invention’s title, inventor(s), assignee(s), filing date, publication date, priority date, and patent classification codes, whereas citation data comprise information on the references cited within a patent document (backward citations) and references that cite the given patent document (forward citations) [39]. Extracting citations from patent manuscripts is challenging due to the lack of a standard format for patent references [8]. For citation analysis, citation graphs and temporal patterns are commonly employed.
- a. Citation Graph/Network: A graphical representation of the relationships between patent documents, built by analyzing their citations. This method aids in comprehending patterns, influences, and trends within a specific area of study [39,40].
- b. Temporal Patterns: A type of citation analysis that utilizes backward citations to show how patents cite prior works over time, providing insight into technological advancements in a particular field [41].
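As an illustration of the two citation-analysis ideas above, the following minimal Python sketch inverts backward citations into forward citations (the edges of a citation graph) and computes citation lags as a simple temporal pattern. The patent identifiers, filing years, and function names are hypothetical and are not taken from any of the surveyed studies.

```python
from collections import defaultdict


def build_forward_citations(backward):
    """Invert backward citations (citing -> cited) into forward citations (cited -> citing)."""
    forward = defaultdict(list)
    for citing, cited_list in backward.items():
        for cited in cited_list:
            forward[cited].append(citing)
    return dict(forward)


def citation_lags(backward, filing_year):
    """Temporal pattern: years between each citing patent and the prior work it cites."""
    return [
        filing_year[citing] - filing_year[cited]
        for citing, cited_list in backward.items()
        for cited in cited_list
    ]


# Hypothetical toy data: patent id -> list of backward citations.
backward = {"P3": ["P1", "P2"], "P4": ["P1", "P3"], "P5": ["P3"]}
filing_year = {"P1": 2001, "P2": 2003, "P3": 2006, "P4": 2010, "P5": 2012}

forward = build_forward_citations(backward)
lags = citation_lags(backward, filing_year)
print(forward)       # forward citations per cited patent
print(sorted(lags))  # citation lags in years
```

A patent with many forward citations (here, the hypothetical P1 and P3) would be a candidate for an influential document, while the distribution of lags indicates how far back a field tends to cite.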
General Information Retrieval-Based Techniques
General information retrieval-based techniques employ principal Information Retrieval (IR) methods to search and retrieve patent documents from patent databases using methods such as keyword-based searching, Boolean queries, and relevance ranking algorithms. These include:
- a. Collection Selection Method: This method helps to decide which subset of patent documents or patent data sources should be employed during the search or analysis. The selection of pertinent data sources is based on predefined parameters such as relevance to the topic and data quality [42,43];
- b. Federated Search: It enables simultaneous search through several patent databases, consolidating the outcomes into one interface for swift retrieval. This method accelerates the overall information retrieval process and saves time on searches [44,45];
- c. Query Expansion/Construction: It is a process of refining (adding or removing terms to fine-tune), expanding (adding terms to broaden the scope), or augmenting (adding terms to enrich coverage of a topic) the original search query to increase the inclusiveness and relevance of search outcomes [39,46,47,48];
- d. Graph-Based Techniques: These techniques leverage graph theory to represent and examine patent documents as a network of connected nodes, enhancing context-aware search, retrieval, and analysis [39,48,49];
- e. Dynamic Ranking: This method iteratively re-ranks search results to better reflect shifting user preferences and situations. Used by recommendation engines and search engines, dynamic ranking adaptively reorders search results to provide users with the most current and relevant content [50,51].
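To make query expansion and Boolean query construction concrete, the sketch below expands each query term with synonyms and joins the result into a Boolean query string. The synonym dictionary, the function names, and the query syntax are assumptions chosen for illustration; real patent search systems each have their own query language.

```python
def expand_query(terms, synonyms):
    """Return one group per original term: the term followed by its synonyms."""
    groups = []
    for term in terms:
        group = [term]
        for candidate in synonyms.get(term, []):
            if candidate not in group:  # avoid duplicate variants
                group.append(candidate)
        groups.append(group)
    return groups


def to_boolean_query(groups):
    """OR the variants inside each group and AND the groups together."""
    clauses = []
    for group in groups:
        # Quote multi-word phrases so they are treated as a unit.
        quoted = ['"{}"'.format(t) if " " in t else t for t in group]
        clause = " OR ".join(quoted)
        clauses.append("({})".format(clause) if len(quoted) > 1 else clause)
    return " AND ".join(clauses)


# Hypothetical synonym dictionary; in practice it might come from a thesaurus
# or from terms mined from classification definitions.
synonyms = {"battery": ["accumulator", "energy storage cell"]}
groups = expand_query(["battery", "electrode"], synonyms)
boolean_query = to_boolean_query(groups)
print(boolean_query)
```

Expanding with OR broadens recall for each concept, while the AND between concepts keeps the query focused on documents that mention all of them.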
Machine Learning-Based Techniques
Machine learning-based techniques, in the context of patent retrieval, recognize patterns and predict document relevance to enhance patent document search, categorization, and analysis [52].
- a. Recommendation-Based Retrieval: This method provides tailored recommendations to users based on their search queries by utilizing methods from information retrieval, data mining, and machine learning [41,53];
- b. Nearest-neighbor (NN) technique: This technique scans a set of patent documents to identify patents (nearest neighbors) that resemble a given query patent based on certain features. The similarity metric is typically determined by the distance between the feature vectors of the patent documents [54];
- c. Dimensionality reduction model: This process renders a more manageable dataset for research or representation by reducing high-dimensional data to a lower-dimensional space while retaining key features (important information) [55];
- d. Ensemble method: This method integrates several models to enhance the overall effectiveness of the patent retrieval system. It involves training multiple models individually and then combining their predictions to produce an improved final result [56];
- e. Fuzzy Logic: This technique facilitates adaptable decision-making in situations with incomplete information by accommodating ambiguity and imprecision in patent retrieval operations [47];
- f. Genetic Algorithm: This approach iteratively modifies search query parameters, including keyword combinations and weights, to better reflect the user’s intent and produce more effective and comprehensive search results [57,58];
- g. Heuristic Meaning Comparison: This method utilizes patterns and similarities found in patent documents to provide domain-specific insights in the form of rules to steer the search process. These approaches can be used to find synonyms, related concepts, or typical patent structures that could aid in finding more relevant patent documents [59];
- h. Deep Learning Techniques: These techniques use deep learning models, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), to identify complex structures and semantic correlations within patent documents to improve overall search results [60,61,62];
- i. Clustering: This process organizes patent documents into relevant clusters based on shared characteristics in their citation networks, metadata, or content, to aid in managing large patent databases. It also facilitates exploration and navigation within the network to locate relevant patent documents [63,64,65].
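The nearest-neighbor idea described above can be sketched in a few lines of Python. The example assumes that each patent has already been mapped to a numeric feature vector (the two-dimensional vectors and patent identifiers below are hypothetical) and uses Euclidean distance; the surveyed studies may use other features, distance functions, or approximate search structures.

```python
import math


def nearest_neighbors(query_vec, corpus, k=2):
    """Return the ids of the k patents whose feature vectors are closest to the query."""
    ranked = sorted(corpus.items(), key=lambda item: math.dist(query_vec, item[1]))
    return [patent_id for patent_id, _ in ranked[:k]]


# Hypothetical feature vectors, e.g. produced by an embedding or a reduced TF-IDF space.
corpus = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
}
query_vec = [1.0, 0.05]

neighbors = nearest_neighbors(query_vec, corpus, k=2)
print(neighbors)
```

This exhaustive scan compares the query against every document, which is only practical for small collections; large patent collections typically call for an index or an approximate method instead.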
Natural Language Processing-Based Techniques
- a. Topic Modeling: Algorithms such as Latent Dirichlet Allocation (LDA) are used to identify generic topics or concepts in a group of patent documents by examining the density of words or terms. These algorithms categorize patent documents into topic clusters using word co-occurrence patterns, facilitating easy investigation and comprehension of key topics and trends within patent documents [61,66];
- b. Contextual Modeling: These modeling techniques are employed to comprehend and generate patent textual data, analyzing technical language and domain-specific vocabulary, and to interpret search requests. This enhances query comprehension, text summarization, and relevance scoring, ultimately improving the effectiveness of patent search and retrieval processes. Contextual modeling includes approaches such as n-gram models, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer (GPT) [40,44,53,61,66,67,68,69];
- c. Semantic Analysis: This process examines the relationships between words, phrases, and sentences to comprehend and interpret the semantics (meaning) of input text. Methods used include named entity recognition, sentiment analysis, and semantic similarity [48,56,57,70,71,72];
- d. Statistical Analysis: This technique uncovers the statistical characteristics and connections found in textual data. It supports tasks such as language modeling, information retrieval, and topic extraction by analyzing statistical traits present in the data. Included techniques are Language Models (LM) using Dirichlet smoothing and Jelinek–Mercer smoothing, Term Frequency–Inverse Document Frequency (TF–IDF), Latent Dirichlet Allocation (LDA), and n-gram models such as unigrams, bigrams, and positional information [39,46,55,58,63,73,74,75,76,77,78,79,80,81,82,83];
- e. Text Processing: This process enables machines to examine text, understand its structure and meaning, and extract valuable information for complex tasks such as document classification and machine translation [51,62,84,85,86,87];
- f. Similarity Measure: This is a comparability metric used to find similar texts or documents. It encompasses methods such as approximate nearest-neighbor techniques (identifying similar items using vector representations) and Kullback-Leibler Divergence (which measures the difference between probability distributions) [54,88];
- g. Tools: Various tools like TreeTagger, OpenNLP, and Stanford CoreNLP are effective for natural language processing tasks. They offer functionalities such as tokenization, part-of-speech tagging, lemmatization, named entity recognition, sentence segmentation, and more, to conduct in-depth linguistic analysis and extract relevant information from textual data for various applications [89,90,91].
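As one concrete instance of the statistical techniques listed above, the sketch below scores documents with a unigram query-likelihood language model using Jelinek–Mercer smoothing. It follows one common convention, P(w|d) = (1 − λ)·tf(w,d)/|d| + λ·cf(w)/|C|; some formulations swap the roles of λ and 1 − λ. The toy documents, the function name, and the value λ = 0.5 are assumptions made for illustration.

```python
import math
from collections import Counter


def jm_query_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """Log query likelihood of `query` under `doc` with Jelinek-Mercer smoothing."""
    doc_tf = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0   # maximum-likelihood document model
        p_col = collection_tf[term] / collection_len      # background collection model
        p = (1.0 - lam) * p_doc + lam * p_col
        if p == 0.0:  # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical tokenized documents standing in for patent abstracts.
docs = {
    "D1": "lithium battery electrode coating".split(),
    "D2": "wireless antenna signal receiver".split(),
}
collection_tf = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection_tf.values())

query = ["battery", "electrode"]
ranking = sorted(
    docs,
    key=lambda d: jm_query_likelihood(query, docs[d], collection_tf, collection_len),
    reverse=True,
)
print(ranking)
```

The smoothing term gives a non-zero probability to query words that are missing from a document but present elsewhere in the collection, so a document is penalized rather than excluded for a single missing term.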
Semantic Analysis-Based Techniques
- a. Semantic Trees: This technique organizes patent documents in hierarchical structures to illustrate semantic relationships between them as well as to facilitate systematic traversal through related patent documents [71];
- b. Cosine Similarity: This measures and quantifies the degree of similarity between two patent documents’ vector representations. Useful for document matching, grouping, and relevance ranking in patent retrieval systems, a high cosine similarity index suggests a higher level of similarity between documents in terms of content or attributes [90].
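Cosine similarity can be illustrated with a short Python sketch. Here the vector representation is a plain term-frequency count over tokens, which is an assumption made to keep the example self-contained; the same formula applies to TF–IDF weights or dense embeddings.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:  # an empty document has no direction
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token lists standing in for two patent abstracts and an unrelated one.
doc_a = "battery electrode coating".split()
doc_b = "battery electrode separator".split()
doc_c = "antenna signal receiver".split()

print(cosine_similarity(doc_a, doc_b))  # shares two of three terms
print(cosine_similarity(doc_a, doc_c))  # shares no terms
```

Because the measure normalizes by vector length, a long description and a short abstract on the same subject can still score as similar.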
Other techniques
- a. Claim Tree Structure: This represents the hierarchical relationships, including comprising, linking, prepositions, and verbs, between the various components of a patent claim in a tree-like structure to provide a thorough understanding of the claim’s semantics and relationships. Each component of the claim can be a single word (unigram) or a phrase (n-gram) [90];
- b. Patent Ontology: This structured framework systematically arranges and represents patent-related data, such as patents, inventors, and classifications, and their relationships. It facilitates effective retrieval, analysis, and comprehension of patent data by providing a common vocabulary with shared semantics [92,93].
Table 6 illustrates the categorization of the studies according to their various techniques. It shows that 37 studies utilized general information retrieval methods, while 56 studies deployed natural language processing-based techniques. Additionally, 15 studies made use of machine learning for patent retrieval techniques, 7 studies explored bibliometric-based techniques, and 3 studies applied other techniques. It should be noted that various studies employed more than one technique to enhance the outcome of their experiments.
Further analysis with the help of Figure 4 shows that query expansion/construction is the most used approach in the information retrieval category. In the semantic analysis category, the semantic tree-based approach and the cosine similarity-based approach have each been utilized once. Furthermore, NLP-based approaches have also been widely exploited.
RQ2: How effective are the patent prior art retrieval techniques?
When undertaking a patent retrieval task with a specific patent, whether for a new application or for an innovative idea, the objective is to identify patents within a database that are pertinent to the referenced patent. This process entails a detailed examination of the patent’s claims, descriptions, and technical field to formulate search strategies capable of pinpointing similar inventions. The goal is to discover patents with shared technological, functional, or inventive traits, thereby providing a holistic view of the related prior art. Such an extensive search identifies not only exact matches but also patents with sufficient similarities to be deemed relevant, enriching the understanding of the patent environment surrounding the new invention. The aim is to maximize the identification of relevant patents (true positives, TP) while minimizing both the misidentification of irrelevant patents as relevant (false positives, FP) and the oversight of pertinent patents (false negatives, FN), so that the collection’s most relevant patents are retrieved. Recall, precision, F1-score, Mean Average Precision (MAP), and accuracy are the performance measures utilized in the literature for patent retrieval tasks. The definitions of the different parameters are given below, with Section S3.1 in the Supplementary Material providing a more thorough discussion on the topic [100,101,102,103,104].
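The measures named above can be written down compactly. The following Python sketch uses the standard definitions, precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 as their harmonic mean, together with average precision over a ranked list and its mean over several queries (MAP). The function names and the toy ranking are hypothetical; accuracy is omitted because it additionally requires the count of true negatives.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    tp = len(retrieved_set & relevant_set)   # relevant and retrieved
    fp = len(retrieved_set - relevant_set)   # retrieved but not relevant
    fn = len(relevant_set - retrieved_set)   # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant_set = set(relevant)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set) if relevant_set else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked_list, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranked result list and relevance judgments for one query patent.
ranked = ["P1", "P7", "P3", "P9"]
relevant = {"P1", "P3", "P5"}

print(precision_recall_f1(ranked, relevant))
print(average_precision(ranked, relevant))
```

Unlike precision and recall, average precision is sensitive to rank order, so it rewards systems that place relevant prior art near the top of the result list.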
Very few studies have shared comprehensive performance parameters pertinent to their proposed patent retrieval techniques. Different studies have employed different performance measures to evaluate the efficacy of their patent retrieval approaches. Some studies have focused solely on Mean Average Precision (MAP), while others have prioritized maximizing recall and enhancing precision by minimizing false positives. Additionally, several studies employ a variety of measures to provide a comprehensive evaluation of their methodologies, including precision, recall, F1 score, and others. The selection of the performance measures usually depends on the unique objectives of the research, the specific features of the dataset, and the anticipated application of the retrieval system.
Table 7 summarizes the recall, precision, MAP, F1 score, and accuracy as reported by various studies. References [50,51] reported a perfect recall rate by utilizing general information retrieval-based dynamic ranking techniques. The best precision of 99% was achieved using a nearest-neighbor-based technique [54]. The top MAP score of 92% was reported using an ML-based patent retrieval technique [54], while the highest F1 score of 95% has been obtained using an NLP-based technique [62]. Only a handful of studies have reported the accuracy of their proposed patent retrieval techniques, with the highest accuracy of 96% obtained using an ensemble-based retrieval method [56].
RQ3: What are the most widely used patent data collections?
Researchers must consider various factors when selecting databases for patent search and retrieval processes. Key considerations include availability, open access, and ease of use of the database. The size of the database is crucial as larger databases offer broader coverage of prior art, enabling more comprehensive searches. These extensive datasets are also vital for training machine learning and natural language processing models. Specific dataset parameters, such as publication dates, patent classification codes, and keywords, are essential for refining search queries and effectively filtering results. These parameters allow researchers to target specific time periods and technical fields effectively. Additionally, precise annotations or classifications within the database enhance the efficiency of the retrieval process. Below is a list of data collections or datasets that have been used for evaluating patent retrieval systems in the surveyed research articles:
The CLEF-IP (
http://ifs.tuwien.ac.at/~clef-ip/ URL (accessed on 29 March 24)) (Conference Labs of the Evaluation Forum—Intellectual Property) is an Intellectual Property (IP) track under the umbrella of CLEF (
https://www.clef-initiative.eu/ URL (accessed on 29 March 24)) (Conference and Labs of the Evaluation Forum), which is a European series of workshops that commenced in 2001 to promote research on cross-language information retrieval (CLIR). CLEF-IP was conducted between 2009 and 2013 to evaluate the performance of patent retrieval (PR) systems. It offers datasets for various activities, including prior art searches and patent classification. The CLEF-IP data collection encompasses patent documents gathered from USPTO, EPO, and WIPO, presented in XML format with a common DTD structure. The documents include sections such as bibliographic data, abstracts, descriptions, and claims, often in multiple languages (English, German, and French), as required by the European Patent Office (EPO) for granted patents. This collection is organized in corpus and topic pools to aid the information retrieval community in comparing the efficacy of various information retrieval techniques [
8,
123]. A more detailed explanation of the variants of the CLEF-IP dataset is given in
Table S2 in Section S3.2 of the Supplementary Material [
8,
123,
124,
125,
126].
Numerous studies [39,42,43,45,46,48,49,56,66,73,74,79,82,83,85,87,89,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118] have utilized CLEF-IP as their dataset for experimentation and verification.
The NTCIR (
https://research.nii.ac.jp/ntcir/ntcir-12/index.html URL (accessed on 29 March 24)) (NII Test Collection for Information Retrieval Research) workshop was first initiated in 1997 by the Japanese National Institute of Informatics to promote research in information retrieval (IR) and related fields, with a specific emphasis on cross-language information retrieval (CLIR) [
8]. The Patent Retrieval Task aims to offer test sets for research on patent information processing, including retrieval and mining [
127].
Table S3 in Section S3.2 of the Supplementary Material explains patent retrieval tasks in NTCIR-3, NTCIR-4, NTCIR-5, and NTCIR-6 [
8,
127,
128,
129,
130].
Only one study [84] utilized the NTCIR-6 dataset.
The TREC (
https://trec.nist.gov/overview.html URL (accessed on 29 March 24)) (Text Retrieval Conference), initiated in 1992 and jointly sponsored by the National Institute of Standards and Technology (NIST) and the United States Department of Defense, is a widely recognized forum for evaluating information retrieval methods. Specifically, the TREC-CHEM track focuses on chemical patent retrieval to promote and stimulate research on chemical datasets.
Table S4 in Section S3.2 of the Supplementary Material elaborates on patent retrieval-related tasks in TREC-CHEM 2009, TREC-CHEM 2010, and TREC-CHEM 2011 [
8,
131,
132,
133].
Two studies [
58,
75] reported using the TREC dataset.
The EPO (
https://www.epo.org/en URL (accessed on 29 March 24)) (European Patent Office), set up in 1973, is an international organization with 39 member states that is responsible for granting patents in Europe. EPO applications must be submitted in English, French, or German [
134]. The EPO provides a valuable source of patent data for researchers to perform tasks related to patent retrieval and analysis, with the EPO dataset including only patents granted by the European Patent Office (EPO) since 1978 [
135]. Studies such as those in references [
88,
117] have utilized the EPO dataset; reference [
88] utilized a subset of EPO patents consisting of two million English, one million French, and one million German patents, each indexed individually for claims and descriptions, while reference [
117] analyzed over a million patent applications filed with the European Patent Office (EPO) between 1982 and 2005, in conjunction with more than 20 million PubMed documents published before the beginning of 2011.
Google Patents (
https://patents.google.com/ URL (accessed on 29 March 24)) provides both a search engine and public datasets, allowing users to browse and search an extensive repository of patent-related information. The Google Public Patent Database provides a huge warehouse of patent-related data and information for in-depth research and analysis and includes patents issued by 17 different patent offices, including the United States Patent and Trademark Office (USPTO) and the European Patent Office (EPO). Moreover, it features interlinked database tables to enable data-driven investigations of patent analysis and retrieval-related tasks [
136]. Five studies [
60,
61,
66,
111,
115] have utilized Google Patents.
Chinese patents are managed and granted by the China National Intellectual Property Administration (CNIPA (
https://english.cnipa.gov.cn/) URL (accessed on 29 March 24)). These patents are accompanied by datasets that provide significant information for research and analysis. CNIPA datasets include bibliographic information, publishing data, and legal data for patents, utility models, and designs from 1985 to the present. Moreover, documents can be searched using a range of search parameters, such as application number, publication number, publication date, applicant or patent holder, priority, patent agency, class, title, and abstract [
137]. Only one study [
121] has utilized the Chinese patents.
The IBM Almaden (
https://research.ibm.com/labs/almaden URL (accessed on 29 March 24)) provides a collection of more than 13 million patents stored in an Oracle Database Management System for Novartis, as part of a collaboration project between Novartis/NIBR-IT and IBM. These patents, which cover various application areas associated with life and health sciences, can be retrieved using SQL queries and are stored in an XML format. Reference [
116] utilized the IBM Almaden data.
Indian patents are managed and granted by the Council of Scientific & Industrial Research (CSIR (
https://www.csir.res.in/) URL (accessed on 29 March 24)) and the Controller General of Patents, Designs & Trade Marks (CGPDTM (
https://ipindia.gov.in/patents.htm) URL (accessed on 29 March 24)). Patestate (
https://www.patestate.com/ URL (accessed on 29 March 24)) is an online database encompassing CSIR (Council of Scientific & Industrial Research) granted patents. The Indian patent database provides researchers with useful patent-related information, including application numbers, filing and publication dates, invention titles, international classifications, priority document details, applicant and inventor names, and abstracts [
138]. Three studies [
76,
77,
80] have utilized the Indian patents.
MAREC (
https://www.ifs.tuwien.ac.at/imp/marec.shtml URL (accessed on 29 March 24)) is a large repository of over 19 million patent applications and granted patents taken from EP, WO, US, and JP databases spanning the years 1976 to June 2008. MAREC includes documents in multiple languages, including English, German, and French, with a majority being full-text documents. It facilitates research and analysis in domains that include information retrieval, natural language processing, and machine translation. Documents from numerous countries are standardized into XML format with a common citation style and patent numbering scheme. Standardized attributes include names of individuals, names of companies, dates, countries, languages, references, and detailed subject classifications. The MAREC collection consists of 19,386,697 XML files, totaling 621 GB. Two studies [
44,
94] have utilized the MAREC dataset.
Primarily, the National Institutes of Health (NIH) (
https://www.nihlibrary.nih.gov/nih-subject/patents URL (accessed on 29 March 24)) funds research and conducts studies. NIH dataset includes information (title and abstract) on extramural grants, contracts awarded, grant applications, NIH-supported organizations, NIH-funded scholars and interns via NIH programs, and biomedical manpower, covering the period from 2007 to 2010. Only one study [
54] has utilized the NIH dataset.
The PatentsView (
https://patentsview.org/ URL (accessed on 29 March 24)) patent database is an extensive resource and an open data platform developed in partnership with the United States Patent and Trademark Office (USPTO) in 2012, focusing on data related to intellectual property (IP). This database offers features such as patent visualizations, community collaborators, an API tool, a data query builder, and bulk data download, enabling a thorough investigation and evaluation of intellectual property data. The PatentsView database offers bulk downloadable patent metadata as well as comprehensive details on granted patents, as individual files in a tab-delimited format for programmers and researchers. It has several tables containing data on applicants, assignees, attorneys, classifications, examiners, inventors, citations, and more. Only one study [
68] has utilized the PatentsView dataset.
The Russian Patent (
https://rospatent.gov.ru/en/products_services/search_system URL (accessed on 29 March 24)) database includes patents on various technological innovations, granted within the Russian Federation by the Russian Patent and Trademark Office (Rospatent). The Federal Service for Intellectual Property (FIPS), which is part of Rospatent, provides information on patents (encompassing inventions as well as utility models) including bibliographic data, abstracts, descriptions, claims, drawings, and legal status. The information is presented in the Russian language, but abstracts have been translated into English for broader accessibility [
139]. Three studies [
71,
91,
99] have utilized Russian Patents.
The United States Patent and Trademark Office (USPTO (
https://www.uspto.gov/) URL (accessed on 29 March 24)) is a government office that grants patents and registers trademarks in the United States. The USPTO database enables inventors, researchers, and corporations to obtain patent-related data through search tools, including legal status information and access to full-text documents. These datasets provide comprehensive information on a range of intellectual property-related topics.
Table S5 in Section S3.2 of the Supplementary Material describes the different USPTO research datasets [
140].
Various studies have utilized the USPTO databases in their research [41,51,53,55,59,62,63,67,70,71,72,73,74,75,76,77,78,79,80,81,90,91,92,93,95,96,105,109,110,111,112,120].
The State Intellectual Property Office of China manages the Traditional Chinese Medicine (TCM) (
https://tcmsp-e.com/tcmsp.php URL (accessed on 29 March 24)) patents database, established in 2001. The TCM Patent Database enables patent examiners to easily search TCM-related patents by providing access to over 19,000 bibliographic data and 40,000 TCM formulas. This database provides several search options, including quick search, advanced search, TCM formula search, and search history [
141]. Only one study [
64] has utilized the TCM patents.
As patentability requirements can also be influenced by non-patent data, several researchers have also utilized non-patent datasets in their studies.
The PubMed (
https://pubmed.ncbi.nlm.nih.gov/ URL (accessed on 29 March 24)) Library, administered by the National Library of Medicine, is a vast database of biomedical literature sourced from scientific journals, research articles, and books. It facilitates the search and retrieval of biomedical and life sciences literature with the intent of improving health. This database includes over 36 million citations for biomedical literature, collected from MEDLINE. MEDLINE, created by the National Library of Medicine (NLM), is a prominent bibliographic database of biomedical literature that includes citations to journal articles in the biological sciences covering medicine, nursing, dentistry, veterinary medicine, health care systems, and preclinical sciences. These citations may include links to full-text articles from the publisher and PubMed Central. Two studies [
86,
114] have utilized PubMed data.
Two further non-patent datasets are used in the selected literature to assess the patent search and retrieval processes. The first, the Yeast (
https://archive.ics.uci.edu/dataset/110/yeast URL (accessed on 29 March 24)) dataset, was created in 1996 and is used in biology to predict the cellular localization sites of proteins. This dataset serves classification tasks and has been employed in two studies [
50,
108]. The second dataset, the 20 Newsgroups (
https://www.kaggle.com/datasets/crawford/20-newsgroups URL (accessed on 29 March 24)) dataset, comprises approximately 20,000 documents across 20 diverse newsgroups. It covers a broad range of topics from computer graphics and hardware debates to sports, politics, and religion. This dataset is particularly valuable for evaluating machine learning techniques in text-based applications such as text classification and clustering, as demonstrated in the same two studies [
50,
108].
Two datasets dominate, together being used by about 70% of our surveyed studies. A total of 25 studies made use of the Cross-Language Evaluation Forum for Intellectual Property (CLEF-IP) dataset and another 26 studies used the dataset provided by the United States Patent and Trademark Office (USPTO), as shown in Figure 5. The remaining datasets have been used much less frequently. The Google Patents dataset has been used by five studies, whereas three studies used the Indian Patents dataset and another three utilized the Russian Patents dataset.
Table 8 shows more details of the CLEF-IP tracks. CLEF-IP 2011 has been widely used and has been reported by 12 studies, whereas CLEF-IP 2012 has just been used by two studies [
82,
87]. The table caters for the cases where a single study uses more than one CLEF-IP track. For instance, if a study uses both CLEF-IP 2009 and CLEF-IP 2010, each is reported separately in the table.
The preferred databases for prior art searches in the selected studies appear to be USPTO and CLEF-IP. There are a few possible reasons for their selection. One potential reason for the USPTO could be its comprehensive coverage of patents, incorporating a wide array of inventions. Moreover, the USPTO enjoys a remarkable reputation amongst numerous industries, making it a significant resource for research and analysis. Lastly, the USPTO provides a broad spectrum of patents and patent applications filed in the United States, making it a rich source for prior art document searches. The likely explanation for favoring CLEF-IP is that it provides a well-curated and organized set of patents intended solely for patent prior art search tasks. Additionally, this dataset offers a well-annotated and standardized collection of patent documents, making the evaluation of prior art search and retrieval systems easier and more efficient.
RQ4: What is the impact of semantic search and natural language processing on the patent retrieval process?
Semantic search leverages natural language processing (NLP) techniques to interpret and comprehend the intent of the search query and the content of the documents being searched, facilitating the identification of more relevant and accurate results. The integration of semantic search and natural language processing (NLP) in the domain of patent retrieval enhances the efficiency and precision in locating relevant patent documents for researchers, inventors, and patent examiners. This combination enables a more refined matching process, capable of ranking complex-natured patent documents according to their semantic similarity with the search query by considering the contextual understanding of the content, which includes technical terminologies, legal jargon, and specialized acronyms.
Figure S2 in Section S3.3 of the Supplementary Materials summarizes the NLP-based techniques reported in the surveyed studies, while
Table 9 shows the details of the NLP models used. For instance, SBERT is one of the most widely deployed BERT variants. Among embedding techniques, word embedding is the most popular, with eight studies reporting its use. Another significant finding from the survey is that almost all studies published from 2019 onward have adopted NLP as their main patent retrieval technique. Moreover, the majority of those studies have used embeddings to aid the retrieval process.
Table S6 in Section S3.3 of the Supplementary Materials offers a detailed overview of the natural language processing (NLP) techniques employed in the selected studies, providing a concise description of each method and its application within the research context [
142,
143,
144,
145,
146,
147,
148,
149,
150,
151,
152,
153,
154,
155,
156,
157,
158,
159,
160,
161,
162,
163,
164,
165,
166,
167,
168,
169]. A total of 57 studies reported using NLP-based techniques, and 55 studies utilized semantic or contextual searches for patent retrieval tasks. Furthermore, 21 studies employed embeddings such as word, sentence, passage, and document embeddings to retrieve the patents of interest. This information is summarized in
Figure 6.
RQ5: Which parts of the patent document are widely used for prior art searches?
Patents are intricate legal and technical documents, incorporating a blend of technical terminologies, legal jargon, and specialized acronyms. A patent document consists of several component sections, each serving a specific purpose and holding a different level of significance.
Table S7 in Section S3.4 of the Supplementary Materials delineates the various components of a patent document, detailing their specific functions and providing explanations to enhance understanding of their roles within the overall structure of a patent.
Several studies have used complete patent documents, while others have utilized combinations of different sections, or in some cases, a single section of the patent document to retrieve patents of interest.
Figure 7 summarizes the findings of the survey. A total of 14 studies used the whole patent document in the patent retrieval process, while 12 studies used the title, abstract, claims, and description sections to perform prior art and other searches. The abstract and title are also a popular combination, used by seven studies. Some studies [
60,
122] have made use of the title only to perform their searches. However, a total of eight studies did not specify the part of the patent document they used to perform prior art searches.
Using full patent documents to assess patent retrieval systems may enable a thorough grasp of the context, accommodate varying levels of detail, and capture the full scope of an invention and its implications, but it requires more time and resources. On the other hand, using specific parts of the patent documents, such as abstracts, claims, descriptions, or titles, to train and evaluate a patent retrieval system is an efficient way to exploit the important information in patent documents, as these parts focus on crucial details about the invention. This approach also makes system evaluation and initial relevance ranking faster. Nevertheless, selective use may overlook the contextual depth and crucial information found in full patent documents. Therefore, the parts of the patent document used to evaluate retrieval systems should be selected to strike a balance between efficiency and comprehensiveness. Claims (legal scope), abstracts (concise summary), titles (short explanation of subject matter), descriptions (detailed technical information), and citations (intellectual lineage) are the sections most likely to be relevant for evaluating retrieval system performance, as together they cover both technical details and intellectual context.
Classification systems like the International Patent Classification (IPC) and the Cooperative Patent Classification (CPC) greatly improve patent searches by offering a standardized framework for classifying and retrieving patent documents according to specific technological domains. The IPC, managed by WIPO, is used globally and divides technology hierarchically into sections, classes, subclasses, groups, and subgroups. It provides a systematic approach that facilitates straightforward classification and retrieval of patents, which is especially useful in traditional search environments. On the other hand, the CPC, a joint initiative by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO), builds on the IPC with more detailed classifications, containing over 250,000 categories compared to the IPC’s 70,000. This allows for even more precise and granular search outcomes, enhancing the relevance and accuracy of searches. In addition, while the IPC is currently revised once a year, the CPC is revised several times a year, reflecting its capacity to adapt more quickly to technological advancements. Nevertheless, the efficacy of these systems is contingent upon the precision of the examiners’ initial classification, which can be prone to errors. Despite these challenges, both IPC and CPC codes are invaluable resources for researchers and practitioners performing patent searches, with the CPC offering a deeper level of detail for comprehensive investigations.
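The hierarchy of sections, classes, subclasses, groups, and subgroups can be made concrete with a short sketch that splits a classification symbol into its levels. It assumes symbols written in the common form `H04L 9/32` (section letter, two-digit class, subclass letter, main group, slash, subgroup); the function names `parse_symbol` and `shared_level` are hypothetical helpers introduced here for illustration and do not come from any surveyed study or official tool.

```python
import re

# Section letter (A-H in the IPC, plus Y in the CPC), two-digit class,
# subclass letter, main group number, "/" and subgroup number.
SYMBOL_RE = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")

# Levels ordered from most specific to most general.
LEVELS = ["subgroup", "main_group", "subclass", "class", "section"]


def parse_symbol(symbol):
    """Split a symbol such as 'H04L 9/32' into its hierarchy levels."""
    match = SYMBOL_RE.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized classification symbol: {symbol!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    prefix = f"{section}{cls}{subclass}"
    return {
        "section": section,
        "class": f"{section}{cls}",
        "subclass": prefix,
        "main_group": f"{prefix} {main_group}/00",
        "subgroup": f"{prefix} {main_group}/{subgroup}",
    }


def shared_level(symbol_a, symbol_b):
    """Return the most specific level at which two symbols agree, or None."""
    a, b = parse_symbol(symbol_a), parse_symbol(symbol_b)
    for level in LEVELS:
        if a[level] == b[level]:
            return level
    return None


print(parse_symbol("H04L 9/32"))
print(shared_level("H04L 9/32", "H04L 63/08"))
```

A search tool could use such a comparison to widen or narrow a candidate set, for example by keeping only documents that share at least a subclass with the query patent.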
RQ6: What open challenges does prior art patent search and retrieval face?
Numerous studies have identified several key challenges in prior art searches and patent retrieval, as summarized in
Table 10 from the analysis of the 78 articles. A primary concern is the ‘inconsistent document structure and formats’ [
42,
59,
60,
63,
71,
72,
85,
89,
93,
107] across patents. This inconsistency complicates retrieval due to the blend of structured and unstructured sections, diverse formats, and multiple languages. It stems from the nature of patent filings, which must accommodate a wide range of innovations and legal requirements across different jurisdictions. As a result, the organization and presentation of information vary, including the use of structured (e.g., claims, abstracts) and unstructured (e.g., descriptions, drawings) sections and the acceptance of documents in various formats and languages to suit international standards and applicant preferences. Retrieval systems must therefore interpret highly diverse content.
Another significant obstacle is the ‘sophisticated patent language and vocabulary’ [
44,
47,
69,
70,
87,
88,
96,
109], which arises from the need for precise descriptions and claims of innovation. The use of intricate technical, legal, and domain-specific terms aims for legal accuracy but introduces complexity in the retrieval process. Search systems must interpret and match the nuanced vocabulary used in patents, which is compounded by the vast array of patent documents across different jurisdictions with varying requirements and languages requiring translation. The challenge of accurately parsing and understanding this intricate language underscores the need for retrieval techniques sophisticated enough to effectively handle these linguistic complexities.
Several researchers have also reported ‘query reduction and query expansion’ to disambiguate the query as a significant challenge for patent retrieval [
39,
61,
62,
74,
78,
107]. Patent documents are inherently complex due to their lengthy nature and extensive use of technical and specialized terminology. Accurately capturing the query’s requirements and goals is crucial to retrieving the most relevant patent documents. Therefore, researchers have developed techniques to disambiguate the query, sometimes by reducing the query to remove irrelevant and noisy terms, and sometimes by expanding the query to add more relevant keywords using known external sources. Researchers must strike a balance between oversimplification and adding complexity to the query to enhance the patent retrieval results.
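The two reformulation steps described above can be sketched as follows. The noise-term list and the synonym table are hypothetical stand-ins: in practice the reduction step would rely on a term-weighting or selection method, and the expansion step on an external resource such as a thesaurus or an embedding model.

```python
# Hypothetical noise terms: general stopwords plus generic patent wording.
NOISE_TERMS = {"a", "an", "the", "of", "for", "and",
               "said", "wherein", "comprising", "method", "system"}

# Hypothetical synonym table standing in for an external knowledge source.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "battery": ["accumulator"],
}


def reduce_query(terms, noise=NOISE_TERMS):
    """Query reduction: drop terms considered uninformative."""
    return [term for term in terms if term not in noise]


def expand_query(terms, synonyms=SYNONYMS):
    """Query expansion: append known synonyms, keeping order, no duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def reformulate(query):
    """Reduce first, then expand, so that noise terms are never expanded."""
    terms = query.lower().split()
    return expand_query(reduce_query(terms))


print(reformulate("a method for charging the car battery"))
```

Reducing before expanding is a design choice in this sketch: it keeps the expanded query from growing around terms that carry little information, which is one way to manage the balance between oversimplification and added complexity.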
‘Term mismatch’ is another crucial challenge [
44,
46,
61,
73,
78,
94,
113], with researchers addressing the impact of semantic and conceptual variation as well as multilingual complexities on patent retrieval. Relevant documents might not be retrieved due to the use of synonyms and specialized terms in the query; therefore, researchers must develop techniques that bridge the gap between user queries and patent documents.
Some studies have reported the challenge of ‘information asymmetry and overload’ [
47,
63,
68,
75,
92,
93], which affects the patent retrieval process through imbalances in the availability of information among patent offices, applicants, and researchers, and through the sheer volume of patent-related information. Unlike patent offices, applicants and researchers often lack access to cutting-edge tools and comprehensive databases, and this limited access can hinder their ability to conduct exhaustive patent searches. This asymmetric access can affect the outcomes of patent searches and the overall patent application process.
The ‘requirement of a high recall’ has also been listed as one of the major challenges faced by the patent retrieval process, particularly for prior art searches [
39,
50,
66,
89,
107]. High recall is necessary in prior art searches to ensure thorough scrutiny of all potential references that may affect the novelty of a new patent application or invention. Without high recall, patent retrieval systems risk overlooking relevant patent documents, which can in turn lead to legal issues.
Several studies have reported challenges related to the ‘limitations of keyword-based searches’ [
66,
114], where the focus is on overcoming the intrinsic shortcomings of keyword searches, including their inability to recognize synonyms and capture semantic meaning. Moreover, formulating complex queries involving multiple concepts or special operators, such as Boolean and proximity operators, is challenging in keyword-based searches, which can affect the outcome of patent retrieval systems. Likewise, some studies have dedicated efforts to enhancing ‘retrieval results accuracy and efficiency’. Precision and recall are both crucial in patent retrieval systems: high precision ensures that the retrieved patents are relevant, while high recall ensures that all relevant patents are retrieved. Nevertheless, it is a challenge to strike a balance between the two. Researchers must develop techniques that consider not just user requirements but also the context of the required search. For example, references [
57,
84,
111] propose efficient techniques based on the Skip-gram and TF–IDF models.
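The tension between precision and recall mentioned above can be seen in a small worked example. The ranked list and the set of relevant documents are hypothetical; the sketch only uses the standard definitions, namely precision as the share of retrieved documents that are relevant and recall as the share of relevant documents that are retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical ground truth and system ranking for one prior art query.
RELEVANT = {"P1", "P2", "P3", "P4"}
RANKED = ["P1", "P9", "P2", "P7", "P3", "P8", "P4", "P6"]

# Evaluate progressively longer result lists cut from the same ranking.
for cutoff in (1, 4, 8):
    p, r = precision_recall(RANKED[:cutoff], RELEVANT)
    print(f"top {cutoff}: precision={p:.2f} recall={r:.2f}")
```

In this toy ranking, returning only the top result gives perfect precision but misses three of the four relevant patents, whereas returning all eight results finds every relevant patent at the cost of half the list being irrelevant. Prior art searches typically accept the second situation, since a missed reference is costlier than an extra document to review.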
Some challenges have been reported by just one study, such as ‘commercial patent analysis’ [
115], typically conducted to aid strategic decision-making; ‘finding the most relevant classification codes’ [
117], required as part of accurately categorizing the patent applications; ‘lack of data for training the BERT’ [
44], related to the insufficiency of diverse datasets for BERT training; ‘examiner citations recommendation problem’ [
81], referring to the challenges that patent examiners might face in conducting prior art searches where proper referencing is not performed on the existing patent literature; and ‘patent ranking’ [
75], related to the process of ranking patents by relevance.
4.5. Limitations and Future Directions
This systematic literature review (SLR) is subject to several limitations that should be considered when interpreting its findings. Firstly, the temporal scope of this review is confined to the past decade. While this allows for a focus on recent advancements and trends, it may omit relevant insights and foundational work published before this period, possibly skewing the understanding of the evolution and development of patent retrieval techniques. Secondly, the review is specifically concentrated on patent prior art retrieval. This focus provides depth in one area but limits the breadth of the investigation, potentially overlooking broader aspects of patent retrieval such as infringement checks or patent validity analysis, which could provide a more holistic view of the field. Thirdly, the limitation of sourcing only from journal articles and conference papers might miss significant insights found in other forms of literature, such as book chapters, industry reports, and technical reports. These sources often contain valuable practical applications and case studies that could offer a different perspective or complement the academic viewpoints presented.
Moreover, the review exclusively considers English-language articles, thereby restricting the diversity of viewpoints and potentially missing important contributions from non-English-speaking regions. This language barrier may introduce a bias towards English-speaking researchers’ perspectives and methodologies, possibly overlooking innovative approaches developed in other linguistic contexts. Additionally, the exclusion of non-text-based data might lead to overlooking complex innovations that are not easily described in text form. This exclusion could limit the comprehensiveness of the review, especially in fields where visual data are paramount. Lastly, the omission of documents for which full-text access is not available could result in overlooking critical studies. This reliance on accessible full-text documents might bias the review towards more readily available or popular sources, potentially missing out on pivotal but less accessible research.
Each of these limitations not only frames the current findings but also sets the stage for future research directions. Addressing these limitations in subsequent studies could expand the understanding of patent retrieval practices, offering a more comprehensive and inclusive perspective. Future research could aim to include a wider range of sources, extend the temporal coverage, incorporate multilingual studies, and consider non-textual data to enhance the richness and applicability of the findings.
The introduction of Generative AI in patent retrieval offers a promising avenue to address some of these limitations by enhancing search accuracy, automating patent classification, and providing advanced summarization and translation features. With its sophisticated ability to comprehend the nuances of search queries, Generative AI could revolutionize prior art searches, forecast innovation trends, and enable more dynamic and interactive querying processes. However, the integration of this technology must be approached with caution, carefully considering potential challenges such as bias, data privacy, and the costs associated with implementation.
Future research should also broaden the scope of sources used in patent retrieval. This includes incorporating multilingual and non-textual data to better capture the global and diverse range of innovations. Additionally, while non-patent literature may not traditionally fall within the strict bounds of patent retrieval, its integration is crucial for comprehensive prior art searches to establish novelty. Exploring non-patent literature alongside traditional patent databases enriches the context and depth of searches, potentially uncovering prior art that patent documents alone might miss. Furthermore, examining different sections of patent documents, such as claims and descriptions, can provide a richer and more comprehensive dataset for analysis. Expanding the variety of databases explored in research, from well-established patent databases to newer or less utilized sources, can also enhance the depth and breadth of retrieval results. By embracing a wider array of information sources, researchers can develop more robust systems for patent analysis that better reflect the complexities and nuances of innovation across different industries and regions.
Further, embracing interdisciplinary approaches that utilize different technological advances can significantly enrich the field of patent retrieval. Techniques from data analytics, information retrieval, and artificial intelligence could be synergistically combined to develop more robust systems for managing and analyzing patent information. These advancements aim to make patent retrieval more comprehensive, inclusive, and effective, enabling researchers to navigate the complexities of global innovations more efficiently.
This forward-looking approach is designed not only to address the identified limitations but also to harness cutting-edge technologies and methodologies to expand the boundaries of what is currently possible in patent retrieval research. By adapting to and integrating these innovations, the field can evolve to meet the challenges of an increasingly complex intellectual property landscape.