1. Introduction
Patent keyword analysis (PKA) is important to technology management because a patent contains extensive and detailed information about the developed technology. Using the PKA results, we can build research and development (R&D) plans and strategies for the target technology. In general, for PKA, we extract technology keywords from patent documents using text mining techniques [1,2]. Using the extracted keywords, we construct a patent–keyword matrix for PKA based on statistics and machine learning algorithms. The matrix consists of elements representing the frequency values of keywords that occur in patents [1,3,4,5]. In most cases, this matrix has a sparse data structure that suffers from the zero-inflated problem [3,4,5]. This is because a keyword that is included in even just one patent document becomes one column in the matrix [3,4,5]. The sparse zero-inflated problem reduces the performance of PKA models [4,6]. As such, we have to solve the zero-inflated problem for PKA. Many existing studies rely on statistical models such as the zero-inflated Poisson and negative binomial models to solve the problem [5,7,8,9,10,11]. Recently, studies based on machine learning methods such as generative models have been conducted to solve the zero-inflated problem [4,5]. However, existing models have limitations in that model performance deteriorates as the proportion of zeros included in the data increases [3,4,5,7,8]. To solve this problem, we consider a regression model based on a quantile Cumulative Distribution Function (CDF) [12,13,14]. We call this model CDF-based Quantile Regression Model (QRM) in this paper. To verify the performance of the CDF-based QRM, we perform experiments using patent documents related to blockchain technology.
The motivation for this research is to appropriately deal with the zero-inflated problem that occurs in patent keyword data analysis. In particular, we study a method to overcome the extreme zero-inflated problem, where the proportion of zeros in the given data exceeds half. Since the extreme zero-inflated problem is difficult to solve even with existing statistical zero-inflated models, we need to find new methods to solve it.
The remainder of this paper is organized as follows. We survey works related to our research such as regression and zero-inflated models in Section 2. In Section 3, we present the theoretical explanation of our proposed method and the analysis process step by step. In addition, we present the performance evaluation indexes of comparative models in this section. Next, we show the improved performance and validity of our proposed method from the experimental results using patent documents related to blockchain technology in Section 4. In this section, we compare the model performance of the CDF-based QRM with traditional linear regression and statistical zero-inflated models. In Section 5, we illustrate how the proposed method can be applied to practical tasks in various domains. Lastly, we provide the conclusions and future works related to our research in Section 6.
2. Related Works
Patent analysis has been performed in various technology domains such as photovoltaics, medicine, mountain logistics, climate change, artificial intelligence (AI), surgery, and energy [15,16,17,18,19,20,21]. This is because when developers register a technology they have developed as a patent, they are guaranteed exclusive rights to use their technology for a certain period of time. Therefore, we analyze patents to understand these technologies. We also use the results of patent analysis for technology management tasks such as R&D strategy development. PKA, the subject of this paper, is also a field of patent analysis. PKA mainly extracts technology keywords from the abstracts and claims contained in collected patent documents and analyzes them. In this process, we use text mining and various data analysis methods based on statistics and machine learning.
The regression model is very popular in machine learning as well as in statistics [12,22,23,24]. This model consists of independent and dependent variables called X and Y, respectively [22]. Regression analysis is statistical modeling that explores relationships between variables [24]. We can predict Y for a given X using regression analysis [19]. Figure 1 shows a process of regression modeling [23].
We assume that the response variable $Y$ adds the error $\varepsilon$ to the explanatory variables $X$. Using $X$ and $Y$, we create the linear regression model (LRM) as follows.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \tag{1}$$

In Equation (1), $X$ is $(X_1, X_2, \ldots, X_p)$, and we estimate the model parameters, $\beta_0, \beta_1, \ldots, \beta_p$, that minimize the error using the least squared loss function [12,24]. The error represents a random noise included in observed data and follows a normal distribution with a mean ($\mu$) = 0 and variance = $\sigma^2$. This model has provided good performance in exploring the relationship between $X$ and $Y$ in most data, including errors with a mean of 0 and equal variance [24]. However, we have difficulty in using the LRM when the given data does not satisfy the model assumptions [9,10,11]. In particular, if the given data has many extreme values, we cannot use the LRM [4,5,6]. To solve the problems of LRM, we can consider the QRM [13,25]. Quantile regression aims to model the impact of explanatory variables on the quantile of the response variable. The QRM finds the conditional quantile of $Y$ just as the regression based on the least square method estimates the conditional mean of $Y$ [9]. We can apply both continuous and count data to QRM. QRM is a model that can be used when the given data do not satisfy the normality assumption and are asymmetric or contain many outliers. In the PKA, we found that the patent–keyword matrix contains zero-inflated data that is sparse and asymmetric. As such, we propose a method to analyze the patent keywords using QRM. In addition, we consider the CDF for our PKA model based on QRM because we aim to predict the specific quantile of each patent keyword.
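The contrast between estimating a mean and estimating a quantile can be illustrated with a small Python sketch. It is not taken from the paper: the counts in `y` are hypothetical, and `check_loss` is an illustrative name for the usual pinball (check) loss that quantile regression minimizes. The sketch only searches over constant predictions, so it shows the target of each loss rather than a full regression fit.

```python
import numpy as np


def check_loss(y, q, tau):
    """Mean pinball (check) loss of a constant prediction q at quantile level tau."""
    residual = y - q
    return float(np.mean(np.where(residual >= 0, tau * residual, (tau - 1) * residual)))


# Hypothetical zero-inflated keyword counts (not the paper's data).
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 2, 15])

# Search over integer candidates for the constant that minimizes each loss.
candidates = np.arange(0, 16)
best_for_median = candidates[np.argmin([check_loss(y, q, 0.5) for q in candidates])]
best_for_q75 = candidates[np.argmin([check_loss(y, q, 0.75) for q in candidates])]

# The least squares target is the mean, which one large count pulls upward.
mean_y = y.mean()
```

With these toy counts the least squares target (the mean) is pulled away from zero by a single large value, while the quantile targets stay close to where most of the data lie. This is the behavior that makes quantile-based models attractive for sparse, asymmetric keyword counts.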
In statistics, the zero-inflated model is typically used to analyze data that contain a lot of zeros [8,10,11]. This model has been used to solve the zero-inflated problem that occurs in various domains [26,27,28,29]. The zero-inflated model based on statistics is defined as follows [9]:

$$P(X = x) = \begin{cases} \pi + (1 - \pi) f(0), & x = 0 \\ (1 - \pi) f(x), & x > 0 \end{cases} \tag{2}$$

In Equation (2), $f(x)$ is a density of random variable $X = x$. In the statistical zero-inflated model, the probability model of $X$ is separated into two parts of zero and non-zero [9,10,11]. The $\pi$ represents the probability of zero occurrence. Although the statistical zero-inflated models have been used to overcome the problem that arises in various data analysis processes, they have shown a problem in that model performance deteriorates as the proportion of zeros in the data increases [3,4,5]. Therefore, in this paper, we propose a PKA method using QRM for analyzing patent keyword data with a high zero ratio.
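The two-part structure of the zero-inflated model described above can be sketched in Python as follows. This is a minimal illustration under assumptions: the symbol `pi` for the probability of zero occurrence, the Poisson choice for the count density, and the parameter values are all hypothetical and chosen only to make the example concrete.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of observing the count x with rate lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def zero_inflated_pmf(x, pi, base_pmf):
    """P(X = x) when a zero arises either structurally (prob. pi) or from base_pmf."""
    if x == 0:
        return pi + (1.0 - pi) * base_pmf(0)
    return (1.0 - pi) * base_pmf(x)


# Hypothetical parameters: 60% structural zeros, Poisson rate 2.
pi, lam = 0.6, 2.0


def base(k):
    return poisson_pmf(k, lam)


p_zero = zero_inflated_pmf(0, pi, base)

# Summing far into the tail should recover (numerically) total probability 1.
total = sum(zero_inflated_pmf(k, pi, base) for k in range(60))
```

The probability of a zero under the mixture is always at least as large as under the plain count model, which is how the model absorbs excess zeros.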
4. Experiments and Results
The experiments were carried out using real patent documents to illustrate how the proposed method can be applied in practice. We collected patent documents related to blockchain technology from world patent databases [30,31]. Blockchain technology has developed alongside related technologies such as bitcoin and cryptocurrency. So, in this experiment, we examine the technological relations between blockchain technology and other related technologies based on the keywords blockchain, access, authentication, bitcoin, cryptocurrency, databank, distributor, encash, ledger, network, and secretkey. In this paper, we chose blockchain technology as our target domain. Blockchain is defined as a technology for securely managing data across distributed systems [6]. We select the keyword blockchain as the response variable and use the remaining ten extracted keywords (access, authentication, bitcoin, cryptocurrency, databank, distributor, encash, ledger, network, and secretkey) as explanatory variables ($X$). Figure 2 shows the process of our proposed modeling of PKA.
First, we collected the patent documents related to blockchain technology using a keyword search expression from patent databases across the world [30,31]. Next, we chose the valid patents representing blockchain technology and preprocessed the valid patent documents. In our experiments, we used R as the tool for statistical analysis. R is free, open-source software that supports statistical analysis and visualization [32]. At the time of writing, the current version of R was 4.4.2. R has been widely used for statistical analysis of data generated in various fields [33]. We also used the tm package of R for text mining [1]. This package provides many functions for preprocessing of text data using natural language processing [1,2]. Lastly, we used the cdfquantreg package of R for QRM [14]. In addition, using the functions provided in the R base module and the pscl package, we carried out performance evaluation between the proposed model and the comparative models [32,34]. The elements of the patent–keyword matrix built from the preprocessed documents are the frequency values of the keywords occurring in the patent documents. This is structured data that can be used in CDF-based QRM. To select the patent keywords for blockchain technology, we considered the results of keyword extraction from previous research related to blockchain technology analysis [32]. On that basis, we determined one response variable (blockchain) and ten explanatory variables (access, authentication, bitcoin, cryptocurrency, databank, distributor, encash, ledger, network, and secretkey). We used the R project and packages described above for our experiment [1,2,14,32,33,34].
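The construction of a patent–keyword matrix from preprocessed text can be sketched as follows. The study itself used R and the tm package; this Python version is only an illustration, and the toy `documents`, the `keywords` list, and the function name `patent_keyword_matrix` are hypothetical stand-ins rather than the authors' data or code.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical toy "patent abstracts"; the real corpus is not reproduced here.
documents = [
    "A blockchain ledger shared over a network",
    "Authentication of network access using a secret key",
    "A ledger for bitcoin and other cryptocurrency; the ledger is distributed",
]
keywords = ["blockchain", "ledger", "network", "bitcoin", "cryptocurrency", "access"]


def patent_keyword_matrix(docs, vocab):
    """Return an (n_docs, n_keywords) array of keyword frequencies."""
    matrix = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, doc in enumerate(docs):
        # Lowercase and split into alphabetic tokens, then count each token.
        counts = Counter(re.findall(r"[a-z]+", doc.lower()))
        for j, word in enumerate(vocab):
            matrix[i, j] = counts[word]
    return matrix


X = patent_keyword_matrix(documents, keywords)

# Share of zero cells; values above 0.5 correspond to the extreme case
# discussed in the paper.
zero_ratio = float(np.mean(X == 0))
```

Even in this three-document toy example more than half of the cells are zero, and the share grows quickly as more keywords become columns.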
Table 1 shows the summary statistics of the patent keywords.
In the results in Table 1, we found that the patent–keyword matrix data is very sparse and zero-inflated because most elements of the matrix are zeros. The median values of most keywords were also zero. Therefore, we have difficulty analyzing the patent keyword data using traditional data analysis methods. To overcome the problem, we proposed patent keyword analysis using CDF-based QRM in this paper. In the CDF-based QRM, the response variable must take real values between 0 and 1. So, we changed the values of the blockchain keyword by the following normalization.
Using Equation (14), the values of the response variable are changed to numerical values in the interval (0,1). The model of patent keyword analysis consists of one response variable of the keyword blockchain and ten explanatory variables of all keywords except blockchain as follows.
$Y$: blockchain
$X$: access, authentication, bitcoin, cryptocurrency, databank, distributor, encash, ledger, network, secretkey
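Rescaling a count response into the open interval (0,1) could look like the Python sketch below. The paper's own normalization (Equation (14)) is not reproduced in this text, so the transform here is an assumed stand-in: a min-max rescaling followed by a commonly used "squeeze" that keeps values strictly away from 0 and 1. The function name and the toy counts are hypothetical.

```python
import numpy as np


def squeeze_to_open_unit_interval(y):
    """Min-max rescale y, then compress so every value lies strictly in (0, 1)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    span = y.max() - y.min()
    if span == 0:
        raise ValueError("all values are identical; cannot rescale")
    scaled = (y - y.min()) / span        # values in [0, 1], endpoints included
    return (scaled * (n - 1) + 0.5) / n  # endpoints moved inside (0, 1)


# Hypothetical frequencies of the response keyword in five patents.
counts = np.array([0, 0, 0, 1, 4])
y_unit = squeeze_to_open_unit_interval(counts)
```

The squeeze matters because models defined on (0,1) generally cannot evaluate a likelihood at exactly 0 or 1, and a zero-inflated response would otherwise place most observations on the boundary.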
Using the indexes of (11), (12), and (13), we compared the performance between CDF-based QRM and LRM. Table 2 shows the results of model performance between the compared models according to loglikelihood, AIC, and BIC. In this paper, we compared our proposed QRM with LRM and the zero-inflated model in terms of explanatory and predictive power. Loglikelihood is an index that measures the explanatory power of the model, and AIC and BIC are indexes that compare the predictive power between models. A larger loglikelihood and smaller AIC and BIC values indicate a better model.
In Table 2, to compare the performance of CDF-based QRM and LRM, we built simple models consisting of one keyword each and a full model using all keywords. First, the loglikelihood result shows that the loglikelihoods of CDF-based QRM for all keywords are larger than those of LRM. This shows that the results of patent keyword analysis using CDF-based QRM are better than those of the LRM. Next, in the comparison results based on AIC, the AIC values of CDF-based QRM are smaller than those of LRM for both the model using all keywords and the models using each keyword. This illustrates that CDF-based QRM is superior to LRM from the AIC perspective. Lastly, we compared the BIC values between CDF-based QRM and LRM. In Table 2, we can see that the BIC values of CDF-based QRM are smaller than those of LRM. This means that the model performance of CDF-based QRM is better than that of LRM. Therefore, we show the validity of our proposed approach to patent keyword analysis from the comparison results by loglikelihood, AIC, and BIC.
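The comparison logic can be made concrete with a short Python sketch. It assumes the standard definitions AIC = -2 logL + 2k and BIC = -2 logL + k ln(n), where k is the number of estimated parameters and n the number of observations; the paper's own index definitions are given in its method section and are not reproduced here. All numeric values below are hypothetical and are not results from Table 2.

```python
import math


def aic(loglik, k):
    """Akaike information criterion; smaller is better."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion; smaller is better."""
    return -2.0 * loglik + k * math.log(n)


# Hypothetical fits on n = 200 patents: model B has one more parameter
# but a clearly higher loglikelihood than model A.
n = 200
loglik_a, k_a = -120.0, 3
loglik_b, k_b = -100.0, 4

aic_a, aic_b = aic(loglik_a, k_a), aic(loglik_b, k_b)
bic_a, bic_b = bic(loglik_a, k_a, n), bic(loglik_b, k_b, n)

# Model B is preferred on both criteria despite the extra parameter.
b_preferred = aic_b < aic_a and bic_b < bic_a
```

Both criteria penalize extra parameters, so a model with additional submodels is preferred only when its gain in loglikelihood outweighs the penalty.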
The last column of each index that evaluates the performance of the model presents the results of the analysis using the statistical zero-inflated model. In this paper, we used the zero-inflated Poisson (ZIP) model as a statistical zero-inflated model [10,11]. This model uses the Poisson distribution as the probability function of the statistical zero-inflated model. The following shows the ZIP model [10,11].

$$P(X = x) = \begin{cases} \pi + (1 - \pi) e^{-\lambda}, & x = 0 \\ (1 - \pi) \dfrac{\lambda^{x} e^{-\lambda}}{x!}, & x > 0 \end{cases} \tag{15}$$

Equation (15) uses the probability function of the Poisson distribution as $f(x)$ in Equation (3). In Equation (15), $\lambda$ is the parameter of the Poisson distribution and $\pi$ is the probability of zero occurrence. In all indexes of loglikelihood, AIC, and BIC, we confirmed that the model performance of ZIP is inferior to that of QRM or LRM. This is because the proportion of zeros included in the patent–keyword matrix data exceeds half, as we confirmed in Table 1. Therefore, we could confirm that our QRM is superior to the LRM and ZIP models. Finally, we present the estimated parameter and p-value of each keyword in Table 3.
Depending on the keyword, we found that some keywords have a positive impact on blockchain while others have a negative impact. Additionally, from the p-values, we confirmed that the keywords bitcoin, cryptocurrency, databank, ledger, network, and secretkey have a statistically significant impact on blockchain technology because the p-values of these keywords are less than 0.05, that is, they are significant at the 5% significance level. We can apply the results in Table 3 to various technology management areas such as R&D planning. From the result of Table 3, we constructed a technology diagram of blockchain in Figure 3.
Among the 10 keywords related to blockchain, we can see that the keywords bitcoin, cryptocurrency, databank, ledger, network, and secretkey have a statistically significant effect on blockchain. Therefore, we can see that technologies based on these keywords are primarily necessary for the development of blockchain technology. We expect that these results will contribute to R&D planning for blockchain technology development in countries and companies.
5. Discussion
From the result of Table 1, we found that the patent keyword data related to blockchain technology exhibit the zero-inflated problem. So, we used the proposed method to solve the problem. From the results in Table 2, we confirmed that the performance of QRM is better than that of the LRM and ZIP. Therefore, we estimated the model parameters and their p-values using the QRM in Table 3. Lastly, using the results in Table 3, we constructed a technology diagram of blockchain in Figure 3. From the results in Figure 3, we found that the sub-technologies based on the keywords bitcoin, cryptocurrency, databank, ledger, network, and secretkey have a significant influence on the development of blockchain technology. In this paper, we applied the proposed method to analyze patent keyword data related to blockchain technology. From our experimental results, we showed how our method could be applied to real technology domains. Although the practical technology domain we used is blockchain, we believe that our proposed method can be extended to other technology fields. Once the target technology is determined, patent keyword analysis can be performed according to each step of the method proposed in this paper. Through this, we can conduct the patent analysis necessary for R&D planning, new product development, technology forecasting, and technology innovation required in technology management.
In addition, we derived the QRM based on CDF to analyze the patent keywords. The patent–keyword matrix, which is usually used for patent keyword analysis, contains a large number of zeros, making it difficult for us to use existing linear models. If there are too many zero values, the zero values dominate the model building, which reduces the model performance. To solve this problem of zero inflation, we studied and proposed the CDF-based QRM in this paper.
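The idea of a CDF-based quantile model with a location part and a dispersion part can be sketched in Python as follows. This is an illustration under assumptions, not the cdfquantreg implementation: it uses a logit-logistic form, $F(y) = \mathrm{expit}((\mathrm{logit}(y) - \mu)/\sigma)$, chosen here for simplicity, and this text does not state which distribution family was fitted in the experiments. The coefficient values and the names `cdf`, `quantile`, `beta0`, `beta1`, and `delta0` are hypothetical.

```python
import numpy as np


def logit(p):
    return np.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def cdf(y, mu, sigma):
    """CDF of the assumed logit-logistic model for a response y in (0, 1)."""
    return expit((logit(y) - mu) / sigma)


def quantile(gamma, mu, sigma):
    """Inverse of cdf: the gamma-th quantile, available in closed form."""
    return expit(mu + sigma * logit(gamma))


# Hypothetical submodels: the location depends on one keyword count x,
# while the dispersion is a constant modeled on the log scale.
beta0, beta1, delta0 = -2.0, 0.4, -0.5
x = np.array([0.0, 1.0, 5.0])
mu = beta0 + beta1 * x
sigma = np.exp(delta0)

median = quantile(0.5, mu, sigma)  # equals expit(mu) because logit(0.5) = 0
upper = quantile(0.9, mu, sigma)   # 90th percentile for each x
```

Because the CDF can be inverted explicitly, any quantile of the response follows directly from the fitted location and dispersion, which is what allows a single fitted model to describe both the typical level and the spread of a keyword's normalized frequency.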
In our study, we tried to identify the relationship structure between technologies through the PKA. Figure 3 was the final result obtained from our study. The technology diagram of Figure 3 provides a list of keywords that are statistically significant for blockchain technology. Therefore, in order to effectively develop blockchain technology, we must pay attention to detailed technologies based on these keywords. However, the results in Figure 3 do not provide any predictive information about the future technology of blockchain. In order to continuously develop blockchain technology, we need to predict the technology of blockchain. In addition, predicting the next technology related to the target technology will also be very meaningful in understanding the technology. To this end, it would be worthwhile to study how machine learning methods used to predict the next behavior of animals [35] could be adapted to technology forecasting. Just as past patterns of animal behavior can be analyzed to predict future behavior, past patterns of technological development can be modeled to predict future technologies.
6. Conclusions
This paper presents a statistical model in order to solve the zero-inflated problem. We collected patent documents related to blockchain technology and analyzed them using a statistical data analysis method. Blockchain technology is a data management technology with distributed secure applications in various domains such as the financial field of Bitcoin. This technology is based on decentralization, immutability, transparency, and security. In this process, we constructed a patent–keyword matrix using preprocessed data for statistical analysis. Each element of this matrix is a frequency value of a keyword’s occurrence in a patent document. Because most of the elements in this matrix are zero, we had difficulty analyzing this matrix using statistical analysis methods including the zero-inflated model. Therefore, we proposed a method of PKA to overcome the zero-inflated problem in the preprocessed patent data. Compared to existing single models such as LRM, we considered an analysis model consisting of two sub models representing location and dispersion. In addition, we changed the value of the response variable to a (0,1) interval. This is the concept of the CDF-based QRM.
In our experiment, we compared the model performance of the CDF-based QRM with LRM and ZIP to show the improved performance of our model. We searched the patents related to blockchain technology. The analytical results provided by the CDF-based QRM, LRM, and ZIP were evaluated using loglikelihood, AIC, and BIC. We found that all experimental results of the CDF-based QRM were better than those of the LRM and ZIP. Therefore, we showed the validity of the CDF-based QRM for our PKA. In the CDF-based QRM, we normalized the scale of the response variable to solve the zero-inflated problem and confirmed the improved performance of the proposed method.
In this paper, the proposed model was used to select technology keywords that have a statistically significant impact on blockchain technology. However, we had difficulty identifying technological relationships between patent keywords using our proposed model, even though understanding the interrelationship structure between the sub-technologies required for blockchain technology development is an important task in understanding this technology. This is a limitation of our study. To overcome this limitation, we consider social network analysis (SNA) and Bayesian learning. In future work, we will apply SNA to our CDF-based QRM to make a technology diagram representing the technological relations between the patent technology keywords. In addition, we will apply Bayesian learning to the CDF-based QRM. We call this Bayesian learning for QRM. In this model, we assume prior distributions for the parameters of the QRM. The prior distribution of the parameters is updated by combining it with the likelihood function of newly observed data to form the posterior distribution of the parameters. That is, the parameters become random variables with probability distribution functions rather than fixed values, and the model parameters are updated whenever new data are added. We expect this Bayesian learning process to improve the explanatory and predictive power of the QRM and to be effective in the analysis of a zero-inflated patent–keyword matrix. Our research is expected to contribute to various fields by improving understanding of technology and finding relationships between detailed technologies through our PKA.