1. Introduction
Utilizing Application Programming Interfaces (APIs) is common in software development. Studies indicate that approximately half of the method calls in Java projects are API-related [
1]. Harnessing the power of APIs for programming is of great value. Nevertheless, it is difficult for developers to master all APIs in a library. Therefore, to find proper APIs to use, engineers must study API documentation and other relevant documents and even search online for help. The study in [
2] shows that effective understanding and use of APIs depend on domain concepts, API usage patterns, and API execution facts such as inputs, outputs, and side-effects [
2]. Most existing API recommendation methods utilize API usage patterns [
3,
4,
5] and API execution facts [
6] that are collected from historical data sets. Some API recommendation studies are called with knowledge graphs [
7,
8] but only utilize Mashup-API co-invocation patterns and service category attributes [
7] or properties of the third-party libraries (e.g., version, groupID and language) [
8]. These studies do not use domain knowledge, such as domain concepts and the semantic relations among them, even though these are key concerns in knowledge-graph-based tasks [
9,
10].
Like API recommendations, test function recommendations focus on recommending test function(s) to realize a given test step as a query. However, there exist fundamental differences between recommending test functions and APIs. First, test functions are used in a confined domain, i.e., for testing a specific System Under Test (SUT), whereas APIs (e.g., standard and third-party libraries) are used in a more open application context. Second, test functions embody domain knowledge in more implicit ways; as discussed in our previous study [
11], test functions from industrial datasets have significantly fewer words to describe their functionalities than methods from standard API libraries, and a test function library is not intended to be used externally for testing other systems.
Since test functions are used in more confined domains and are usually developed to test a specific software product, as opposed to using APIs from libraries for programming, test engineers often search for test functions by reading through existing test scripts or obtaining help from senior peers. Hence, test function recommendations usually face the challenge called “cold start”, a phenomenon where effective inferences are hindered by insufficient historical data. Numerous API recommendation methods depend on historical data, e.g., feature request history [
3], Stack Overflow posts [
4], and clickthrough data from search engines [
12]. These methods work well when historical data are abundant; otherwise, their performance degrades significantly. According to the study reported in [
4], utilizing only API documentation as a data source is 16.9% as effective as utilizing both Stack Overflow posts and API documentation. Since domain knowledge can effectively compensate for the insufficiency of historical data [
13,
14], it is a valuable solution to overcome the “cold start” challenge.
The second challenge is called “semantic gap”, which denotes that gaps exist between the description of a test step and the description(s) of test function(s) implementing the test step [
4,
11]. One of the main causes of the semantic gap is that different engineers express concepts or test behaviors in different words, though they are equivalent in semantics. For example, “Synchronous Digital Hierarchy” and its abbreviation “SDH” are both used in the test step and test function description. Recommending test functions becomes difficult without the domain knowledge that “SDH” and “Synchronous Digital Hierarchy” are synonymous. The other reason is that the words or phrases used in test steps and test function descriptions are different but semantically related. For example, as we observed in the Huawei data set, the test step “create an EPL service” is implemented by test function “create_eth”, whose description is “create an Ethernet service”. Misunderstanding might happen when a test engineer does not know that the “EPL service” is a type of “Ethernet service”. In the past, efforts have been made to overcome the semantic gap using latent semantic analysis, but they still failed to recommend test functions accurately due to the lack of historical data [
11].
To address the aforementioned two challenges, we propose TOKTER, a test function recommendation approach based on a test-oriented knowledge graph that captures domain knowledge about the SUT and the test harness. The domain knowledge about the SUT comprises domain concepts that specify its functionalities, business flows, data definitions, etc., and the relations among them (e.g., synonymy, hyponymy, and association). The domain knowledge about the test harness refers to the structure in which a specific testing project is developed and managed, e.g., test cases, test steps, test functions and their parameters, and the relations among these artifacts. TOKTER first constructs a test-oriented knowledge graph from the corpus data of a given test project and then recommends test functions based on the constructed knowledge graph by exploiting the semantic relations between test steps (i.e., queries) and test functions according to their literal descriptions, test function parameters, and historical occurrences.
To evaluate TOKTER, we choose BIKER [
4], CLEAR [
5], and SRTEF [
11] as the comparison baselines and employ an industrial dataset from our industrial partner Huawei. We use the metrics of Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Mean Recall (MR) to evaluate their performance. Results show that TOKTER achieved 0.743 in MAP, 0.824 in MRR, and 0.804 in MR for the top-10 recommendation results and significantly outperformed the baselines by more than 36.6% in MAP, 19.6% in MRR, and 1.9% in MR.
Contributions. (1) We propose a knowledge-graph-based test function recommendation approach, which utilizes both domain knowledge about the SUT and test harness; (2) we propose three types of meta-paths explicitly designed for test-oriented knowledge graphs, description-based, parameter-based, and history-based meta-paths, to discover candidate test functions from different perspectives; (3) we implement TOKTER as an open-source tool (
https://anonymous.4open.science/r/TOKTER-78DE (accessed on 7 March 2024)); and (4) we evaluate TOKTER with an industrial dataset and demonstrate that it is practically useful and advances all the baselines.
Structure. Section 2 describes the background. TOKTER is presented in
Section 3. We present the experimental design in
Section 4, followed by experiment results and analyses in
Section 5. The related work is described in
Section 6, and the paper is concluded in
Section 7.
2. Background
A knowledge graph offers a practical and effective way to capture and represent concepts and the relations among them. It is often specified with the Resource Description Framework (RDF) standard, whereby a node corresponds to an entity or concept, and an edge between two nodes denotes a relation between the corresponding entities or concepts. Its formal definition is given as follows.
Definition 1 (Knowledge Graph). A knowledge graph is a directed graph $G = (E, R)$, where $E$ is the set of entities and $R$ is the set of edges among entities. Each edge has the form of (head entity, relation, tail entity) (denoted as $(h, r, t)$), indicating a relation $r$ from entity $h$ to entity $t$.
For example,
(cat, kind of, mammal) declares that a cat is a kind of mammal. As a heterogeneous network, a knowledge graph can contain multiple types of entities and relations. Knowledge graphs have been widely applied, for instance, in search engines [
10,
15], recommender systems [
16,
17], and question answering systems [
9,
18].
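As a minimal illustration of Definition 1, the sketch below stores a few such triples in a directed multigraph using NetworkX (the library we also use for graph storage; see Section 4); the entities and relations shown are purely illustrative.

```python
import networkx as nx

# A knowledge graph as a directed multigraph: nodes are entities/concepts,
# and each edge carries its relation name as an attribute (one triple per edge).
kg = nx.MultiDiGraph()
triples = [
    ("cat", "kind of", "mammal"),                            # example from the text
    ("EPL service", "kind of", "Ethernet service"),          # illustrative domain triple
    ("SDH", "synonym of", "Synchronous Digital Hierarchy"),  # illustrative domain triple
]
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Query all relations starting from "cat".
for head, tail, data in kg.out_edges("cat", data=True):
    print(f"({head}, {data['relation']}, {tail})")
```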
There are two kinds of knowledge graphs: cross-domain (or generic) ones, e.g., Freebase [
19], DBpedia [
20], and YAGO [
21], and domain-specific ones, e.g., MovieLens [
22]. Test function recommendation needs a domain-specific knowledge graph that captures the links between domain knowledge and a confined, specific test project rather than these public cross-domain knowledge graphs. In current studies on constructing domain-specific knowledge graphs, despite the adoption of complex, trained linguistic analyzers (e.g., dependency parsers), manual work by domain experts is still indispensable to achieve good results [
23,
24,
25].
To facilitate the construction of a knowledge graph, a network schema has been introduced to specify the structure of a knowledge graph at a meta-level (schema-level) [
26] and to define types of entities and relations. Its formal definition is provided as follows.
Definition 2 (Network Schema). A network schema, denoted as $T_G = (\mathcal{A}, \mathcal{R})$, is a meta template for a knowledge graph $G = (E, R)$ with the entity-type mapping $\phi: E \rightarrow \mathcal{A}$ and the relation-type mapping $\psi: R \rightarrow \mathcal{R}$, which is a directed graph defined over entity types $\mathcal{A}$, with edges as relations from $\mathcal{R}$.
For example, the bibliographic information network is a typical knowledge graph, also known as a heterogeneous information network [
26]. It contains four types of entities: papers (
P), venues (i.e., conferences/journals) (
C), authors (
A), and terms (
T). Each paper
has links to a set of authors, a venue, a set of words as terms in the title, a set of citing papers, and a set of cited papers. The network schema for this bibliographic information network is shown in
Figure 1a.
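To make Definition 2 concrete, the following sketch encodes the entity types and relations of this bibliographic schema as a small directed graph over types; the relation names are illustrative and not taken verbatim from [26].

```python
import networkx as nx

# Network schema: a graph over entity TYPES (not instances).
schema = nx.DiGraph()
entity_types = ["Paper", "Venue", "Author", "Term"]
schema.add_nodes_from(entity_types)

# Relations between entity types (illustrative names).
schema.add_edge("Paper", "Author", relation="written_by")
schema.add_edge("Paper", "Venue", relation="published_in")
schema.add_edge("Paper", "Term", relation="contains")
schema.add_edge("Paper", "Paper", relation="cites")

# An instance graph conforms to this schema via the mappings
# phi: entity -> entity type and psi: edge -> relation type.
print(list(schema.edges(data="relation")))
```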
In a knowledge graph, a meta-path is a sequence of relations connecting more than two entities. As shown in Figure 1b, an author and a paper can be connected via an A–P–C–P path. For a given author, this meta-path identifies papers published in the same venue as the author’s publications. In this meta-path, A represents the category of entities (authors) rather than an individual author; such a path defined over entity types is called a meta-path. Its formal definition is given below.
Definition 3 (Meta-path). A meta-path $P$ is a path defined on the graph of network schema $T_G = (\mathcal{A}, \mathcal{R})$, denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a new composite relation $R_1 \circ R_2 \circ \cdots \circ R_l$ between types $A_1$ and $A_{l+1}$, where $A_i \in \mathcal{A}$ and $R_i \in \mathcal{R}$ for $i = 1, \ldots, l$. Informative meta-paths in a knowledge graph are often engineered manually based on domain knowledge or expertise [27].
4. Experiment Design
Baseline. To evaluate TOKTER, we select three baselines. BIKER [
4] is a classic approach in API recommendation and is often used as a baseline in the literature [
5,
11,
43]. BIKER exploits Stack Overflow posts and API documentation and tackles the task-API knowledge gap. BIKER uses a word embedding model to represent question (query) text and API description text. CLEAR [
5] embeds whole sentences of queries and Stack Overflow posts with a Bidirectional Encoder Representations from Transformers (BERT) model that preserves semantically related sequential information. CLEAR is a state-of-the-art approach in API recommendation. We select CLEAR because it uses BERT and, hence, reflects the performance of Large Language Models (LLMs) on test function recommendation. We implemented and trained CLEAR according to the model settings described in [
5]. SRTEF [
11] exclusively recommends test functions based on weighted description similarity and scenario similarity. The former employs the Deep Structured Semantic Model (DSSM) to assess the relatedness between test steps and test functions based on their literal descriptions. The scenario similarity is calculated by considering both test scenarios and test function usage scenarios. We selected SRTEF as one of the baselines because it is the closest to TOKTER as both focus on test function recommendation.
Dataset. The dataset we use is from our industrial partner Huawei. The collaboration context is about testing a Network Cloud Engine-Transport (Huawei, Shenzhen, China) product, one of the components enabling the transformation of a transport network towards modern software-defined networking. The dataset consists of historical test cases, test function documentation, and SUT documentation. There are 6514 historical test cases implemented in Ruby, and each test case has 8.7 test steps on average. Since some test steps are included in more than one test case, there are, in total, 9924 unique test steps. Each test step contains one or several sentences written in Chinese embedded with some English terms and is implemented by 2.9 test functions on average.
Table 5 shows the number of test functions used to realize a test step. According to the statistics, we randomly select 500 test steps with the corresponding test functions proportionally as testing data to evaluate TOKTER, including 180 test steps implemented by one test function, 102 test steps implemented by two test functions, 70 test steps implemented by three test functions, 53 test steps implemented by four test functions, and 95 test steps implemented by more than five test functions. The test set is exactly the same as in our previous study SRTEF [
11].
The test function documentation contains 967 test functions. Each test function has a corresponding description in Chinese to explain its functionality and usage. There are a total of five documents related to the SUT. These documents encompass interface descriptions and functionality usages. We consider them as unstructured texts, which contain, in total, 109,978 sentences comprising 728,065 words.
Metrics. To compare TOKTER with the baseline approaches in terms of their effectiveness, we employ three metrics, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Mean Recall (MR), which are the classical metrics for information retrieval [
44] and have been frequently used in software engineering studies [
3,
4,
45,
46,
47,
48,
49]. MAP can be calculated with the following formulas:

$$AP(q) = \frac{1}{|F_q|} \sum_{i=1}^{|F_q|} \frac{i}{rank_i} \qquad (11)$$

$$MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) \qquad (12)$$

where $F_q$ represents the set of test functions that actually implement the test step (corresponding to query $q$), and $|F_q|$ is the size of this test function set. $rank_i$ represents the position at which the $i$th test function of $F_q$ appears in the recommendation list. For MAP, $Q$ represents all queries, and $|Q|$ is the number of those queries. For each query $q$ in $Q$, the average precision $AP(q)$ is calculated using Equation (11), and Equation (12) computes the mean of the $AP(q)$ values of all queries.

MRR can be calculated with the following formula:

$$MRR = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{rank_q^{first}}$$

where $rank_q^{first}$ is the ranking of the first test function in $F_q$ for query $q$.

MR can be calculated with the following formula:

$$MR = \frac{1}{|Q|} \sum_{q \in Q} \frac{hit_q}{|F_q|}$$

where $hit_q$ is the number of correctly recommended test functions for query $q$, and $|F_q|$ is the number of test functions that actually implement $q$.
Since in recommendation tasks, top-matched results are of great interest in practice, we evaluate TOKTER and the baselines with their top-10 recommendation results, which is commonly applied in API recommendation studies [
3,
45].
Statistical Test. We use the non-parametric Wilcoxon signed-rank test [
50], which has been widely used in API recommendation studies [
4,
5] to test whether there are statistically significant differences in the effectiveness and time cost of different approaches or strategies, with a significance level of 0.05. We also use the Spearman correlation analysis to assess the correlation between recommendation time and the number of domain concept entities connected to the query.
Tool Implementation. The tool contains two functionalities, test-oriented knowledge graph construction and test function recommendation, both implemented in Python. For the construction of the test-oriented knowledge graph, TOKTER utilizes the NLP tool HanLP [
41] for segmentation and dependency parsing, along with the NetworkX tool [
51] for representing and storing knowledge graphs. For test function recommendation, TOKTER implements an algorithm that searches the constructed knowledge graph for path instances matching the meta-paths and then ranks and recommends test functions based on the scores of all path instances. We implement TOKTER as an open-source tool.
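As an illustration of the path-instance search, the following simplified sketch enumerates paths that follow a given meta-path on a NetworkX graph; the node/relation type names and the absence of scoring are simplifications, not TOKTER's actual schema or implementation.

```python
import networkx as nx

def match_meta_path(kg, start, meta_path):
    """Enumerate path instances that follow a meta-path.

    kg:        MultiDiGraph; nodes carry a 'type' attribute, edges a 'relation' attribute.
    start:     the starting entity (e.g., a test step used as the query).
    meta_path: list of (relation, target_entity_type) steps.
    """
    paths = [[start]]
    for relation, target_type in meta_path:
        extended = []
        for path in paths:
            for _, nxt, data in kg.out_edges(path[-1], data=True):
                if data.get("relation") == relation and kg.nodes[nxt].get("type") == target_type:
                    extended.append(path + [nxt])
        paths = extended
    return paths

# Tiny illustrative graph: a query linked to a domain concept that is linked
# to a test function (stand-in relation/type names).
kg = nx.MultiDiGraph()
kg.add_node("Q1", type="TestStep")
kg.add_node("Ethernet service", type="DomainConcept")
kg.add_node("create_eth", type="TestFunction")
kg.add_edge("Q1", "Ethernet service", relation="related_to")
kg.add_edge("Ethernet service", "create_eth", relation="related_to")

print(match_meta_path(kg, "Q1", [("related_to", "DomainConcept"),
                                 ("related_to", "TestFunction")]))
# -> [['Q1', 'Ethernet service', 'create_eth']]
```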
Experiment Execution. For training the CLEAR baseline, we utilized cloud computing resources (running about 20 h on NVIDIA V100 GPU (Nvidia Corporation, Santa Clara, CA, USA) and Intel Golden 6240 CPU (Intel Corporation, Santa Clara, CA, USA) with CentOS 7.6 system). Due to the substantial scale of the BERT model used in CLEAR, fine-tuning the pre-trained BERT model on the dataset is time-consuming. The process of recommending test functions using the four methods was conducted on a PC equipped with AMD Ryzen 7 1700X CPU (AMD, Santa Clara, CA, USA), running Windows 10 (64-bit).
Research Questions (RQs). RQ1: How effective is TOKTER compared with the three baseline approaches? These baselines use different models (i.e., word embedding, BERT, and DSSM) to measure the semantic similarity between two sentences (e.g., queries, test steps) and improve the recommendation effectiveness with historical data. TOKTER relies on a test-oriented knowledge graph and uses meta-paths to recommend test functions. With RQ1, we aim to assess TOKTER’s effectiveness in recommending test functions, which justifies the need to employ a knowledge graph and the meta-paths.
RQ2: How is the time performance of TOKTER compared with the three baselines? To test the efficiency of TOKTER, we compared its time performance with the three baseline approaches. Furthermore, we discussed the factors that influence recommendation time.
RQ3: How does the size of the knowledge graph affect the effectiveness and time cost of TOKTER? Recall that TOKTER relies on domain experts to select relevant candidates among the domain concept entities automatically identified by TOKTER and to add those missed by TOKTER. Consequently, the size of the test-oriented knowledge graph, to a certain extent, relates to the manual effort required from domain experts. Hence, we propose this RQ to study how the size of the knowledge graph affects the effectiveness and time cost of TOKTER.
RQ4: How effective are different meta-paths at contributing to TOKTER? Recall that TOKTER uses different types of meta-paths: description-based, parameter-based, and history-based meta-paths. With this RQ, we aim to investigate how using different meta-paths can influence the test function recommendation effectiveness and time cost of TOKTER.
5. Results and Analyses
5.1. Test-Oriented Knowledge Graph Generated with TOKTER
Descriptive statistics of the generated test-oriented knowledge graph are summarized in
Table 6. From the table, we can see that, after extracting entities and relations from the test function library, we obtained, in total, 967 test functions with 3038 function parameters. From the test case library (historical test cases), TOKTER extracted 9924 unique test steps, 500 of which were used as testing data. Hence, the remaining 9424 test steps were incorporated as test step entities in the test-oriented knowledge graph.
TOKTER initially acquired 8931 candidates of domain concept entities (
Section 3.1.3). After a review process by two domain experts, 1536 suitable candidates were selected, and 142 domain concepts missed by TOKTER were added. In fact, the ranking metric defined in Formula (5) has been very helpful for filtering domain concept entities. Domain experts selected 56 domain concepts from the top 100 candidates and 474 domain concepts from the top 1000 candidates. The average value of this metric for the selected 1536 domain concepts is 1469.27, while the average
for all candidates is 658.559. Eventually, we obtained 1678 (1536 + 142) domain concept entities. The review process took two days, which is acceptable to our industrial partner. We also want to point out that this review process is a one-time effort. When the knowledge graph needs to be updated, it is just about updating the domain concepts in the knowledge graph (mostly adding new domain concepts), followed by the automatic extraction of relations with TOKTER.
After entity recognition, TOKTER extracted relations (Section 3.1.4), resulting in 203
synonymy, 290
hyponymy, 6750
association, 57,462
related_to_dc, 26,238
implementation, and 3038
containment relations. Note that domain experts do not need to check these relations manually.
5.2. RQ1: Evaluating TOKTER’s Effectiveness
To answer this RQ, we evaluated TOKTER with all meta-paths applied by comparing it with the three baselines in terms of their performance on the same industrial dataset (
Section 4). Results are presented in
Table 7.
From
Table 7, we observe that TOKTER performed the best, followed by SRTEF and CLEAR with comparable performance, and BIKER, which performed the worst. This is because BIKER’s word embedding model is weaker in capturing semantics than the DSSM model employed by SRTEF and the BERT model used by CLEAR. We conducted the Wilcoxon signed-rank test to compare the performances of TOKTER and the baselines with the 500 queries (we used the SPSS tool to perform these tests, and the results were also put into the open source repository). Results show that TOKTER achieves significantly better performance than all baselines (
p < 0.05). The standard deviations of the average precision (AP) metric for the BIKER, SRTEF, CLEAR, and TOKTER methods on 500 test queries are 0.121, 0.122, 0.132, and 0.134, respectively, which are quite small. TOKTER performed the best (0.743 in MAP, 0.824 in MRR, and 0.804 in MR) because it explores semantic relations between queries and test functions by utilizing domain knowledge capturing relations among domain concepts from the SUT and test steps, test functions, and their parameters of a test project. Moreover, TOKTER searches for semantically matched test functions comprehensively from descriptions, function parameters, and historical data via the meta-paths.
5.3. RQ2: Evaluating TOKTER’s Time Performance
To answer this RQ, we collected time spent by TOKTER and the three baselines on preparation and recommendation. Results are presented in
Table 8.
During the setup phase, the primary task for the three baselines is training their respective models. In the case of TOKTER, the time cost for setup is mainly about caching all paths in the test-oriented knowledge graph to accelerate the recommendation. As shown in
Table 8, TOKTER took a shorter setup time than CLEAR and SRTEF. CLEAR’s training time is the longest due to the significantly large scale of the BERT model (with 134 million parameters) used in CLEAR.
Regarding the recommendation time, from the table, we can see that TOKTER, SRTEF, and CLEAR took less than one second per query, while BIKER took 2.4 s per query. The main reason is that BIKER transforms each word in a text description into a word vector using its word embedding model, which requires a looped comparison for each word in the sentence. SRTEF and CLEAR, instead, represent queries, test steps, and test functions as vectors, and comparing vectors is computationally efficient. TOKTER searches for relevant test functions through the test-oriented knowledge graph, which requires a bit longer time than SRTEF and CLEAR. However, 0.6 s per query is acceptable in industrial practice.
The distribution of recommendation time for each query is shown in
Figure 7. We can observe from the figure that TOKTER is relatively consistent, with recommendation times clustering around 0.4–0.8 s. Only 10 out of 500 cases took 1.2–2.0 s. We also notice that there are 53 cases for which TOKTER performed very efficiently (within the range of 0.2–0.4 s). After carefully checking these cases, we found that their queries have relatively few connections to domain concept entities in the test-oriented knowledge graph.
Furthermore, we conducted a Spearman’s rank correlation analysis between recommendation time and the number of nodes linked to the query Q in the knowledge graph; these nodes are all domain concept entities, and their number ranges from 1 to 17. The correlation coefficient is 0.891 (p-value < 0.05), indicating a significant and very strong positive correlation. It is also easy to understand that the more entities connected to a query, the more paths need to be searched, resulting in a longer recommendation time.
5.4. RQ3: Impact of the Knowledge Graph Size
To answer this RQ, we evaluate the effectiveness and time cost of TOKTER when being configured with test-oriented knowledge graphs at different scales. As shown in
Table 6, among the four types of entities in the knowledge graph, the proportion of domain concept entities is roughly 11%, which is the smallest. Once the test function library and historical test cases are determined, the quantities of test functions, parameters, and test steps remain constant across different knowledge graphs. Therefore, the quality and time cost of knowledge graph construction depend solely on the domain concept entities. Thus, variations in the size of the knowledge graph are reflected in the number of included domain concept entities.
In real projects, it is often seen that some domain concepts are more frequently used than others. In the domain of the SUT of this work, the minimum frequency of domain concepts is 9; the maximum is 48,547; the average is 530.01; and the median is 102. The frequency distribution of domain concepts is shown in
Figure 8. More frequently used domain concepts are typically given priority in the construction of knowledge graphs. Therefore, we sorted all valid domain concepts in descending order of their occurrence frequencies in the corpus and then sequentially selected the top 10%, top 20%, …, and 100% of the domain concepts to form 10 test-oriented knowledge graphs.
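The selection of domain concepts at different scales can be sketched as follows; build_kg is a hypothetical stand-in for TOKTER's graph construction step, not part of the released tool's API.

```python
def top_fraction_concepts(concept_freq, fraction):
    """Return the most frequent domain concepts covering the given fraction.

    concept_freq: dict mapping domain concept -> frequency in the corpus.
    fraction:     0.1, 0.2, ..., 1.0
    """
    ranked = sorted(concept_freq, key=concept_freq.get, reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    return ranked[:cutoff]

# Ten knowledge graphs at increasing scales (build_kg is a hypothetical constructor).
# graphs = [build_kg(top_fraction_concepts(freqs, f / 10)) for f in range(1, 11)]
```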
Table 9 summarizes the effectiveness and time cost of TOKTER with test-oriented knowledge graphs of various sizes. Results show that with 10% of domain concepts, TOKTER only achieved ≈0.11 MAP, MRR, and MR. As the size of the graph increases, TOKTER’s effectiveness rapidly improves, implying that the completeness of the domain concepts in the knowledge graph has a great impact on TOKTER’s recommendation. Regarding the time cost (i.e., recommendation time), along with the increase in the number of domain concept entities in the knowledge graph, TOKTER’s recommendation time gets longer. This is because queries can be connected to more domain concept entities, leading to a growth in the number of path instances from the queries to test functions based on meta-paths, which need to be calculated.
From
Figure 9, we can easily notice that once the size of the knowledge graph increases to ≈60%, the gains in TOKTER’s effectiveness stabilize. We further conducted the Wilcoxon signed-rank test to compare the difference in the effectiveness and time cost of TOKTER when being combined with the knowledge graphs of two adjacent sizes (e.g., 0.2 and 0.3 of Top% domain concepts). Results show that once the size of the knowledge graph increases to ≈60%, the change in recommendation effectiveness is no longer significant. However, the recommendation time increases significantly with the increase in the knowledge graph’s scale. This message is important as it implies that there is a diminishing marginal benefit of continuously enriching the domain concepts in the test-oriented knowledge graph. Specifically, with ≈30% of the domain concepts, TOKTER can already achieve 83.3% of the optimal results in MAP, MRR, and MR. This observation is close to the Pareto principle, also known as the 80–20 rule, specifying that 80% of consequences come from 20% of the causes. This observation is useful as it helps determine the sufficiency level of the domain concepts while constructing the test-oriented knowledge graph, especially considering that manual effort from domain experts is required to verify domain concepts automatically extracted by TOKTER.
5.5. RQ4: Meta-Paths’ Contributions to TOKTER’s Performance
To answer this RQ, we set up TOKTER with seven meta-path strategies, covering all possible combinations of one, two, or all three meta-path types, as listed in
Table 10.
Table 10 shows that using any type of meta-paths alone is less effective than using all. In comparing the use of a single type of meta-paths, TOKTER with the parameter-based meta-path alone achieved the worst recommendation result, and TOKTER with only history-based meta-path achieved the best. We further conducted the Wilcoxon signed-rank test on data collected from the 500 queries. Results show that using all meta-paths is significantly better than all other strategies; the history-based strategy contributes statistically significantly more than the parameter-based and description-based strategies regarding all effectiveness metrics; and introducing the description-based strategy leads to a significant improvement in TOKTER’s effectiveness. Moreover, introducing the parameter-based strategy does not lead to a significant improvement because function parameters complement function descriptions, and the parameter-based meta-paths are expected to be combined with description-based ones to create positive effects on TOKTER’s performance. Regarding the time cost, TOKTER with the history-based strategy used significantly more time than those without, as expected.
5.6. Threats to Validity
Internal validity. TOKTER’s test-oriented knowledge graph construction heavily relies on NLP techniques, such as segmentation and dependency parsing. These tools, including the HanLP tool we used, do not achieve 100% accuracy. However, TOKTER remains highly effective because NLP techniques are relatively mature, their inaccuracies are generally minimal, and imprecise outputs still fall within an acceptable range in practice. Moreover, TOKTER is robust, as evidenced by the results for RQ3 (
Section 5.4); when slightly reducing the size of the knowledge graph, there is minimal effect on the effectiveness of the recommendations.
External validity. In our experiment, we only employed one dataset from one domain, i.e., Network Cloud Engine-Transport. However, the methodology of TOKTER itself is general; it can be applied to other domains as long as three types of data are available: test function documentation, SUT documentation, and historical test cases. Though TOKTER currently only supports processing texts in Chinese, the entity recognition and rule-based relation extraction in the knowledge graph construction process are language-independent, and the metrics (e.g., mutual information and branch entropy) and rules involved are language-agnostic; hence, TOKTER can easily support other natural languages, as long as proper segmentation tools and dependency parsers are available.
Construct validity. This refers to the suitability of our evaluation metrics. We use MAP, MRR, and MR for comparing TOKTER and the baselines. These metrics are classical evaluation metrics for information retrieval [
44] and are widely used for software engineering research [
3,
4,
46,
47,
48,
49].
6. Related Work
In the literature, SRTEF [
11] was proposed to recommend test functions based on weighted description and scenario similarity. However, because it does not utilize domain knowledge, SRTEF fails to address the challenges of “cold start” and “semantic gap” (see
Section 1), especially when dealing with insufficient historical data.
Many works have been proposed in API recommendation. For instance, Thung et al. [
3] proposed to learn from issue management and version control systems and compare the literal description of a requested feature with the literal description of the API method. Rahman et al. [
45] proposed recommending a ranked list of API classes for a given query in natural language based on keyword-API mappings mined from Stack Overflow questions and corresponding answers. Rahman et al. [
52] further proposed NLP2API to automatically identify relevant API categories for queries in natural language, such that a query statement can be organized with the obtained API categories to improve API recommendation performance. Raghothaman et al. [
12] proposed the tool SWIM to rank APIs with a statistical word alignment model based on the clickthrough data from the Bing search engine and provide code snippets to illustrate the usage of ranked APIs. Gu et al. [
53] proposed DeepAPI, which adapts a Recurrent Neural Network (RNN) Encoder–Decoder model to transfer a word sequence (a user query) into a fixed-length context vector based on which a sequence of APIs will be recommended.
Works have been proposed to use knowledge graphs for API recommendation. For instance, Wang et al. [
7] proposed an unsupervised API recommendation approach utilizing deep random walks on a knowledge graph constructed from Mashup-API co-invocation patterns and service category attributes. Zhao et al. [
8] proposed KG2Lib, which leverages a knowledge-graph-based convolutional network and utilizes libraries called by projects to make recommendations. The knowledge graph in KG2Lib only includes properties of the third-party libraries (e.g., version, groupID). Kwapong et al. [
54] presented a knowledge-graph framework built from web API and mashup data as side information for API recommendation. The entities (e.g., tags, categories) and relationships (e.g., “belongs_to”, “invokes”) in their knowledge graph are extracted from structured information and do not involve domain knowledge.
None of these related works incorporates domain knowledge. TOKTER, however, greatly benefits from incorporating domain knowledge, allowing for more precise recommendations of test functions.