1. Introduction
In the field of natural language processing (NLP), relation extraction (RE) represents a significant research domain. The primary objective of RE is to enable models to learn the characteristics of relation categories from training data, thereby facilitating the identification of relations between entities in text. As depicted in Figure 1, both supervised RE [1,2,3,4] and few-shot RE [5,6,7,8] are key methodologies. In supervised RE, models learn the features of the relation types directly from the training data. In contrast, few-shot RE leverages a limited quantity of labeled data from the support set for training or fine-tuning, requiring models to accurately classify categories with only a few available samples. Consequently, the construction of few-shot RE datasets is more challenging and complex: the selected samples must not only be sufficient in number but also varied enough for models to generalize effectively.
Currently, there are dozens of widely used supervised RE datasets, such as CoNLL04 [9], SemEval-2010 Task 8 [10], TACRED [11], etc. However, for few-shot RE, the most widely used dataset is FewRel [12]. Moreover, current publicly available few-shot RE datasets are primarily focused on English, resulting in limited research on few-shot RE for other languages. In Chinese few-shot RE research [13,14,15,16], the datasets used in experiments are typically constructed individually, leading to significant variations between them. This lack of consistency has resulted in the absence of a standardized evaluation framework.
To address these challenges, we introduce a few-shot RE dataset for Chinese (FREDC), which encompasses nearly 100 types of relations across general, medical, and financial domains. By integrating and optimizing existing high-quality Chinese information extraction datasets, FREDC is designed to meet the diverse needs of few-shot RE research, including tasks such as cross-domain and “None-of-the-Above” (NOTA), as defined in the FewRel benchmark. Unlike previous Chinese few-shot RE datasets, FREDC is not restricted to a specific research setup.
Additionally, in the cross-domain experiments, the complexity of Chinese data in the economic and medical fields makes it difficult for models to determine relations based solely on entity labels, adding to the challenge of the FREDC dataset. To validate the effectiveness of FREDC, we made structural adjustments to several few-shot models originally designed for English, enabling them to process Chinese text to some extent. We conducted a series of experiments using these models, and the comparative results show that FREDC yields effective results across various few-shot RE settings while presenting greater challenges to models than English datasets. In our experiments, we use accuracy as the performance evaluation metric; the maximum accuracy difference is about 30% in complex tasks such as cross-domain and NOTA. Furthermore, our experiments also demonstrate that as the number of samples in the support set increases, the impact of the language difference between Chinese and English gradually diminishes.
The contributions of this article are as follows:
We present FREDC, a dataset built from Chinese linguistic resources and specifically designed for few-shot RE. It encompasses nearly a hundred distinct relation categories and incorporates specialized knowledge from multiple domains, thereby substantially mitigating the scarcity of publicly accessible Chinese datasets for few-shot RE.
We align the format of FREDC with current mainstream datasets, allowing it to be applied directly to various few-shot task settings, such as cross-domain and NOTA. We also conduct experiments to verify the effectiveness of the dataset.
We modify the architecture of several existing baseline models so that these models, originally designed for English data, can process Chinese text. This provides valuable baselines for few-shot RE in Chinese.
2. Related Work
The quantity and quality of datasets are pivotal to the progression of RE. The research conducted by Bassignana [16] on RE datasets has revealed a significant evolution in target domains over the past five years: initially focused on news texts, the emphasis has shifted to web texts, and more recently to scientific publications and Wikipedia. Concurrently, there is an increasing emphasis on multilingual and few-shot methodologies. There are currently approximately 17 mainstream datasets, but datasets specifically tailored for few-shot RE remain very scarce. From our survey, the majority of datasets for few-shot tasks are based on English. Furthermore, researchers approach dataset construction with two main objectives: to ensure the quality of the dataset and to introduce new challenges through it. Quality assurance involves focusing on the volume of data and the accuracy of annotations. Introducing challenges involves enriching data diversity and increasing the complexity of identification, thereby enhancing the ability of models to address real-world problems.
The first large-scale dataset introduced for English few-shot RE research is FewRel 1.0, which has become the most frequently used dataset in the field. Developed by Han et al. [12] using the Wikipedia corpus, FewRel 1.0 encompasses nearly a hundred relation types and tens of thousands of high-quality annotated instances, identified through distant supervision and refined by crowd-workers. Building upon this, Gao et al. [17] introduced FewRel 2.0, an enhanced and expanded version of FewRel 1.0. It includes a new test subset drawn from other domains and introduces NOTA, enabling further investigation into whether models can adapt to new fields with limited examples or recognize that a query expresses none of the given relations (NOTA). Recently, Borchert et al. [18] presented CORE, a few-shot RE dataset focusing on corporate relations and entities. The information richness embedded in business entities allows models to focus on contextual nuances, reducing their reliance on superficial clues such as relation-specific verbs, which improves cross-domain RE. Besides datasets specifically designed for few-shot RE tasks, some few-shot RE datasets are generated by adapting supervised RE datasets. Sabo et al. [19] argued that the evaluation framework of FewRel is an unrealistic benchmark and proposed a methodology for deriving more realistic few-shot test data from supervised RE datasets, converting the supervised TACRED [11] into a few-shot TACRED variant through realistic episode sampling. Popovic et al. [20] argue that document-level corpora provide more realism, particularly regarding NOTA distributions, and therefore present FREDo, a few-shot document-level RE (FSDLRE) benchmark. FREDo uses the documents in the development corpus of DocRED [21] as the test corpus for its in-domain task, the DocRED training corpus as the basis for its training and development sets, and SciERC [22] as its cross-domain test subset.
When compared to the extensive availability of English few-shot RE datasets, Chinese counterparts remain significantly scarce. The majority of Chinese few-shot RE datasets are constructed by researchers for their individual studies. At present, research in Chinese few-shot RE primarily focuses on addressing issues within existing few-shot RE models and exploring new approaches for Chinese. Ji et al. [13] built a Chinese dataset based on FinRE [23] and sought to improve model performance with noisy data. FinRE is drawn from 2647 financial news articles and contains 13,486, 3727 and 1489 relation instances for training, testing and validation, respectively. It covers 44 relations, including a special relation NA, which indicates that there is no relation between the marked entity pair. In preparing the dataset for few-shot RE tasks, the training and testing subsets were repartitioned by relation type: 30 relation types were allocated for training, while 14 were reserved for testing and validation. However, this dataset is not applicable to most RE studies because its data come from a single domain and noisy data have been added. Ren et al. [14] delved into few-shot RE for Chinese-specific domains. They first created Bridge-FewRel, a dataset derived from 1300 actual bridge inspection reports; it consists of 8430 manually labeled sentences and encompasses four entity categories and sixteen relation categories, with the training and test sets each including eight distinct relation types. They then developed TinyRel-CM, a dataset compiled from Chinese healthcare websites and manually screened and labeled by five TCM students; it covers four entity types and 27 binary relations, with 50 instances per relation. While these two datasets extend the study of few-shot RE to domain-specific scenarios, they are small in terms of entity and relation types, and their relation definitions are relatively simple. Wu et al. [15] developed a Chinese social media dataset for few-shot RE. They gathered and annotated data from sources such as web crawling and open-source projects, resulting in a dataset with 6000 entries and 10 + 1 (NOTA) categories; each entry consists of texts with five distinct attributes. Furthermore, Li et al. [16] presented TinyACD-RC, a dataset for few-shot RE in ancient Chinese texts. The dataset was compiled from the "Shih Chi", featuring 1600 data entries manually annotated by historians. It has 11 entity pair types and 32 relation classes, of which 24 relations are used in the training set and 8 in the test set. The sentences are structurally intricate and frequently contain abbreviations, and a third round of evaluation was incorporated to ensure quality. TinyACD-RC aims to mimic digital humanities research scenarios, offering a foundation for studying relations in ancient documents. Due to the specificity of its content, a model can only learn features of ancient Chinese from this dataset, which narrows its applicability.
In the previously mentioned Chinese few-shot relation classification datasets, several notable problems remain. Their applicability is limited because they were constructed with specific research objectives in mind; the total volume of data and the variety of relations covered are both insufficient; and they often fail to meet the requirements of multiple few-shot tasks at the same time. To address these issues, our research reorganizes and enhances several existing information extraction datasets to create a higher-quality dataset that encompasses a rich diversity of relation types and a substantial amount of data. Our experimental results indicate that this dataset is suitable for a wide range of few-shot learning tasks and provides a new baseline for comparison in Chinese few-shot RE tasks.
3. FREDC
3.1. Data Collection
In order to ensure the integrity and efficacy of the dataset construction process, we have elected to modify and calibrate pre-existing information extraction datasets to suit the demands of few-shot RE tasks. Furthermore, to fulfill the criteria of cross-domain experimentation, it is imperative to establish a distinct variation in relation types between the test subset and the training set. As such, we have designated general domain data as the principal training set, while utilizing medical and financial domain data for cross-domain testing purposes.
We anticipate that FREDC will meet the requirements of various experimental configurations. The general domain data are not only essential for training but must also support in-domain testing; therefore, this portion of the dataset should include a substantial volume of data and cover a wide range of relation categories. In the current landscape of open-source large-scale Chinese information extraction datasets, only two or three widely used Chinese RE datasets have been professionally annotated by human teams and are openly accessible. Among these, DuIE2.0 stands out for its substantial data volume and broad coverage across domains, which facilitates the expansion of features acquired by models during the learning phase. For cross-domain testing data, we chose to focus on datasets from the economic and medical fields, as these are commonly used in research studies. Due to the sensitive nature of content within both sectors, available datasets are even more limited. Therefore, we selected the openly accessible CMeIE-V2 and BBT-FinRE datasets as sources for our cross-domain testing. The three chosen data sources all feature relatively conventional relation types, which makes it easier to expand and modify them appropriately by collecting existing data from the web. The details of the data sources are shown in Table 1.
DuIE2.0 [24], a large-scale Chinese RE dataset, serves as a suitable foundation for general domain data. It is sourced from Baidu Encyclopedia, Baidu Information Flow, and Baidu Tieba, encompassing diverse content such as news reports, entertainment information, and user-generated content. The data are filtered and annotated by crowd-workers, resulting in a dataset with 210,000 sentences and 450,000 instances, covering 49 commonly used relations.
To obtain medical data for cross-domain testing, we selected CMeIE-V2 as the primary data source. This dataset includes approximately 75,000 triplets, 28,000 disease-related statements, and 53 distinct relation categories. It primarily focuses on pediatric information and over a hundred common diseases, providing details on 518 pediatric and 109 frequent illnesses. Notably, CMeIE-V2 incorporates numerous Chinese technical terms and disease-specific vocabulary, such as "multiple myeloma", "non-small cell lung cancer", and "electrophoresis bone marrow examination". These complex terms present significant challenges for text processing, adding complexity to the cross-domain RE task. For financial data, we selected the RE section of the Chinese open-source corpus BBT-Fin [25]. This dataset contains over 10,000 sentences and 44 relation categories, including finance-specific ones like "shareholding" and "acquisition". The corpus is derived from various financial documents, ensuring a high degree of professionalism and reliability. In the financial sector, entities such as "Coca-Cola" can represent a company, a stock, or a product, depending on the context. Accurate interpretation of these entities within their specific contexts is essential for effective relation extraction.
3.2. Data Optimization
After selecting the appropriate data sources, we analyzed their contents and identified several issues. First, we needed to standardize the dataset format to meet the requirements of the few-shot RE task. This process involved categorizing data by relation types and redistributing it into training, validation, and test subsets. To achieve this, it was necessary to re-screen the dataset to determine the relation type for each entry. Furthermore, to prevent the model from overfitting specific relations during training, it was essential to maintain a balanced distribution of relation types in the training data and ensure similar balance in the validation and test datasets for accurate evaluation.
For the DuIE2.0 dataset, two main issues were identified: some sentences contained multiple relation types or complex multi-relation structures, and the data were unevenly distributed across relation types. Since current few-shot RE research primarily focuses on sentences with a single relation type, we first set aside sentences with multiple relations and then filtered and reconstructed them so that each sentence in the dataset contained only one relation type. We also adjusted the amount of data for each relation type in both the training and test sets to ensure a balanced distribution across types. After this processing, we curated a set of 38 viable relation types, comprising over 100,000 instances of data.
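To make the preprocessing concrete, the following is a minimal sketch of the single-relation filtering and per-relation balancing step described above. It assumes a DuIE-style input in which each line is a JSON object with a "text" field and an "spo_list" of triples; the field names, thresholds, and function names are illustrative rather than the exact FREDC preprocessing code.

```python
import json
import random
from collections import defaultdict

def filter_single_relation(path):
    """Keep only sentences whose triples all share one relation type."""
    kept = defaultdict(list)          # relation type -> list of instances
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            predicates = {spo["predicate"] for spo in item.get("spo_list", [])}
            if len(predicates) != 1:  # drop multi-relation or unlabeled sentences
                continue
            kept[predicates.pop()].append(item)
    return kept

def balance(kept, max_per_relation=3000, min_per_relation=100, seed=0):
    """Down-sample frequent relations and drop relations that are too rare."""
    rng = random.Random(seed)
    balanced = {}
    for relation, items in kept.items():
        if len(items) < min_per_relation:
            continue                   # too few samples to build support sets
        rng.shuffle(items)
        balanced[relation] = items[:max_per_relation]
    return balanced
```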
The CMeIE-V2 dataset faced similar challenges with multi-relation sentences and also included some relation types with insufficient samples, limiting the construction of support sets. Because these relation types are highly specialized, extensive expansion was impractical, so we excluded them. For relation types whose data volume could reasonably be expanded, we employed ChatGPT to generate additional sentences of the designated types, yielding nearly 700 additional instances. After processing, we identified 29 usable relation types, totaling around 4000 entries.
In the BBT-Fin corpus, entity types were initially undefined, with only relation types specified, so we first defined the entity types corresponding to the relations. Additionally, the dataset included both active and passive forms of the same relation, such as "acquisition" and "being acquired"; we opted to retain only the active forms. Following further refinement, we finalized 22 usable relation types, amounting to approximately 3000 entries. To preserve the balance of the test set within this domain, we collected and manually annotated 200 additional instances from 67 financial news items on Sina Finance.
3.3. Dataset Analysis
The structure of FREDC, as depicted in Figure 2, encompasses data from three distinct fields, complete with their respective training, validation, and test subsets, and the associated triplet configurations. Furthermore, the dataset also includes comprehensive explanations of all relation types.
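Because FREDC follows the format of mainstream few-shot RE datasets, a FewRel-style entry would look roughly like the hypothetical example below: a mapping from relation type to a list of instances, each carrying the sentence, the head/tail entity mentions, and their character spans. The field names and the example sentence are illustrative; the released files may differ in detail.

```python
# Hypothetical FREDC entry in a FewRel-style layout (illustrative only).
fredc_example = {
    "毕业院校": [                                    # relation: "graduated from"
        {
            "tokens": list("张三毕业于北京大学。"),     # sentence as a character sequence
            "h": {"name": "张三", "pos": [0, 2]},      # head entity and its character span
            "t": {"name": "北京大学", "pos": [5, 9]},   # tail entity and its character span
        },
    ],
}
```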
Additional details of FREDC are presented in Table 2. Using general domain data for training, we filtered out 28 relation types for the training set; ten other relation types were allocated for testing. Data from the two specialized fields were reserved to evaluate the cross-domain RE capabilities of the models and therefore required only division into validation and test subsets. To support subsequent few-shot RE experiments, we designated 10 relation types for the validation subset within each domain, enabling a foundational 10-way-1-shot test. After excluding the relation types used for validation in both domains, the remaining data were allocated to the test subset. As a result, the medical domain contained 19 relation types for testing, while the financial domain included 12. Additionally, the validation subset data from both domains could be repurposed as test data for cross-domain evaluations.
4. Experiments
4.1. Task Formulation
We employ N-way-K-shot evaluation for our few-shot RE experiments, with the accuracy of the predictions serving as our primary evaluation metric. In these experiments, the training, validation, and test subsets of the dataset are distinct in terms of relation types. Model evaluation is performed by extracting batches from the test subset, with each batch consisting of (R, S, x, r), where R is the extracted set of relations, r is the correct relation label for the query instance x, and S is the support set containing K instances for each relation in R. When performing few-shot RE, the model predicts the relation label for the query instance x based on S and R, and verifies its correctness against r to assess the effectiveness of models. The accuracy formula for the predictions of models is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP means the prediction is positive and correct; FP means the prediction is positive but incorrect; FN means the prediction is negative but the actual case is positive; and TN means the prediction is negative and correct.
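As an illustration of the episode construction described above, the sketch below samples one (R, S, x, r) evaluation episode from a relation-keyed dataset. The function and variable names are ours; a real evaluation loop would repeat this sampling many times and average the accuracy.

```python
import random

def sample_episode(data, n_way=5, k_shot=1, rng=random):
    """Sample one (R, S, x, r) evaluation episode.

    `data` maps each relation type to its list of instances, as in the
    FewRel-style layout sketched earlier; each relation is assumed to have
    at least k_shot + 1 instances.
    """
    relations = rng.sample(list(data.keys()), n_way)        # R: episode relation set
    query_rel = rng.choice(relations)                        # r: true label of the query
    support, query = {}, None
    for rel in relations:
        pool = rng.sample(data[rel], k_shot + (rel == query_rel))
        support[rel] = pool[:k_shot]                         # S: K instances per relation
        if rel == query_rel:
            query = pool[k_shot]                             # x: held-out query instance
    return relations, support, query, query_rel
```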
In our primary baseline model testing, we utilized two evaluation setups: 5-way-1-shot and 10-way-1-shot. We also incorporated cross-domain RE evaluation into this experiment. The cross-domain RE experiment emphasizes applying RE technology across different fields or datasets, primarily tackling challenges such as scarce data annotation and efficient model migration between domains; this test effectively assesses the generalization capabilities of models. Furthermore, we compared the performance of different models on FREDC and FewRel2.0 in both cross-domain and NOTA experiments. The NOTA experiment reflects real-world scenarios in which the model may encounter sentences that lack a specific relation or do not belong to the given relation set. Since such sentences are predominant in real texts, the experiment introduces the NOTA relation, obliging the model to make judgments on it. Concretely, during the testing phase we add query instances whose relation types do not appear in the support set, and we set a threshold to control the proportion of these NOTA queries.
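The NOTA setup can be pictured with the following sketch, in which each query is replaced, with probability equal to the chosen threshold, by an instance drawn from relations outside the episode's relation set and labeled NOTA. This is an assumed reading of the thresholded sampling described above; the official FewRel 2.0 sampler and our exact implementation may differ in detail.

```python
import random

NOTA = "none-of-the-above"

def sample_query_with_nota(data, episode_relations, nota_rate=0.5, rng=random):
    """Draw one query instance, substituting a NOTA query at rate `nota_rate`.

    A NOTA query is sampled from relations *outside* the episode's relation
    set R, so the correct answer is the NOTA label rather than any relation
    in R.
    """
    if rng.random() < nota_rate:
        outside = [r for r in data if r not in episode_relations]
        rel = rng.choice(outside)
        return rng.choice(data[rel]), NOTA
    rel = rng.choice(list(episode_relations))
    return rng.choice(data[rel]), rel
```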
4.2. Baselines
In order to effectively evaluate the usability and effectiveness of FREDC, we selected several leading few-shot RE models as our baseline references. Because studies on Chinese few-shot RE are still limited and most of them are not open-source, it is difficult to obtain Chinese few-shot RE models; moreover, in these closed-source studies, most models are optimized for the authors' own research directions and may not be suitable for constructing evaluation baselines. Considering these factors, we selected several well-performing English few-shot RE models for testing. Since these models have been validated by numerous studies, experiments on them yield results that better illustrate the effectiveness of our dataset. Snell et al. [26] introduced Proto, which generates prototypes for relation types using examples from the support set and then matches query instances with these prototypes by calculating similarity scores. Gao et al. [17] developed the BERT-Pair model, which assesses whether a query instance has the same relation type as a support instance by concatenating the two and computing the similarity probability. Baldini Soares et al. [27] proposed the BERTMB model, which employs an unsupervised relation representation learning approach known as "masking entities", where entities are represented with specific markers at their corresponding positions. This model learns semantic relations by randomly substituting entities, exhibiting strong generalization capabilities without requiring additional fine-tuning for RE tasks. Han et al. [28] proposed the HCRP model, which focuses on learning challenging relation types through a hybrid strategy that incorporates additional information from relation descriptions when creating prototypes. HCRP dynamically adjusts weights based on the tasks at hand and utilizes contrastive learning techniques to improve the discriminative power of the prototypes.
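For concreteness, the core classification step of the Proto baseline can be sketched as follows: support embeddings are averaged per relation into prototypes, and the query is assigned to the nearest prototype. The snippet assumes the instances have already been encoded into fixed-size vectors and is a simplified sketch rather than the baseline's actual code.

```python
import torch

def proto_classify(support_emb, query_emb):
    """Prototypical-network classification step (sketch).

    support_emb: tensor of shape [N, K, D] -- encoded support set
                 (N relations, K shots each).
    query_emb:   tensor of shape [D]       -- encoded query instance.
    Returns the index of the predicted relation within the episode's relation set.
    """
    prototypes = support_emb.mean(dim=1)                      # [N, D] class prototypes
    dists = torch.cdist(query_emb.unsqueeze(0), prototypes)   # [1, N] Euclidean distances
    return int(dists.argmin(dim=1))                           # nearest prototype wins
```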
Since these baseline models were designed for English few-shot RE and are not directly applicable to Chinese content, their architectures needed to be adjusted before conducting the experiments. We opted to use BERT-Chinese [29] to replace the original encoder in these models, enabling them to process Chinese text effectively. Additionally, we made specific structural adjustments for each model to ensure they could successfully perform Chinese RE tasks.
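A minimal sketch of this encoder swap is shown below: the Chinese BERT checkpoint is loaded through the Transformers library and the head/tail entities are wrapped in marker tokens before encoding. The marker scheme and the use of the [CLS] vector are illustrative assumptions; the concrete structural adjustments made for each baseline may differ.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the Chinese BERT checkpoint and register entity marker tokens.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]   # illustrative marker scheme
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})
encoder.resize_token_embeddings(len(tokenizer))

def encode(sentence, head, tail):
    """Wrap the first mention of each entity in markers and encode the sentence."""
    marked = sentence.replace(head, f"[E1]{head}[/E1]", 1)
    marked = marked.replace(tail, f"[E2]{tail}[/E2]", 1)
    inputs = tokenizer(marked, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]      # [CLS] vector as the sentence representation

sentence_emb = encode("张三毕业于北京大学。", "张三", "北京大学")
```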
4.3. Results
In our baseline experiments on FREDC, we observed that the RE performance of the various models aligns well with their performance on their original datasets. Notably, in 5-way-1-shot in-domain RE tasks, multiple models achieved an extraction accuracy exceeding 50% on FREDC. The models exhibit superior performance on in-domain tasks compared to cross-domain tasks, which is attributable to the inherent characteristics of FREDC.
However, the BERTMB model, which is optimized for generalization, exhibited a smaller performance gap between in-domain and cross-domain experiments on our dataset. The cross-domain performance of the models on FREDC is reported as the prediction accuracy averaged over the financial and medical domains. Comprehensive experimental results are presented in Table 3.
Furthermore, to examine the performance disparities of models across Chinese and English datasets more thoroughly, we conducted cross-domain experiments using FREDC and FewRel2.0, along with NOTA experiments. In the cross-domain experiment, the test subset of FewRel2.0 covers only the biomedical sector, whereas our test subset includes the financial and medical fields. The cross-domain experiments on FREDC show that both models perform poorly in the 5-way-1-shot and 10-way-1-shot configurations, with a performance gap of about 20% compared to FewRel2.0, as illustrated in Table 4.
In the NOTA experiments, we examined scenarios with thresholds set at 0% and 50%. Models exhibited a performance gap of approximately 35% between the two datasets when the threshold was set at 50%. However, it is noteworthy that at 0% NOTA with the 5-way-5-shot setting, the discrepancy between the two datasets was approximately 10%, whereas at 0% NOTA with the 5-way-1-shot setting, this discrepancy increased to approximately 25%, as detailed in Table 5.
4.4. Analysis of Results
Our research employed a diverse array of experimental configurations to conduct a comprehensive evaluation of existing models’ performance on the FREDC dataset. In the baseline model testing phase, all models produced valid experimental results that were in line with their anticipated performance levels, thereby underscoring the utility and effectiveness of the FREDC dataset. However, it is evident that FREDC presents significant challenges to these baseline models, particularly in cross-domain experiments that involve data from two specialized fields, where the majority of models exhibited performance levels below 30%.
In the cross-domain tasks, we conducted a comparative analysis of the experimental outcomes of the Proto and BERT-Pair models on the two datasets. The findings indicate a decrease in accuracy of approximately 15–30% when processing Chinese text as opposed to English. This discrepancy can be attributed, in part, to our Chinese cross-domain test subset, which assesses the RE capabilities of models across two domains and therefore presents a more complex task than FewRel2.0. In our investigation of the NOTA task, we evaluated the performance of the BERT-Pair model on the two datasets. The results suggest that the FREDC dataset remains effective in the context of NOTA experiments. However, introducing NOTA in the testing phase affected accuracy on the Chinese dataset more strongly than on the English dataset, with a performance discrepancy of approximately 20%. It is noteworthy that under the 5-way-1-shot and 5-way-5-shot configurations with a NOTA rate of 0%, the gap between the Chinese and English datasets differed markedly, implying that with adequate support data the gap between these datasets can be substantially reduced.
The FREDC dataset, specifically curated for Chinese few-shot RE tasks, exhibits unique attributes. Within the medical domain, a significant proportion of relations are disease-related, leading to multiple relation types that are prone to confusion; for example, certain relations may involve recurring entity types or identical head and tail entity types, as illustrated in Figure 3a. In the financial domain, the majority of relations involve two companies, thereby limiting the range of head and tail entity types. Additionally, the same entity may assume different meanings in different contexts, and swapping the positions of the head and tail entities can yield new relations, as depicted in Figure 3b. Given these characteristics of the FREDC dataset, relying on entity differentiation alone is insufficient for predicting relations; the model must integrate semantic information to infer relations between entities, which substantially complicates the task of few-shot RE.
In addition, some performance loss may stem from the limitations of our work. When different baseline models make predictions, we observe the following common issues, with specific error output details shown in Table 6.
These errors mainly show that our modified baseline models are deficient in two areas. Firstly, the encoding employed by BERT-Chinese operates at the character level, in contrast to the word-level encoding used for English, which may impede the ability of models to grasp the nuances of word meanings. For example, for the sentence in Input 1, under character-level encoding the model processes each character of "Pudong New Area" as a separate input, so the phrase may be recognized as a specific institution within an administrative district and the predicted relation becomes "footprint". Secondly, the complexity inherent in Chinese grammar and sentence structure, in conjunction with the distinctive features of the dataset, contributes to the observed variations in model performance. For example, the sentence in Input 2 has a more complex structure and contains multiple entities and relations, and the model may assign the first-appearing relation, "Compete", to the entity pair. In addition, some problems may come from the difficulty setting of the dataset. For example, in the sentence in Input 3, it is difficult for the model to understand, from only a few examples, the relation between the two entities involving a histological examination, so it outputs an erroneous prediction.
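The character-level issue can be illustrated with the short sketch below, which contrasts the tokenization of the phrase from Input 1 ("Pudong New Area", assumed here to be 浦东新区) by BERT-Chinese with the output of a word segmenter; jieba is used purely for illustration and is not part of our pipeline.

```python
from transformers import BertTokenizer
import jieba  # a common Chinese word segmenter, used here only for illustration

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

text = "浦东新区"                     # "Pudong New Area", the phrase from Input 1
print(tokenizer.tokenize(text))       # ['浦', '东', '新', '区'] -- character-level pieces
print(jieba.lcut(text))               # typically ['浦东新区'] -- the name stays one unit
```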
5. Conclusions
This study introduces FREDC, a Chinese few-shot RE dataset that aims to mitigate the scarcity of Chinese datasets in few-shot RE research. Comprising over 100,000 instances across numerous relations in the general, financial, and medical domains, FREDC is applicable not only to the basic challenges of few-shot RE but also to cross-domain and NOTA experiments in few-shot settings. Through a series of experiments, we demonstrate the broad applicability of the dataset and establish a new evaluation benchmark for Chinese few-shot RE, providing a vital research foundation for advancing Chinese few-shot RE.
There are also some limitations to this study. Firstly, most few-shot RE studies with large datasets conduct experiments by repeatedly sampling relation types from the test set and averaging the outcomes, and our dataset is currently limited in the variety of relation types it encompasses. Secondly, some relations in our data are difficult to set up for few-shot evaluation, which can obscure the true performance of the model. Furthermore, the baseline comparison data we provide represent only the fundamental performance of these models, and the performance they demonstrate in this study is limited in several respects; better adjustment of these baseline models' architectures could further improve their predictive accuracy. Future work could consider more advanced pre-trained models trained on Chinese corpora, beyond BERT-Chinese, to better capture the semantics and context of the Chinese language. Additionally, word-level representations could be explored through word segmentation tools, which may enhance the model's understanding of semantic units, particularly for extracting relations in phrases and fixed expressions. For ambiguous characters or domain-specific terms in Chinese, constructing custom dictionaries or dynamically adjusting the tokenization model could improve the expressive power of the corpus. To address the complexity of Chinese grammar, syntactic dependency parsing tools could be integrated to incorporate syntactic structure information into the model, aiding the understanding of semantic relations within sentences. Furthermore, multi-task learning approaches or integrating Chinese-specific syntax and semantics during the pre-training phase could enhance the model's ability to adapt to complex grammatical structures.