A Scientific Document Retrieval and Reordering Method by Incorporating HFS and LSD

Feng, Ziyang; Tian, Xuedong

doi:10.3390/app132011207


by Ziyang Feng ¹,²,³ and Xuedong Tian ¹,²,³,*

¹ School of Cyber Security and Computer, Hebei University, Baoding 071002, China
² Institute of Intelligent Image and Document Information Processing, Hebei University, Baoding 071002, China
³ Hebei Machine Vision Engineering Research Center, Hebei University, Baoding 071002, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(20), 11207; https://doi.org/10.3390/app132011207

Submission received: 20 September 2023 / Revised: 2 October 2023 / Accepted: 11 October 2023 / Published: 12 October 2023

(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)


Abstract: Retrieving scientific documents by considering both the mathematical expressions they contain and the surrounding semantic text has become an inescapable trend. Current scientific document matching models focus solely on the textual features of expressions and, in the pursuit of improved performance, frequently suffer from growing parameter counts and slow inference. To address these problems, this paper proposes a scientific document retrieval method founded upon hesitant fuzzy sets (HFS) and local semantic distillation (LSD). First, to extract both spatial and semantic features for each symbol within a mathematical expression, an expression analysis module leverages HFS to establish feature indices. Second, to enhance contextual semantic alignment, knowledge distillation is employed to refine the pretrained language model and establish a twin network for semantic matching. Finally, by combining mathematical expressions with contextual semantic features, the retrieval results can be made more efficient and rational. Experiments were carried out on the NTCIR dataset and an expanded Chinese dataset. The average MAP for mathematical expression retrieval was 83.0%, and the average nDCG for ranking scientific documents was 85.8%.

Keywords: scientific document retrieval; HFS; knowledge distillation; local semantic; mathematical expression

1. Introduction

With the continuous growth in research across science, technology, engineering, and mathematics (STEM) disciplines, the body of scientific literature, much of it dense with complex mathematical formulas and equations, has expanded exponentially, posing challenges to traditional text-based retrieval methods. The semantics conveyed by one or more mathematical expressions within a document often encapsulate the core message of the entire document. At the same time, mathematical expressions frequently cannot be succinctly explained or searched for using just a few words. An effective way to achieve better retrieval results is therefore to use mathematical expressions for the retrieval of scientific documents [1,2].

Due to the diversity and inherent semantic ambiguity within mathematical expressions, leveraging mathematical information for a range of research tasks—such as mathematical information retrieval and comprehension of mathematical formulas—continues to present a formidable challenge. The distinctive two-dimensional structure adds complexity to their encoding, as retrieval problems encompass not just syntactic term matching but also entail semantic reasoning. To attain a higher degree of precision in matching mathematical expressions during retrieval, it is imperative to carefully contemplate the use of appropriate data structures [3].

Sentence embeddings obtained through training provide dense vector representations. State-of-the-art sentence modeling methods are predominantly based on pretrained language models and find widespread application in various real-world scenarios, such as sentence text similarity and text retrieval. Furthermore, sentence embedding methods based on pretrained language models tend to favor larger model architectures and larger training datasets to achieve enhanced performance [4,5]. Although effective, such large models cause predicaments in terms of inference, as they are computationally expensive and time-consuming to deploy with limited computational resources. They often take on the role of ‘teachers’, transferring their high capabilities to smaller ‘student’ models through knowledge distillation (KD) [6].

In summary, mathematical expressions constitute a semi-formal visual language for conveying meaning. Identifying mathematical expressions presents several challenges in the contemporary landscape. For instance, various representation methods exist for mathematical expressions and pinpointing spatial correlations can be intricate, impeding users’ ability to execute uniform queries. Regarding the fusion of mathematical expressions and their contextual semantics, current approaches rely on large-scale models to obtain their embeddings, leading to substantial computational and inference expenses.

In light of the aforementioned challenges, this paper proposes a scientific document retrieval and sorting model based on HFS (hesitant fuzzy sets) [7] and LSD (local semantic distillation). In this approach, the document is first processed by the document preprocessor module, which extracts formulas in Presentation MathML or LaTeX format along with their context. Subsequently, mathematical expressions are broken down into their constituent attributes and the corresponding evaluation functions are established. After that, knowledge distillation is applied to pretrained language models, and the distilled model is used to compute contextual semantic similarity. Finally, this semantic similarity is used to refine the retrieval results for mathematical expressions, producing the final ranking of scientific documents. This method capitalizes on the strengths of HFS in fuzzy decision support and the benefits of KD in compressing large models while preserving their performance, thereby improving the efficacy of scientific document retrieval systems. The contributions of this paper are summarized as follows:

An efficient data structure is designed to preserve the characteristics of the mathematical expression itself and optimize the data storage form, thereby taking advantage of the hesitant fuzzy set in multi-characteristic decision-supporting to obtain the similarity of the mathematical expression.
Based on the teacher–student structure model, a knowledge distillation loss function that combines the intermediate layer hidden units and the output layer logical units is designed to obtain high-quality local semantic information.
By using semantic information to rearrange the retrieval results derived from mathematical expressions, the retrieval results of scientific documents are improved in terms of efficiency and accuracy. Furthermore, the dataset was augmented by including Chinese scientific documents (CSDs).

2. Related Work

2.1. Scientific Document Retrieval

Relying solely on text-based retrieval methods for scientific documents often falls short of meeting the retrieval criteria. It is, in fact, more logical to integrate mathematical expressions during retrieval, given the substantial presence of formulas within scientific documents. Pfahler et al. [8] pioneered the application of unsupervised embedding learning and graph convolutional neural networks to the task of learning mathematical representations. They ingeniously leveraged graph convolutional neural networks to extract mathematical expressions within documents. Experimental findings demonstrated that, in comparison to traditional vector space methods, this approach not only improves similarity precision but also facilitates the retrieval of interdisciplinary literature.

An indispensable strategy for enhancing technology literature search involves obtaining semantically meaningful representations from academic documents. Language models built upon the Transformer architecture have brought about substantial progress in the realm of natural language processing. Notably, models such as SciBERT [9] and BioBERT [10] have showcased their prowess in acquiring comprehensive text representations. In pursuit of obtaining semantically meaningful document-level representations, Razdaibiedina et al. [11] proposed a straightforward approach. By amalgamating the structure of SciBERT with a novel loss function, they fine-tuned Transformer models to acquire high-quality representations of scientific literature. These representations were then utilized for predicting target journal categories based on titles and abstracts. Evaluation results across three distinct datasets showcased the efficacy of this approach, demonstrating that the learned high-quality representations are more conducive to information retrieval tasks.

In the realm of combining mathematical expressions with contextual text, and in response to the challenge posed by pretrained models that tend to overlook the structural intricacies and semantic connections between mathematical formulas and their context, Peng et al. [12] introduced MathBERT. This model was trained jointly on mathematical formulas and their corresponding context. In their efforts, they designed pretraining tasks aimed at more effectively capturing the semantic and structural features of formulas. Pankaj et al. [13] utilized vectors comprising one-dimensional formula embeddings and generalized text for the execution of mathematical information retrieval tasks.

In this study, scientific document retrieval places significant emphasis on both mathematical expressions and contextual semantics. We analyze each symbol within the expressions, focusing on individual components rather than attributes of multiple sub-expressions. Moreover, we leverage the language comprehension abilities learned by large models to enhance the overall performance of smaller models.

2.2. Hesitant Fuzzy Set

The achievements of hesitant fuzzy sets [14,15] have catalyzed remarkable progress in modeling uncertain knowledge and addressing practical problems. The paramount objective is to encapsulate hesitancy when assigning fuzzy memberships to specific alternative solutions. In certain decision-making scenarios, arriving at a singular, precise decision value proves to be a challenge; instead, a spectrum of values emerges from diverse assessments [16]. When contemplating the multifaceted impact of mathematical expressions on outcomes, focusing on the evaluation results from various dimensions during the intermediate stages can prove advantageous for informed decision-making, as such information cannot be distilled into a solitary weighted value. Mishra et al. [17] introduced an innovative approach for multi-attribute decision-making, founded on Fermatean hesitant fuzzy sets (FHFS). In dealing with fuzzy and imprecise data within the realm of multi-attribute decision problems employing FHFS, they integrated the maximum deviation principle with a generalized distance metric to ascertain attribute weights. When evaluating the reliability of engineering systems, Mahapatra et al. [18] formulated redundancy allocation within the hesitant fuzzy framework as a multi-objective problem. They introduced a set of algorithms, which encompassed hesitant alternative multi-objective particle swarm optimization, rooted in hesitant fuzzy scenarios. Pattanayak et al. [19] introduced an innovative hesitant fuzzy time series (FTS) forecasting model that employs a support vector machine. In the model, fuzzy logic relationships were constructed using each observation alongside its average aggregated membership value. Li et al. [20] devised symbol layout trees for mathematical expressions, extracted sub-expressions, and calculated the membership relationships for multiple attributes on sub-expressions within mathematical expressions to determine similarity between expressions.

In this work, the similarity of mathematical expressions is assessed in terms of structure, length, and semantic factors, with a degree of fuzziness in the membership relationships between different evaluation elements.

2.3. Knowledge Distillation

Knowledge distillation serves as a widely adopted model compression technique, with extensive research into methods for extracting knowledge from pretrained models, exemplified by approaches like DistilBERT [21], BERT-PKD [22], and MiniLM [23]. Student models trained through KD can approach the effectiveness of their teacher models, and their smaller size facilitates more efficient online inference. Moreover, as highlighted by FitNets [24], the utilization of hidden representations from intermediate layers, in addition to the output layer, has been shown to further enhance the training of student networks. TinyBERT [25] introduced a two-stage distillation method, aiming to distill both the prediction layer and the intermediate layer. It achieved results that closely approached the performance of the teacher model across various natural language processing tasks. Building upon this foundation, and to assess the effectiveness of the knowledge distillation model in the document ranking task, Chen et al. [26] proposed two simplification schemes for TinyBERT, resulting in further enhancements to the efficacy of the distilled ranking model. The results from evaluations on the MS MARCO and TREC 2019 DL Track document ranking tasks underscore the potential of knowledge extraction in document retrieval.

In this paper, our attention is directed towards the process of knowledge extraction, with a specific emphasis on transferring local semantics from both deep and shallow teacher models to enhance the semantic matching performance of the student model.

3. Method

Figure 1 presents a visualization of the workflow for the scientific document retrieval model. The Mathematical Expression Matching (MEM) module, utilizing the Formula Description Structure (FDS) [27] algorithm, centers its attention on each term within the mathematical expression. This generates unique expression features, facilitating the computation of similarity between the query expression and the candidate expression. The Content Semantic Matching module initially employs knowledge distillation techniques to transfer the language modeling capabilities of a well-trained teacher model into the student model. Subsequently, it uses the student network to construct a weight-shared twin network, calculating contextual semantic similarity. Ultimately, the similarities from both aspects are combined to determine the reordering results in scientific document retrieval.

3.1. Mathematical Expression Matching

The Mathematical Expression Matching module is designed to enhance the precision of information retrieval by considering both semantic and syntactic variations. Its primary objective is to use HFS to gauge the similarity between a user's input and the mathematical expressions present in candidate scientific documents. Initially, an index is created for the expressions extracted from the scientific documents, which can be in either LaTeX or MathML format. Subsequently, the FDS algorithm is used to normalize them and construct the corresponding membership functions, forming the hesitant fuzzy features. Finally, the generalized hesitant fuzzy distance is used to assess the degree of match between expressions, resulting in an ordered ranking list.

3.1.1. HFS

The difficulty in determining membership degrees within hesitant fuzzy sets lies not in the precise numeric deviation but rather in the presence of a range of potential values for the same element. Given a fixed attribute set $X = \{x_1, \dots, x_n\}$, the hesitant fuzzy set [7] on the attribute set can be expressed as

$$E = \{\langle x, h_E(x) \rangle \mid x \in X\}, \tag{1}$$

where $h_E(x)$ is a hesitant fuzzy element, representing the possible membership degree values. The membership values assigned to each object are arranged in descending order and lie within the range $[0, 1]$.

When it comes to measuring the similarity between hesitant fuzzy sets, assuming that sets M and N share a common attribute set, the generalized hesitant distance [28] between them can be represented as

$$d_{ghn}(M, N) = \left[\frac{1}{n} \sum_{i=1}^{n} \left(\frac{1}{l_{x_i}} \sum_{j=1}^{l_{x_i}} \left|h_M^{\sigma(j)}(x_i) - h_N^{\sigma(j)}(x_i)\right|^{\lambda}\right)\right]^{\frac{1}{\lambda}}, \tag{2}$$

where $x_i$ ($i$ = level, flag, order, operator) are the four evaluation attributes of a term in the expression, so that $n = 4$; $l_{x_i}$ is the number of membership values in the hesitant fuzzy element for $x_i$; $h_M^{\sigma(j)}(x_i)$ and $h_N^{\sigma(j)}(x_i)$ are the $j$-th largest values in $h_M(x_i)$ and $h_N(x_i)$, respectively; and $\lambda > 0$ is the distance parameter. The smaller the distance, the more similar M and N are.
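As an illustration of Equation (2), the following Python sketch computes the generalized hesitant distance between two symbols described by the four attributes. The dictionary layout, the function names, and the rule for padding the shorter hesitant fuzzy element (repeating its smallest value) are assumptions made for this example; the source does not state how elements of unequal length are handled, and every element is assumed to be non-empty.

```python
from typing import Dict, List

# The four evaluation attributes named in the text.
ATTRIBUTES = ("level", "flag", "order", "operator")


def _sorted_and_padded(values: List[float], length: int) -> List[float]:
    """Sort memberships in descending order and pad them to `length`.

    Assumption: the shorter element is extended by repeating its smallest
    value; the source does not specify a padding rule.
    """
    ordered = sorted(values, reverse=True)
    return ordered + [ordered[-1]] * (length - len(ordered))


def generalized_hesitant_distance(
    m: Dict[str, List[float]],
    n: Dict[str, List[float]],
    lam: float = 2.0,
) -> float:
    """Evaluate Equation (2) for two hesitant fuzzy sets over ATTRIBUTES."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    outer = 0.0
    for attr in ATTRIBUTES:
        length = max(len(m[attr]), len(n[attr]))  # plays the role of l_{x_i}
        h_m = _sorted_and_padded(m[attr], length)
        h_n = _sorted_and_padded(n[attr], length)
        inner = sum(abs(a - b) ** lam for a, b in zip(h_m, h_n))
        outer += inner / length
    return (outer / len(ATTRIBUTES)) ** (1.0 / lam)


if __name__ == "__main__":
    query = {"level": [1.0], "flag": [1.0], "order": [1.0], "operator": [1.0]}
    candidate = {"level": [0.5], "flag": [1.0], "order": [1.0], "operator": [1.0]}
    print(generalized_hesitant_distance(query, candidate, lam=1.0))
```

With lam = 1 the expression reduces to a Hamming-type distance and with lam = 2 to a Euclidean-type distance, so the parameter controls how strongly large per-attribute deviations are penalized.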

3.1.2. Establish HFS of Expression

In order to retain the semantic and structural intricacies of the mathematical expression and to enhance the efficiency of memory management dedicated to storing mathematical indices, the FDS [27] is employed for parsing the mathematical expression. Concretely, for every symbol within the mathematical expression, a hesitant fuzzy set is formed, considering four essential elements: level, flag, order, and operator. These elements are seamlessly amalgamated into a quadruplet along with their corresponding indices, all of which are then meticulously stored in the database.

Consider $Q$ as the query expression, $D_i$ ($i = 1, 2, \dots, CN$) as the established expression dataset, and $CN$ as the number of mathematical expressions within the dataset. Let $S_{Q q_r}$ represent the $q_r$-th symbol in the query expression and $S_{D d_r}$ denote the $d_r$-th symbol in the mathematical expression found in the dataset. The membership function definitions [29] for various evaluation attributes are detailed in Table 1.

3.1.3. Expression Similarity Calculation

In this section, the matching process between the mathematical expression in the query and the mathematical expressions within the document set is accomplished by evaluating the similarity of each symbol in the mathematical expression using the HFS algorithm. Leveraging the membership functions of multiple attributes, each symbol in the mathematical expression undergoes assessment, leading to the retrieval and ranking of similar mathematical expressions within the candidate scientific document set. Computing the degree of match between mathematical expressions is depicted in Algorithm 1.

Algorithm 1 Match degree calculation of mathematical expression.

Input: $Q$, $D_i$ $(i = 1, 2, \dots, CN)$
Output: $SimExpList$ //a list of mathematical expressions similar to Q
$S_{Q q_r}$ //symbols of $Q$, parsed by FDS
$S_{D d_r}$ //symbols of $D_i$, parsed by FDS
for $qs$ in $S_{Q q_r}$ do
    for $ds$ in $S_{D d_r}$ do
        if $qs == ds$ then
            $vec = [MF_{level}(qs, ds), MF_{flag}(qs, ds), MF_{order}(qs, ds), MF_{operator}(qs, ds)]$;
            $list_{tmp}.add(qs, qs.id, vec)$;
        else
            $list_{tmp}.add(qs, qs.id, [0, 0, 0, 0])$;
        end if
    end for
    $list_{qs}.add(qs, [1, 1, 1, 1])$;
end for
for $tmp$ in $list_{tmp}$ do
    if $tmp.qs$ not in $list_D.ds$ then
        $list_D.add(tmp.qs, tmp.vec)$;
    else
        if $tmp.vec > list_D.vec$ then
            UPDATE $list_D.vec = tmp.vec$;
        end if
    end if
end for
$SimExpList = Sim(list_{qs}, list_D)$;
return $SimExpList$
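The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the `Symbol` fields are hypothetical, `toy_membership` is a placeholder for the Table 1 membership functions (which are not reproduced here), membership vectors are compared by their sum, and the final `Sim` step is taken to be one minus the mean generalized distance between each best vector and the ideal vector [1, 1, 1, 1].

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Symbol:
    """One parsed symbol; the field names are hypothetical."""
    name: str
    level: int
    flag: int
    order: int
    is_operator: bool


Membership = Callable[[Symbol, Symbol], List[float]]


def toy_membership(qs: Symbol, ds: Symbol) -> List[float]:
    """Placeholder for the Table 1 membership functions (not from the source)."""
    return [
        1.0 / (1.0 + abs(qs.level - ds.level)),
        1.0 if qs.flag == ds.flag else 0.0,
        1.0 / (1.0 + abs(qs.order - ds.order)),
        1.0 if qs.is_operator == ds.is_operator else 0.0,
    ]


def best_vectors(
    query: Sequence[Symbol], candidate: Sequence[Symbol], membership: Membership
) -> List[List[float]]:
    """For each query symbol, keep the best membership vector in the candidate."""
    result = []
    for qs in query:
        best = [0.0, 0.0, 0.0, 0.0]  # used when no candidate symbol matches
        for ds in candidate:
            if qs.name == ds.name:
                vec = membership(qs, ds)
                if sum(vec) > sum(best):  # assumed ordering of vectors
                    best = vec
        result.append(best)
    return result


def match_degree(
    query: Sequence[Symbol],
    candidate: Sequence[Symbol],
    membership: Membership,
    lam: float = 2.0,
) -> float:
    """Assumed Sim step: 1 - mean distance to the ideal vector [1, 1, 1, 1]."""
    if not query:
        return 0.0
    scores = []
    for vec in best_vectors(query, candidate, membership):
        dist = (sum(abs(1.0 - v) ** lam for v in vec) / len(vec)) ** (1.0 / lam)
        scores.append(1.0 - dist)
    return sum(scores) / len(scores)
```

Under these assumptions, a ranked list would be obtained by computing `match_degree` for every candidate expression and sorting the candidates in descending order of the score.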

3.2. Content Semantic Matching Based on Distilled Model

Intuitively speaking, a mathematical formula transcends being a mere sequence of symbols; it possesses a profound semantic connection with its surrounding context. Traditional approaches to implementing scientific document retrieval through semantic matching tend to furnish document-level representations of the entire text. This often results in an inundation of redundant and unrelated information, introducing unwanted noise and inflating the computational burden of language modeling. Consequently, this can give rise to semantic disparities and sluggish inference speeds. Even the common practice of extracting global keywords, while useful, often falls short in terms of precision and can lead to the loss of a substantial portion of the underlying semantics. Therefore, the adoption of contextual statements proves invaluable in capturing a broader and more nuanced meaning and intent behind mathematical expressions. This approach enhances the precision of interpreting mathematical content, leading to a deeper understanding of the intricate relationship between mathematical expressions and textual information.

On one hand, employing the BERT model for sentence-pair regression tasks has proven to be effective; however, its substantial computational demands pose limitations on its widespread application. On the other hand, a prevalent strategy for tackling clustering and semantic search challenges involves transforming each sentence into a vector space, wherein semantically related sentences are positioned closely together. Nonetheless, the conventional approach of feeding individual sentences into BERT and deriving fixed-sized sentence embeddings often yields surprisingly subpar results. Through strategic modifications to the BERT model, this paper can attain sentence embeddings that carry semantically meaningful information. This transformation renders BERT suitable for a range of novel applications, including large-scale clustering and information retrieval through semantic search, where the accurate representation of sentences is crucial.

3.2.1. Knowledge Distillation Framework

In the KD framework we have adopted, the knowledge refinement stage adheres to the widely recognized teacher–student structure, illustrated in Figure 2. A notable and effective approach involves KD from a BERT-like model into a Transformer structure. The primary objective is to facilitate the transfer of knowledge from a teacher model—often larger and more complex—to a student model, which is typically smaller. The ultimate aim is to ensure that the student model can sustain a level of performance akin to that of the teacher model. At the outset, pretrained language models are leveraged to initialize model weights. The teacher model is constructed as a fixed parameter encoder, comprising T layers of transformer blocks. On the other hand, the student model consists of S layers (where S < T), and its parameters can be initialized using a pretrained model similar to BERT. It is worth noting that in the context of BERT-PKD [22], it has been demonstrated that using the BERT-Base model can be equally effective as employing the considerably larger BERT-Large model. Following model initialization, the proposed method for transferring local semantics within the model is introduced. This step is essential for imparting nuanced knowledge from the teacher to the student. Lastly, the process of knowledge transmission within the output layer and the computation of the comprehensive loss function is described. These final stages of the KD process are critical to ensure that the student model effectively inherits the wisdom of the teacher model and can achieve comparable performance.

3.2.2. Intermediate-Layer Local Semantic Transfer

Regarding the preservation of local structural information, it is crucial that the hidden layers of the student model closely mirror the distribution of the corresponding layers in the teacher model. To ensure the stability of the student model throughout the knowledge transfer process within its hidden layers [25], $L_{mid}$ is designed to measure the distance between the representations of the hidden layers as

$$L_{mid} = \sum_{x \in \chi} \sum_{l=1}^{L} \mathrm{KL}\left(\mathrm{inter}\left(f_S^{l}(x)\right) W_h,\ \mathrm{inter}\left(f_T^{g(l)}(x)\right)\right), \tag{3}$$

where $x$ is the text input and $\chi$ is the training dataset. The teacher and student models are not required to have the same number of hidden layers; $L$ represents the number of hidden layers in the student network. The $l$-th hidden layer in the student model is represented as $f_S^{l}$; the $g(l)$-th hidden layer in the teacher model is represented as $f_T^{g(l)}$; the mapping function $g(\cdot)$ specifies that the $l$-th student layer learns from the $g(l)$-th teacher layer; $\mathrm{inter}(\cdot)$ measures the structural information in the hidden layers; and $W_h$ is a learnable linear transformation that aligns the hidden states of the student and teacher networks, bringing them into a shared space. The $\mathrm{KL}$ divergence measures the difference between the two probability distributions.

3.2.3. Output-Layer Knowledge Transfer

In the distillation process, the student loss is typically computed by merging the Cross-Entropy (CE) loss with Kullback–Leibler (KL) divergence, which quantifies the dissimilarity between the student and teacher model outputs. The divergence between the distributions of the student and teacher models is computed using the KL divergence, facilitating the acquisition of high-quality text representations by the student from the teacher. Additionally, the cross-entropy loss function comes into play to assess the disparity between the student’s predicted labels and the actual labels, offering more precise guidance during the training of the student network. The distillation loss function in the output layer of a student network can typically be expressed as

$$L_{KD} = \mathrm{KL}\left(\sigma\left(\frac{z^{S}}{\tau}\right), \sigma\left(\frac{z^{T}}{\tau}\right)\right) + \lambda L_{CE}\left(\sigma(z^{S}), y\right), \tag{4}$$

where $\sigma$ denotes the SoftMax function; $z^{S}$ and $z^{T}$ represent the outputs of the student and teacher networks, respectively; and the temperature $\tau$ is a hyperparameter: the larger $\tau$ is, the smoother the distribution of positive and negative samples. In this paper, $\tau = 3$ performed well. $y$ represents the true labels, and $\lambda$ is a balancing hyperparameter.
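To make Equation (4) concrete, the sketch below evaluates the output-layer loss for a single example in plain Python. The function names are illustrative, and two choices are assumptions rather than statements from the source: the KL term is computed as KL(teacher ‖ student), the usual direction in distillation, and no additional temperature-squared scaling is applied because Equation (4) does not show one.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    tau: float = 3.0,
    lam: float = 1.0,
) -> float:
    """Soft-target KL term plus lam times the hard-label cross-entropy."""
    p_student = softmax(student_logits, tau)
    p_teacher = softmax(teacher_logits, tau)
    soft_term = kl_divergence(p_teacher, p_student)  # assumed direction
    hard_term = -math.log(softmax(student_logits)[label])
    return soft_term + lam * hard_term
```

When the student and teacher logits coincide, the KL term vanishes and only the cross-entropy term remains, which is a quick sanity check for an implementation of this loss.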

The overall loss function is as follows:

$$L_{total} = L_{KD} + \beta L_{mid}, \tag{5}$$

where the confidence coefficient $\beta > 0$ is used as a balancing factor.

3.2.4. Calculation of Contextual Semantic Similarity

In this module, as inspired by the literature [30], we employ a twin network structure to compute the similarity between the query and candidate items. It comprises identical networks with shared weights following the distillation process, as illustrated in Figure 3. Before feeding sentences into the network, a preprocessing step is performed. For Chinese datasets, segmentation is the initial requirement, whereas English datasets do not necessitate segmentation. After that, stop words are removed, and sentence length is synthesized. The objective is to retain valuable semantics while mitigating the noise introduced by irrelevant data. Subsequently, semantic similarity is computed using the Manhattan distance. Based on experimental analysis, cosine similarity is deemed more suitable for measuring lexical-level semantic similarity as it focuses solely on the angle between two vectors. Conversely, the Manhattan distance [31] is found to be better suited for assessing text similarity at the sentence level and paragraph level. This is attributed to its capability to incorporate richer semantic information such as sentence length. The formula for calculating semantic similarity is shown below:

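$$sim(Q, C) = \exp\left(-\left\| V^{Q} - V^{C} \right\|_{1}\right), \tag{6}$$

where $V^{Q}$ and $V^{C}$ are the sentence vectors produced by the twin network for the query context and the candidate context, respectively.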
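A minimal Python sketch of Equation (6) follows, reading the bars as the Manhattan (L1) distance described in the text. The function name and the use of plain lists for the sentence vectors are assumptions for illustration; in practice the vectors would come from the distilled twin network.

```python
import math
from typing import Sequence


def manhattan_similarity(v_q: Sequence[float], v_c: Sequence[float]) -> float:
    """exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they diverge."""
    if len(v_q) != len(v_c):
        raise ValueError("vectors must have the same dimension")
    l1 = sum(abs(a - b) for a, b in zip(v_q, v_c))
    return math.exp(-l1)
```

Because the exponent is never positive, the score always lies in (0, 1], which makes it straightforward to combine with the expression similarity when re-ranking.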

4. Experimental Process and Result Analysis

4.1. Experimental Data

The distillation processing stage employs the public SNLI and MNLI datasets, while the retrieval of scientific documents is conducted using the public NTCIR dataset along with the extended dataset of Chinese scientific documents (CSDs).

The SNLI dataset comprises 570,152 samples, with each sample consisting of a pair of sentences categorized into one of three classes: neutral, entailment, or contradiction. The MNLI dataset, which extends the SNLI, encompasses 433,000 pairs of sentences and facilitates cross-genre generalization assessment. The public NTCIR dataset, derived from the Wikipedia corpus, comprises 31,742 English documents, encompassing a total of 551,675 mathematical expressions from various design-related fields. To validate the method’s effectiveness, additional CSDs containing 10,372 documents and 121,495 mathematical expressions were introduced to augment the dataset.

4.2. System Experiment

4.2.1. Matching Results of Mathematical Expression Based on HFS

To ensure the reliability of the mathematical expression similarity evaluation method relying on HFS, an extensive array of experiments was carried out, encompassing diverse types of mathematical expressions. Subsequently, to illustrate the effectiveness of the proposed method, a series of statistical analysis experiments were conducted. These experiments revolved around 10 randomly selected query expressions, intentionally chosen to encompass a variety of mathematical symbols and structures, as outlined in Table 2.

In the experiment, the retrieval results for the ten queries are evaluated using the MAP (Mean Average Precision) metric, which reflects both recall and precision. The AP (Average Precision) of a single query is calculated as

$$AP = \frac{1}{r} \sum_{i=1}^{r} \frac{i}{pos(i)}, \tag{7}$$

where $i$ is the index of a relevant document among the relevant documents, ordered by their position in the result list; $pos(i)$ signifies the position of the $i$-th relevant document in the result list; and $r$ denotes the total number of relevant documents.
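The AP formula in Equation (7) and its average over queries can be sketched as follows. The function names and the input convention (1-based positions of the relevant documents in each result list) are assumptions for this example.

```python
from typing import Iterable, List, Sequence


def average_precision(relevant_positions: Iterable[int]) -> float:
    """AP from the 1-based ranks at which relevant documents were returned."""
    positions: List[int] = sorted(relevant_positions)
    if not positions:
        return 0.0
    # The i-th relevant document found at rank pos(i) contributes i / pos(i).
    return sum(i / pos for i, pos in enumerate(positions, start=1)) / len(positions)


def mean_average_precision(per_query_positions: Sequence[Iterable[int]]) -> float:
    """MAP: the mean of the AP values over all queries."""
    if not per_query_positions:
        return 0.0
    return sum(average_precision(p) for p in per_query_positions) / len(per_query_positions)
```

For instance, a query whose two relevant documents are returned at ranks 2 and 4 contributes 1/2 and 2/4, giving an AP of 0.5.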

MAP is a widely used metric in information retrieval, calculated as the average of AP scores. It provides a comprehensive assessment of retrieval performance. Table 3 presents the MAP scores for different ‘k’ values in both the Chinese and English datasets. As shown in the table, the MAP decreases gradually as the ‘k’ value increases. This trend is attributed to the smaller number of similar expressions available for certain complex mathematical expressions in the dataset. For some of these intricate or less-common expressions, there are fewer than 20 similar candidates, which results in relatively lower MAP_20 values.

4.2.2. Retrieval Results of Scientific Documents by Incorporating HFS and LSD

Similar to many search engines and retrieval tasks that yield a sequence of results, this research, focusing on the NTCIR dataset and CSD dataset, generates a list of HTML file names. The principle followed here is that the more relevant documents should appear towards the front of the list, meaning they are sorted in descending order of relevance. Henceforth, the evaluation of the sorted retrieval results is conducted using DCG (Discounted Cumulative Gain). The calculation formula for DCG is as follows:

$$DCG = rel_{1} + \sum_{i=2}^{P} \frac{rel_{i}}{\log_{2} i} \quad (P \geq 2), \tag{8}$$

where $i$ represents the ranking number of the retrieval result; $rel_{i}$ denotes the relevance of the $i$-th search result, categorized as completely relevant (scored as 4), relatively relevant (scored as 3), partially relevant (scored as 1), or completely irrelevant (scored as 0); and $P$ refers to the total number of retrieval results.

To facilitate the comparison of retrieval result scores across different levels, nDCG (normalized Discounted Cumulative Gain) is employed to standardize the DCG values. The calculation method is as follows:

$$nDCG = \frac{DCG}{IDCG}, \tag{9}$$

where $IDCG$ is the $DCG$ value achieved when the returned list is in its ideal state.
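The DCG and nDCG computations in Equations (8) and (9) can be sketched in Python as below. The function names are illustrative, and the ideal list is assumed to be the returned relevance scores sorted in descending order; an evaluation that builds the ideal list from all judged documents in the collection would give a different IDCG.

```python
import math
from typing import List, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Equation (8): rank 1 is undiscounted, rank i >= 2 is divided by log2(i)."""
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances: Sequence[float]) -> float:
    """Equation (9), with the ideal order taken from the returned list itself."""
    ideal: List[float] = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Note that with this form of DCG the first two ranks carry the same weight, since the discount at rank 2 is log2(2) = 1; swapping the top two results therefore does not change the score.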

In this experiment, the mathematical expressions and query texts from Table 2 are employed for the statistical analysis of scientific document retrieval. Figure 4 displays the nDCG@10 scores obtained for various mathematical expressions.

4.2.3. Comparative Experiment

To ensure a fair comparison, the hyperparameters of each baseline were selected through grid search or set in accordance with the original papers. SearchOnMath [32] is a specialized tool for mathematical formula retrieval that matches mathematical expressions and retrieves pertinent scientific documents based on them. Tangent-CFT [33], by contrast, is a mathematical expression embedding model that uses a depth-first search to transform the paths between symbols in the expression tree into a list of symbol tuples. It then uses FastText to embed these tuples and retrieves mathematical expressions by averaging the resulting vectors. ColBERT [34], a representation-based text matching method, was also included in the comparison. Figure 5 presents the literature retrieval results achieved using the different methods.
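The tuple-and-average idea behind this kind of baseline can be illustrated with a small sketch. It is not the Tangent-CFT implementation: the `Node` class, the edge labels (`"a"` for above, `"n"` for next), and the hash-based `toy_vector` stand-in for a trained FastText model are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)  # list of (edge_label, Node)


def symbol_tuples(root):
    """Depth-first walk emitting (ancestor, descendant, edge-label path) tuples."""
    tuples = []

    def dfs(node, ancestors):
        # ancestors holds (ancestor_symbol, path from that ancestor to this node)
        for anc_symbol, path in ancestors:
            tuples.append((anc_symbol, node.symbol, path))
        for label, child in node.children:
            extended = [(s, p + label) for s, p in ancestors]
            extended.append((node.symbol, label))
            dfs(child, extended)

    dfs(root, [])
    return tuples


def toy_vector(item, dim=4):
    """Deterministic placeholder embedding; a real system would use a trained model."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def formula_vector(tuples, dim=4):
    """Average the tuple vectors to obtain one vector per expression."""
    vectors = [toy_vector(t, dim) for t in tuples]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Layout tree for x^2 + 1: "2" sits above "x", "+" follows "x", "1" follows "+".
plus = Node("+", [("n", Node("1"))])
root = Node("x", [("a", Node("2")), ("n", plus)])

tuples = symbol_tuples(root)
print(tuples)
vec = formula_vector(tuples)
```

Two expressions can then be compared by the cosine similarity of their averaged vectors, which is why such models see only the expression and none of the surrounding text.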

Figure 5 shows the average nDCG@10 values over the 10 query expressions for each method. SearchOnMath [32] and Tangent-CFT [33] rely solely on the information within the mathematical expressions and do not take other attributes of scientific documents into account. As a result, they can match intermediate results from a mathematical derivation and return scientific literature unrelated to the intended topic. ColBERT [34], a large model with many parameters, concentrates on modeling text content; this can introduce noise into the retrieval process and degrade the results.

The primary objective of the proposed method is to improve both the precision and the efficiency of scientific literature retrieval. Figure 6 reports the query execution time of the different methods. The structure of the mathematical expression significantly influences this time: more intricate expressions take longer to process. By decomposing mathematical expressions, building indexes, and using the distilled model to compute semantic similarity, the proposed method substantially reduces the time needed to retrieve the most suitable scientific documents, and it therefore holds a distinct advantage in execution time.

In summary, the method based on hesitant fuzzy sets and local semantic distillation proves to be a highly efficient and accurate approach for retrieving relevant scientific documents in response to queries. This effectiveness stems from HFS’s capability to consider not only the textual content of mathematical expressions but also their structural features, which play a pivotal role in matching both the expressions themselves and their underlying semantics. Furthermore, the model obtained through KD effectively implements the matching of context semantics, resulting in reduced retrieval times. By combining the mathematical expression feature matching module with the use of context semantics to re-rank retrieval results, a significant improvement in the accuracy of scientific document retrieval is achieved.
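The two-stage idea summarized above (match on expressions first, then reorder using contextual semantics) can be sketched as follows. The linear weighting, the `alpha` parameter, the `top_k` cutoff, and the sample scores are hypothetical choices made for illustration; this section does not specify the exact combination rule used by the method.

```python
def rerank(candidates, alpha=0.5, top_k=10):
    """Shortlist by expression similarity, then reorder by a combined score.

    candidates: list of (doc_id, expr_sim, ctx_sim) with both scores in [0, 1].
    alpha: hypothetical weight on expression similarity (assumption, not from the paper).
    """
    # Stage 1: keep the top_k documents by expression similarity.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    # Stage 2: reorder the shortlist using the combined score.
    scored = [
        (doc_id, alpha * expr_sim + (1 - alpha) * ctx_sim)
        for doc_id, expr_sim, ctx_sim in shortlist
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)


# Hypothetical candidates: (file name, expression similarity, context similarity).
candidates = [
    ("a.html", 0.9, 0.2),
    ("b.html", 0.8, 0.9),
    ("c.html", 0.4, 0.95),
]
result = rerank(candidates, alpha=0.5, top_k=2)
print([doc_id for doc_id, _ in result])  # ['b.html', 'a.html']
```

In this toy run, `a.html` matches the expression best but has weak contextual similarity, so it drops below `b.html` after reordering, while `c.html` never enters the shortlist.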

5. Conclusions

This paper introduces a novel scientific document retrieval method based on HFS and LSD. HFS provides a comprehensive exploration of the relevance between mathematical expressions and offers a fresh perspective on evaluating the similarity of mathematical expressions through individual term analysis. By constructing hesitant fuzzy sets based on the hesitation degree of each symbol in the mathematical expression and assessing similarity using distance measurement, it addresses the limitations of existing methods that rely on single measures for similarity evaluation. In contrast with conventional scientific document retrieval approaches, this paper incorporates LSD. It leverages the extraction of hidden layer local semantics from the teacher model, thereby reducing the gap in local knowledge distribution between the student model and the teacher model. This reduction in discrepancy effectively cuts down retrieval costs. Empirical results underscore the efficacy of this retrieval method, surpassing existing mathematical-expression-based scientific literature search methods in terms of retrieval accuracy and efficiency.

Follow-up work will be carried out in the following aspects:

Investigating various knowledge distillation approaches, striving for optimization in areas such as model architecture and loss functions, and aiming to enhance the accuracy of semantic matching while accelerating model inference.
Delving into the attribute contents of scientific documents from numerous perspectives, proficiently incorporating multi-attribute features during the retrieval process, and striving to further enhance the precision and utility of the scientific document retrieval system.

Author Contributions

Methodology, X.T.; Validation, Z.F.; Formal analysis, Z.F. and X.T.; Investigation, Z.F.; Writing—original draft, Z.F. and X.T.; Writing—review & editing, Z.F. and X.T.; Visualization, Z.F.; Supervision, X.T.; Funding acquisition, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Natural Science Foundation of Hebei Province of China (Grant No. F2019201329).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Mansouri, B.; Zanibbi, R.; Oard, D.W. Learning to rank for mathematical formula retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 952–961.
2. Nishizawa, G.; Liu, J.; Diaz, Y.; Dmello, A.; Zhong, W.; Zanibbi, R. MathSeer: A math-aware search interface with intuitive formula editing, reuse, and lookup. In Proceedings of the Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, 14–17 April 2020; pp. 470–475.
3. Mallia, A.; Siedlaczek, M.; Suel, T. An experimental study of index compression and DAAT query processing methods. In Proceedings of the Advances in Information Retrieval: 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, 14–18 April 2019; pp. 353–368.
4. Ni, J.; Ábrego, G.H.; Constant, N.; Ma, J.; Hall, K.B.; Cer, D.; Yang, Y. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv 2021, arXiv:2108.08877.
5. Mehta, S.; Shah, D.; Kulkarni, R.; Caragea, C. Semantic Tokenizer for Enhanced Natural Language Processing. arXiv 2023, arXiv:2304.12404.
6. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
7. Torra, V. Hesitant fuzzy sets. Int. J. Intell. Syst. 2010, 25, 529–539.
8. Pfahler, L.; Morik, K. Self-Supervised Pretraining of Graph Neural Network for the Retrieval of Related Mathematical Expressions in Scientific Articles. arXiv 2022, arXiv:2209.00446.
9. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676.
10. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
11. Razdaibiedina, A.; Brechalov, A. MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents. arXiv 2023, arXiv:2305.04177.
12. Peng, S.; Yuan, K.; Gao, L.; Tang, Z. Mathbert: A pre-trained model for mathematical formula understanding. arXiv 2021, arXiv:2105.00377.
13. Dadure, P.; Pakray, P.; Bandyopadhyay, S. Embedding and generalization of formula with context in the retrieval of mathematical information. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 6624–6634.
14. Ali, J. Hesitant fuzzy partitioned Maclaurin symmetric mean aggregation operators in multi-criteria decision-making. Phys. Scr. 2022, 97, 075208.
15. Ali, J. Probabilistic hesitant bipolar fuzzy Hamacher prioritized aggregation operators and their application in multi-criteria group decision-making. Comput. Appl. Math. 2023, 42, 260.
16. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
17. Mishra, A.R.; Chen, S.M.; Rani, P. Multiattribute decision making based on Fermatean hesitant fuzzy sets and modified VIKOR method. Inf. Sci. 2022, 607, 1532–1549.
18. Mahapatra, G.; Maneckshaw, B.; Barker, K. Multi-objective reliability redundancy allocation using MOPSO under hesitant fuzziness. Expert Syst. Appl. 2022, 198, 116696.
19. Pattanayak, R.M.; Behera, H.S.; Panigrahi, S. A novel high order hesitant fuzzy time series forecasting by using mean aggregated membership value with support vector machine. Inf. Sci. 2023, 626, 494–523.
20. Li, X.; Tian, B.; Tian, X. Scientific Documents Retrieval Based on Graph Convolutional Network and Hesitant Fuzzy Set. IEEE Access 2023, 11, 27942–27954.
21. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
22. Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for bert model compression. arXiv 2019, arXiv:1908.09355.
23. Wang, W.; Bao, H.; Huang, S.; Dong, L.; Wei, F. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. arXiv 2020, arXiv:2012.15828.
24. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
25. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351.
26. Chen, X.; He, B.; Hui, K.; Sun, L.; Sun, Y. Simplified tinybert: Knowledge distillation for document retrieval. In Proceedings of the Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual, 28 March–1 April 2021; pp. 241–248.
27. Tian, X. A mathematical indexing method based on the hierarchical features of operators in formulae. In Proceedings of the 2nd International Conference on Automatic Control and Information Engineering (ICACIE 2017), Hong Kong, China, 26–28 August 2017; pp. 49–52.
28. Xu, Z.; Xia, M. Distance and similarity measures for hesitant fuzzy sets. Inf. Sci. 2011, 181, 2128–2138.
29. Wang, H.; Tian, X.; Zhang, K.; Cui, X.; Shi, Q.; Li, X. A multi-membership evaluating method in ranking of mathematical retrieval results. Sci. Technol. Eng. 2019, 8.
30. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084.
31. Nguyen, H.T.; Duong, P.H.; Cambria, E. Learning short-text semantic similarity with word embeddings and external knowledge sources. Knowl.-Based Syst. 2019, 182, 104842.
32. Oliveira, R.M.; Gonzaga, F.B.; Barbosa, V.C.; Xexéo, G.B. A distributed system for SearchOnMath based on the Microsoft BizSpark program. arXiv 2017, arXiv:1711.04189.
33. Mansouri, B.; Rohatgi, S.; Oard, D.W.; Wu, J.; Giles, C.L.; Zanibbi, R. Tangent-CFT: An embedding model for mathematical formulas. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, Santa Clara, CA, USA, 2–5 October 2019; pp. 11–18.
34. Khattab, O.; Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 39–48.

Figure 1. Process diagram of scientific document retrieval.

Figure 2. Local semantic distillation framework.

Figure 3. Text semantic similarity calculation.

Figure 4. Values of nDCG@10 across various datasets.

Figure 5. Comparison of nDCG (k = 10) values across different methods.

Figure 6. Inference time statistics under different methods.

Table 1. Definition of the membership function.

Evaluation Attribute	Membership Function	Description
Level	$MF_{level}(S_{Qq_r}, S_{Dd_t}) = \exp\left(-\frac{|level_{S_{Qq_r}} - level_{S_{Dd_t}}|}{level_{S_{Qq_r}}}\right)$	$S_{Qq_r}$ and $S_{Dd_t}$ represent the r-th and t-th symbols in the query expression and the stored expression, respectively.
Flag	$MF_{flag}(S_{Qq_r}, S_{Dd_t}) = \{(Lev_{flag_k}, h_{flag_k}) \mid k = 0, 1, \dots, 8\}$	$Lev_{flag_k}$ represents the spatial relationship between $S_{Qq_r}$ and $S_{Dd_t}$. If they are identical, it is assigned a value of 1; otherwise, it is set to 0.
Order	$MF_{order}(S_{Qq_r}, S_{Dd_r}) = \exp\left[-\left(\frac{count_{S_{Qq_r}} - count_{S_{Dd_r}}}{\sigma}\right)^{2}\right]$	$\sigma$ is a balancing factor that ensures the value of $MF_{order}$ falls within the range of 0–1.
Operator	$MF_{operator}(S_{Qq_r}, S_{Dd_r}) = (s_{O}, operator_{S_{Dd_r}})$	When $S_{Dd_r}$ represents an operator, the value of $s_{O}$ is set to 1; otherwise, it is assigned a value of 0.
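The Level and Order membership functions in Table 1 are simple closed forms and can be sketched directly. The function and parameter names below are assumptions, as are the default value of `sigma` and the premise that levels are positive integers starting at 1; the sketch only mirrors the two formulas as printed in the table.

```python
import math


def mf_level(query_level, doc_level):
    """Level membership: exp(-|level_q - level_d| / level_q).

    Assumes query_level > 0 (for example, levels numbered from 1).
    """
    return math.exp(-abs(query_level - doc_level) / query_level)


def mf_order(query_count, doc_count, sigma=1.0):
    """Order membership: exp(-((count_q - count_d) / sigma)^2).

    sigma is the balancing factor from Table 1; 1.0 is a placeholder default.
    """
    return math.exp(-((query_count - doc_count) / sigma) ** 2)


print(mf_level(2, 2))               # 1.0 for symbols on the same level
print(round(mf_level(1, 2), 4))     # 0.3679, one level apart
print(round(mf_order(3, 1, 2), 4))  # 0.3679, counts differ by 2 with sigma = 2
```

Both functions return 1 for an exact match and decay towards 0 as the query symbol and the stored symbol diverge, so their outputs can serve directly as membership degrees in [0, 1].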

Table 2. Statistical expressions and their query statements.

No.	Query Expression	Query Text
1	$\sin^{2} \theta + \cos^{2} \theta = 1$	The basic relationship between sines and cosines is called the Pythagorean theorem
2	$f (x) = a e^{- {(x - b)}^{2} / 2 c^{2}}$	The mathematical representation of the Gaussian function
3	$x = \frac{- b \pm \sqrt{b^{2} - 4 a c}}{2 a}$	Two solutions of any quadratic polynomial can be expressed as follows
4	$E = \frac{1}{2} m v^{2}$	The relationship between the change in kinetic energy of an object and the work performed by the resultant external force
5	$\lim_{x \to a} f (x) = 0$	ƒ(x) can be made as close as desired to 0 by making x close enough but not equal to a
6	$F_{g} = G \frac{M m}{r^{2}}$	Any two particles have a mutual attractive force in the direction of the line connecting their centers
7	$y = n x^{n - 1}$	Derivatives of power functions
8	$f (x) = \sum_{n = 0}^{\infty} \frac{f^{(n)} (x_{0})}{n!} {(x - x_{0})}^{n}$	The Taylor formula uses the information of a function at a certain point to describe its nearby values
9	$\sqrt{a b} \leq \frac{a + b}{2}$	The arithmetic mean of two non-negative real numbers is greater than or equal to their geometric mean
10	$a x^{2} + b x = c$	first explicit solution of the quadratic equation

Table 3. The MAP values of the expression under different datasets.

Dataset	MAP_5	MAP_10	MAP_20
English dataset	0.912	0.833	0.784
Chinese dataset	0.907	0.827	0.776

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
