1. Introduction
Mathematical language is an international, universal language that is not restricted by region. Its main form is the mathematical formula, and mathematical formulas are often the quintessence of technical documents. At present, a large number of mathematical formula images of research value exist in web pages and scientific and technical documents. However, they are mixed with other images, and crawling a page directly retrieves all of its images; if only images containing mathematical formulas are needed, further screening is required.
The essence of mathematical formula image screening is to automatically classify a large number of images into two categories: images with mathematical formulas and images without them. Mathematical formula images can be further divided into two cases: images containing only mathematical formulas, and images in which formulas are interspersed with text or coordinate diagrams. The key difficulty of mathematical formula image screening is how to correctly classify images in which formulas are interspersed with text, illustrations, and other elements as mathematical formula images.
Traditional image classification techniques [1] rely on the designer's prior knowledge and cognitive understanding of the classification task, which limits their performance. In recent years, convolutional neural networks have performed prominently in image feature learning [2,3,4,5]. Convolutional neural networks extract features through autonomous learning, effectively circumventing the many drawbacks arising from hand-crafted feature extraction. LeCun et al. [6] proposed the LeNet-5 network, introducing convolutional neural networks into the field of image classification for the first time [7]. LeNet-5 achieved a classification error rate of 0.8% on handwritten digit recognition, confirming the superiority of convolutional neural networks in image classification. However, due to the lack of large-scale training data and the limits of the theoretical basis and computing power of the time, the recognition performance of LeNet-5 on complex images was not ideal [8]. Since then, various network models have been proposed for classification tasks. Xie et al. [9] proposed ResNeXt, a highly modular network architecture for image classification. ResNeXt combines ResNet [10] and Inception [11]. Unlike Inception v4 [12], the ResNeXt structure does not require the complex design of the Inception structure, adopting the same topology for each branch. Zhang et al. [13] proposed the modular architecture ResNeSt, which applies channel attention to different network branches, achieving cross-feature interaction. Furthermore, attention mechanisms [14,15,16] have also been introduced into image classification tasks. An attention mechanism focuses on important information with high weight, ignores irrelevant information with low weight, learns autonomously, and can continuously adjust the weights. Hu et al. [17] proposed SENet, a module that focuses on channel features and obtains different weights by learning the relationships between channels. Li et al. [18] proposed SKNet, a dynamic selection mechanism for convolution kernels. SKNet allows each neuron to adaptively adjust the size of its convolution kernel according to the multiple scales of the input information. Unlike SENet, SKNet considers not only the weights between channels but also the weights of the different convolutions in each branch, which is equivalent to incorporating a soft attention mechanism [19].
Existing feature processing methods and image classification models enhance feature extraction and classification capabilities. However, since a mathematical formula image may also contain elements such as text and coordinate diagrams, existing classification models struggle to meet the classification requirements of the task in this paper. Given that convolutional neural networks can extract deep features and the attention mechanism can focus on the information that matters for the task, this paper proposes a mathematical formula image screening model based on feature correlation enhancement, with the aim of screening and collecting images containing mathematical formulas for further study. To address the problem that irrelevant features may affect classification accuracy, a feature correlation enhancement (FCE) module was designed. First, the FCE module uses a soft attention mechanism to obtain features carrying channel weights and convolution kernel weights. Then, the features with high-weight information are fused into the self-attention process, so that self-attention strengthens the internal correlation between mathematical formula features and weakens the contribution of other features in the subsequent stage. To address the problem of varying image sizes and the horizontal writing of formulas, a strip multi-scale pooling (SMP) module was designed. The SMP module integrates strip pooling into spatial pyramid pooling so that horizontally written mathematical formula features receive more attention; it also unifies the feature dimension, eliminating the constraint that a convolutional neural network requires a fixed-size input. The experimental results of the AttNeSt model were compared with those of other models, and the comparison shows the superior performance of AttNeSt.
Using the trained model to screen the test images succeeded in screening out most of the images containing mathematical formulas.
The main contributions of this paper are as follows:
To counter the influence of irrelevant features on the model, a feature correlation enhancement (FCE) module was designed. FCE enhances the internal correlation of mathematical formula features through the interaction of soft attention and self-attention, reducing the influence of other features on classification decisions.
To address inconsistent image sizes and the horizontal writing of formulas, a strip multi-scale pooling (SMP) module was designed. SMP removes the input-size constraint by integrating spatial pyramid pooling (SPP) [20] into the network, and then extracts rectangular horizontal features using a strip pooling module (SPM) [21] to increase the attention paid to horizontal structure.
To address the problem of unbalanced datasets, this paper introduces regularization into the binary cross-entropy loss function [22]. By cascading regularization, the improved loss function distributes the weights equally across different image features, which avoids overfitting and speeds up model convergence.
The remainder of the paper is structured as follows: Section 2 describes the mathematical formula image screening method based on feature correlation enhancement; Section 3 analyzes and discusses the experimental results; Section 4 concludes the paper.
2. Materials and Methods
When designing the network, ResNeSt-50 [13] was used as the basic framework to ensure that detailed information could be extracted. Compared with ResNet-50 [23], the essential improvement of ResNeSt-50 is the introduction of the split-attention module, which captures cross-channel relationships through a channel-based attention mechanism. ResNeSt has achieved excellent results in image classification, object detection, instance segmentation, and semantic segmentation tasks. The network structure of the mathematical formula image screening model (AttNeSt) based on feature correlation enhancement is shown in Figure 1. In the figure, "Self" represents the self-attention mechanism and "Soft" the soft attention mechanism. AttNeSt replaces the second convolution layer in the split-attention module with an FCE structure to strengthen the interaction and correlation of mathematical formula feature information and reduce the contribution of useless information. The SMP module is introduced after feature extraction; the main idea is to add a set of horizontal strip pooling operations that strengthen horizontal features after multi-scale pooling and before feature fusion.
2.1. Feature Correlation Enhancement (FCE) Module
The convolution operation processes features within a local receptive field but cannot correlate global information to establish long-distance dependencies. The self-attention mechanism [24] can capture global information, obtaining a larger receptive field and contextual information. SKNet [18] uses convolution kernel attention so that the network adaptively adjusts the size of its receptive field according to the multiple scales of the input information, obtaining features with weight information. The feature correlation enhancement (FCE) module weights this feature weight information into the self-attention to increase the degree of association of the self-attention with the mathematical formula features, thus emphasizing the relevance and global dependency of such features and reducing the influence of other features on classification. The feature correlation enhancement structure is shown in Figure 2; it is mainly divided into three parts: self-attention structure design, soft attention feature extraction, and feature correlation enhancement.
2.1.1. Self-Attention Structure Design
The self-attention mechanism aims to attend to particular details according to the target object. The core problem is how to determine, based on the target, which parts need attention, and then to analyze those details further. This subsection describes the first stage of the self-attention process.
First, the feature space mapping functions f, g, and h are used to transform the input feature map X into three feature spaces Q = f(X), K = g(X), and V = h(X), each with N positions, where N represents the number of pixels. Then, matrix multiplication and normalization are performed on Q and K to obtain the attention map S, as shown in Equation (1), where s_{i,j} ∈ S represents the degree of association between the i-th dimension element in Q and the j-th dimension element in K:
s_{i,j} = exp(Q_i^T K_j) / Σ_{j=1}^{N} exp(Q_i^T K_j)    (1)
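As a concrete illustration, the first-stage computation can be sketched in NumPy as follows; the flattened (channels, pixels) layout and the projection matrices `w_q` and `w_k` are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def attention_map(x, w_q, w_k):
    """First self-attention stage: project x into two feature spaces,
    then compute the row-normalized pairwise association matrix S.
    x: (C, N) feature map flattened over N pixel positions."""
    q = w_q @ x                     # query space, (C', N)
    k = w_k @ x                     # key space,   (C', N)
    scores = q.T @ k                # (N, N) raw associations s_ij
    # softmax over j turns each row into normalized attention weights
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))        # 8 channels, 16 pixels
w_q = rng.standard_normal((4, 8))
w_k = rng.standard_normal((4, 8))
S = attention_map(x, w_q, w_k)
# each row of S sums to 1: s_ij measures how strongly pixel j relates to pixel i
```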
2.1.2. Soft Attention Feature Extraction
First, the feature map
is converted with convolution kernel sizes of 3, 5, and 7 to obtain 3 types of feature information from differnt convolutional kernels,
,
, and
, respectively, and summed to obtain
. Then, global average pooling (GAP) [
25] is used to encode the convolutional layer
to get
,
, and a compression calculation is performed in the
dimension of
to obtain the
th element
in
, as shown in Equation (2). Finally, the weight information of different spatial scales
is obtained:
Use the softmax function for
to obtain
,
, and
.
,
, and
denote the soft attention channel weights of
,
and
, respectively, as shown in Equation (3).
denotes the
th element of
. The same is true for
and
:
, and denote the th row of , and the same is true for and . , denotes the shrinkage rate, denotes the minimum value of . Due to the different image sizes, GAP is used instead of the FC layer, with the advantage of reducing the number of parameters and receiving features of different scales.
The channel weights are multiplied by the feature information of different convolution kernels to obtain the feature vector
, as shown in Equation (4).
fuses the information of multiple receptive fields and increases the weight of the mathematical formula features.
is transformed into a feature map
after ReLU activation and 3 × 3 convolution:
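The extraction-and-fusion pipeline above can be sketched in NumPy as follows; pre-computed branch outputs stand in for the 3/5/7 convolutions, and the square (C, C) score matrices are a simplifying assumption (the reduced inner dimension d is omitted for brevity):

```python
import numpy as np

def softmax3(a, b, c):
    """Element-wise softmax over three competing branch scores."""
    m = np.maximum(np.maximum(a, b), c)
    ea, eb, ec = np.exp(a - m), np.exp(b - m), np.exp(c - m)
    z = ea + eb + ec
    return ea / z, eb / z, ec / z

def soft_attention_fuse(u3, u5, u7, A, B, C):
    """u3/u5/u7: (C, H, W) outputs of the 3/5/7 convolution branches;
    A, B, C: (C, C) learnable matrices producing per-channel scores."""
    u = u3 + u5 + u7                          # sum the branch features
    s = u.mean(axis=(1, 2))                   # GAP over H x W -> channel descriptor
    a, b, c = softmax3(A @ s, B @ s, C @ s)   # per-channel branch weights
    # weighted recombination: a_c + b_c + c_c = 1 for every channel c
    v = (a[:, None, None] * u3 + b[:, None, None] * u5
         + c[:, None, None] * u7)
    return v, (a, b, c)
```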
2.1.3. Feature Correlation Enhancement
The first stage of self-attention establishes the correlation between Q and K; the process gives attention to elements with high similarity through elemental interactions. Feature correlation enhancement is the second stage of self-attention. First, the self-attention features are fused with the soft attention feature map F_soft, which carries the weight information, to obtain a self-attention feature map with channel weights and convolution kernel weights. Then, the contribution of the convolution kernel and channel weights to the mathematical formula features is enhanced by calculating the degree of correlation between this fused map and the value features V. The specific calculation procedure is as follows.
The Concat [26] feature fusion mechanism splices two or more feature maps by channel or dimension. For the task of this paper, splicing along the channel dimension better expresses the channel weights of the soft attention features, which in turn gives the mathematical formula feature maps more feature representations within the self-attention features and strengthens the contribution of such features. Concat feature fusion requires equal feature map width and height; this paper solves the problem by upsampling to obtain feature maps with the same width W and height H.
Assume that the numbers of channels of the two feature maps are c_1 and c_2, respectively, let X_i denote the i-th channel tensor of the self-attention features, and let Y_i denote the i-th channel tensor of the soft attention features. The output channels Z_i are obtained after the Concat operation, as shown in Equation (5). At this point, the number of channels of the feature map becomes c_1 + c_2, and the feature map is denoted F_cat:
Z_i = X_i for 1 ≤ i ≤ c_1,  Z_i = Y_{i−c_1} for c_1 < i ≤ c_1 + c_2    (5)
The matrix multiplication operation is performed between F_cat and V, and a 1 × 1 convolution is then applied to obtain the correlation-enhanced self-attention feature map F_FCE, as shown in Equation (6):
F_FCE = Conv_{1×1}(F_cat ⊗ V)    (6)
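A minimal sketch of the Concat fusion and the 1 × 1 convolution (written as a channel-mixing matrix); the shapes are illustrative assumptions, and the intermediate matrix multiplication with the value features is omitted for brevity:

```python
import numpy as np

def fce_output(f_attn, f_soft, w_1x1):
    """Channel-concatenate the self-attention and soft-attention feature
    maps (already upsampled to the same H x W), then mix channels with a
    1 x 1 convolution, expressed here as a matrix over channels.
    f_attn: (c1, H, W), f_soft: (c2, H, W), w_1x1: (c_out, c1 + c2)."""
    f_cat = np.concatenate([f_attn, f_soft], axis=0)   # channels: c1 + c2
    c, h, w = f_cat.shape
    out = w_1x1 @ f_cat.reshape(c, h * w)              # 1x1 conv = per-pixel channel mix
    return out.reshape(-1, h, w)
```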
2.2. Strip Multi-Scale Pooling (SMP) Module
Spatial pyramid pooling [20] provides a multi-scale pooling structure that unifies the feature dimensions of different inputs and is widely used in image classification. Strip pooling [21] proposes a strategy using a long but narrow kernel, allowing the network to efficiently model long-range dependencies by focusing on horizontal or vertical features. Considering that the formula part of a mathematical formula image is written horizontally, this paper draws on SPP and SPM to design the SMP module. SMP can focus on horizontal information while unifying the feature dimension, increasing the expressive power of mathematical formula features and thus improving classification accuracy. The structure is shown in Figure 3 (to simplify the figure, only two pooling scales are shown).
In the process of unifying feature dimensions, the size of the pooling kernel varies with the size of the input image. The pooling kernel (Filter) and step (Stride) are shown in Equations (7) and (8), where the Filter is rounded up, the Stride is rounded down, m × m is the input feature map size, and l × l is the desired output size:
Filter = ⌈m / l⌉    (7)
Stride = ⌊m / l⌋    (8)
An example of a Level 3 SPP is shown in
Table 1.
The input size in Table 1 indicates the size of the feature map input to the SPP structure, corresponding to m × m in Figure 3, and the output size indicates the desired output size, corresponding to l × l in Figure 3. With the three-level SPP structure, two feature maps of different sizes yield outputs of the same length. If the input size changes, the pooling kernel and step size change accordingly to ensure equal-length outputs. Taking an input size of 10 × 10 as an example, the feature map undergoes three pooling operations with pooling kernels of 10, 4, and 2 and step sizes of 10, 3, and 2 to obtain outputs of 1 × 1, 3 × 3, and 5 × 5, respectively. Similarly, a 15 × 15 feature map uses different pooling kernels and step sizes to obtain 1 × 1, 3 × 3, and 5 × 5 outputs.
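These formulas are easy to sanity-check in code; the following sketch (function and variable names are illustrative) reproduces the 10 × 10 example:

```python
import math

def spp_params(m, l):
    """Equations (7) and (8): pooling kernel and stride that map an
    m x m input feature map to an l x l pooled output."""
    filt = math.ceil(m / l)      # Filter, rounded up
    stride = math.floor(m / l)   # Stride, rounded down
    return filt, stride

# Reproduce the 10 x 10 example: kernels 10, 4, 2 with strides 10, 3, 2
params = {l: spp_params(10, l) for l in (1, 3, 5)}
# check the pooled output really is l x l: (m - filt) // stride + 1 == l
ok = all((10 - f) // s + 1 == l for l, (f, s) in params.items())
```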
After the multi-scale pooled feature maps are obtained, strip pooling is performed. The realization process of horizontal strip pooling is shown in the dotted box in Figure 3. First, each l × l feature map is transformed into an l × 1 feature after horizontal strip pooling. The implementation is to calculate the average pixel value over the horizontal extent of the feature map covered by the pooling kernel, as shown in Equation (9), where 0 ≤ i < l:
y_i = (1 / l) Σ_{0 ≤ j < l} x_{i,j}    (9)
Then, a convolution with a kernel size of Filter is used to expand the strip along the top and bottom so that the expanded feature map has the same size as the original feature map. After a 1 × 1 convolution and Sigmoid activation, the resulting map is multiplied element-wise with the original feature map. The feature maps of all scales are then fused into a single feature vector, so images of all sizes are unified into a fixed dimension and input into the fully connected layer.
The horizontal strip pooling considers the horizontal range rather than the whole feature map, reinforcing the information about the position of the formulas written horizontally in the feature map. Since the weights of the target features (mathematical formula features) have been increased during feature extraction, SMP pays more attention to the horizontal formula features and less attention to the text, which is also a horizontal feature.
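The strip pooling and gating steps described above can be sketched as follows; replacing the learned expansion convolution with a plain row-wise repeat is a simplifying assumption:

```python
import numpy as np

def horizontal_strip_pool(x):
    """Average each row of an l x l map into an l x 1 strip (Equation (9)),
    expand the strip back to l x l (a plain repeat stands in for the
    learned expansion convolution), apply Sigmoid, and gate the input."""
    strip = x.mean(axis=1, keepdims=True)            # (l, l) -> (l, 1)
    expanded = np.repeat(strip, x.shape[1], axis=1)  # back to (l, l)
    gate = 1.0 / (1.0 + np.exp(-expanded))           # Sigmoid activation
    return x * gate                                  # element-wise re-weighting
```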
2.3. Loss Function
The images in the dataset of this paper were randomly crawled from the web and take various forms. To balance the dataset and improve model accuracy and generalization, this paper improves the binary cross-entropy loss function (BC). The binary cross-entropy loss function is shown in Equation (10), where N denotes the total number of samples, y_i denotes the label of sample i, and p_i denotes the probability that sample i is predicted as category 1:
L_BC = −(1 / N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]    (10)
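Equation (10) can be computed directly; a minimal sketch:

```python
import numpy as np

def bce_loss(y, p):
    """Binary cross-entropy (Equation (10)): y holds the 0/1 labels,
    p the predicted probabilities of category 1."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# confident correct predictions -> small loss
loss = bce_loss([1, 0, 1], [0.9, 0.1, 0.8])
```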
Regularization is then incorporated into BC [27,28]. L1 and L2 regularization [29,30] are shown in Equations (11) and (12), where λ is the regulating factor between the loss function and the regularization term, n is the number of samples in the training set of the model, and w denotes the weight parameters of the model:
L_1 = L_BC + (λ / n) Σ_w |w|    (11)
L_2 = L_BC + (λ / 2n) Σ_w w²    (12)
When only L1 regularization is used, the same penalty is given to all weight parameters. When only L2 regularization is used, a large penalty is given to parameters with larger weights and a small penalty to parameters with smaller weights. The improved binary cross-entropy loss function (IBC) is shown in Equation (13), where |w| is the absolute value of a weight parameter, Σ|w| is the L1 norm of the weight parameters, Σw² is the square of the L2 norm of the weight parameters, λ is the adjustment factor between the loss function and the regularization terms, and α is the adjustment factor between the L1 and L2 regularization, so that Equation (13) degenerates to L2 regularization if α = 0 and to L1 regularization if α = 1:
L_IBC = L_BC + λ [ α Σ_w |w| + ((1 − α) / 2) Σ_w w² ]    (13)
L1 regularization randomly retains one of a group of correlated features and drops the others, while L2 regularization selects their mean when the image features follow a Gaussian distribution. Therefore, in IBC, L1 regularization is introduced first for feature selection, and L2 regularization is then introduced to handle the covariance among image features; through this cascade of regularization, the weights are divided equally among the various image features, retaining the useful ones.
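As one concrete reading of the cascaded regularization in Equation (13) (the exact constant factors and the default values of λ and α below are assumptions, not the paper's settings), the IBC loss can be sketched as:

```python
import numpy as np

def ibc_loss(y, p, w, lam=1e-3, alpha=0.5):
    """Cascaded-regularization reading of Equation (13): binary
    cross-entropy plus lam * (alpha * L1 + (1 - alpha)/2 * L2^2) on the
    weights w. alpha = 1 degenerates to pure L1, alpha = 0 to pure L2."""
    y, p, w = (np.asarray(a, float) for a in (y, p, w))
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    l1 = np.sum(np.abs(w))       # encourages sparse feature selection
    l2 = np.sum(w ** 2)          # spreads weight over correlated features
    return bce + lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)
```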
4. Conclusions
To screen images containing mathematical formulas in web pages or scientific documents, we designed a network model, AttNeSt, that can screen images containing mathematical formulas from among many kinds of images. First, the feature correlation enhancement (FCE) module was designed to improve the contribution of mathematical formula features in the self-attention feature maps. Then, the strip multi-scale pooling (SMP) module was designed so that input images retain their original sizes and horizontal formula features receive focused attention. Finally, regularization was incorporated into the binary cross-entropy loss function to balance the dataset. The experimental results show that the ACC of AttNeSt is 7.18% better than that of ResNeSt, demonstrating its superior performance compared with other methods. Good results were obtained when the trained AttNeSt network model was used to screen the blended images, as shown in the results of Section 3.4.2 and Section 3.4.4. For images where mathematical formulas are interspersed with text or illustrations, the model is able to screen most of these images correctly.
Although the model in this paper can accurately screen images containing mathematical formulas in most cases, errors remain. For example, mathematical formulas are also present in Figure 8g, but it was classified into category 0. The reason for this classification error is that the mathematical formula occupies too small a proportion of the image. In addition, embedded formulas, edge formulas that are not easily recognized, and formulas interspersed with text whose features are difficult to distinguish may also cause classification errors. Follow-up work will explore new approaches to improve the experiment and achieve better screening results.
The trained AttNeSt network model can screen out images containing mathematical formulas from a large number of images, which facilitates the creation of a database of mathematical formula images. In subsequent work, images from a large number of relevant documents or web pages will be crawled, and those containing mathematical formulas will be screened using the AttNeSt model. This will increase the number of available mathematical formula images, aiding the training of mathematical formula retrieval, extraction, and recognition models, and thus making it easier for readers to retrieve the mathematical formulas they want.