The text recognition algorithm designed in this paper mainly includes two modules: the text position correction (TPC) module and the encoder-decoder network (EDN) module. The TPC module corrects the detected oblique text into horizontal text, and the EDN module then recognizes the horizontal text. The EDN module consists of the encoder network (EN) and the decoder network (DN). The EN uses dense blocks and a two-layer BLSTM [17] to extract text features and generates feature vector sequences that carry the contextual relations between characters. The DN uses the attention mechanism [18] to weight the encoded feature vectors, which makes more accurate use of character-related information. Then, through a layer of LSTM [19], the DN combines the output of the previous moment with the input of the current moment to jointly determine the recognition result of the current moment. The overall structure is shown in Figure 1.
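As a minimal illustration of the architecture, the following PyTorch-style sketch shows how the three modules compose; the sub-module names and interfaces are placeholders, not the paper's actual code.

```python
import torch.nn as nn

class TextRecognizer(nn.Module):
    """Sketch of the overall pipeline: TPC rectifies the oblique text
    image, then the encoder-decoder network (EDN) recognizes it.
    The three sub-modules are assumed interfaces."""
    def __init__(self, tpc, encoder, decoder):
        super().__init__()
        self.tpc = tpc          # text position correction (Section 2.1)
        self.encoder = encoder  # dense blocks + two-layer BLSTM (Section 2.2)
        self.decoder = decoder  # attention + LSTM (Section 2.3)

    def forward(self, image, prev_tokens=None):
        rectified = self.tpc(image)         # oblique -> horizontal text
        features = self.encoder(rectified)  # feature vector sequence
        return self.decoder(features, prev_tokens)  # character predictions
```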
2.1. Text Position Correction Module
TPC is the main method studied in this paper for oblique text recognition: it corrects oblique text into horizontal text, and recognition is then carried out on the horizontal text. Most traditional text position correction methods are based on the affine transformation [20], which works well on text with a small tilt angle but poorly on text with a large tilt angle, and such methods are difficult to train. This paper proposes an improved TPC method based on the two-dimensional offset idea of deformable convolution [21] and offset sampling [22]; it is a CNN-based coordinate offset regression method. It can be combined with other neural networks for end-to-end training, and the training process is simple and fast. The detailed structure is shown in Figure 2.
As can be seen from Figure 2, the TPC process of this paper is as follows. Firstly, a preprocessing step resizes the input text images to the same size, which speeds up training. Secondly, the spatial features of the pixels [23] are extracted by a CNN to obtain a fixed-size feature map, in which each pixel corresponds to a region of the original image; this is equivalent to splitting the original image into several small blocks, and predicting a coordinate offset for each block works the same way as the two-dimensional offset prediction of deformable convolution. Thirdly, the offsets are superimposed on the normalized coordinates of the original image. Finally, the Resize module uses bilinear interpolation to sample the text feature map back to the original size as the corrected text.
The input of the whole text recognition algorithm is the text bounding box produced by the text detection algorithm. Because text shapes are irregular, the detected bounding boxes vary in size; feeding them directly into the recognition algorithm would slow down its training. Therefore, the preprocessing module first fixes each text bounding box to a uniform size of 64 pixels in height and 200 pixels in width; the CNN then extracts features to obtain the feature map and regresses the coordinate offsets. The detailed structure and parameter configuration of TPC are shown in Table 1.
In Table 1, k3 means the convolution kernel size is 3 × 3, num64 means the number of convolution kernels is 64, s1 means the stride is 1, p1 means the padding is 1, Conv means convolution, and AVGPool means average pooling. The number of convolution kernels gradually increases from the first layer and then decreases; it is finally set to 2 in order to generate a two-dimensional offset feature map of size 2 × 11. This is equivalent to dividing the entire input image into 22 blocks, each corresponding to its own coordinate offset value. The Tanh activation function constrains the predicted offsets to [−1, 1], and the X-axis and Y-axis offsets are returned through two separate channels. Then, the Resize module samples the two-channel offset feature map up to the original image size of 2 × 64 × 200; Sample is a bilinear interpolation up-sampling module used to obtain the corrected text.
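The following PyTorch sketch illustrates this offset-prediction pipeline. The input/output sizes (64 × 200 input, 2 × 2 × 11 offset map, Tanh bounding, bilinear up-sampling to 2 × 64 × 200) follow the description above; the intermediate layer widths are assumptions, not the exact Table 1 configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class OffsetRegressor(nn.Module):
    """Sketch of the TPC offset-prediction CNN (Table 1, loosely)."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, stride=1, padding=1),  # k3 num64 s1 p1
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),    # widths are assumptions
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((2, 11)),    # collapse to the 2 x 11 grid of blocks
            nn.Conv2d(64, 2, kernel_size=1),  # 2 kernels -> x- and y-offset channels
            nn.Tanh(),                        # bound predicted offsets to [-1, 1]
        )

    def forward(self, x):                     # x: (N, in_ch, 64, 200)
        offsets = self.features(x)            # (N, 2, 2, 11)
        # Resize module: bilinear up-sampling back to the input resolution.
        return F.interpolate(offsets, size=(64, 200),
                             mode="bilinear", align_corners=True)  # (N, 2, 64, 200)
```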
Each value in the offset feature map represents the coordinate offset of the corresponding point in the original image. To match the dimensions of the feature map, the coordinates of each pixel in the original image are normalized to the interval [−1, 1]; the normalized map also contains two channels, namely the X-axis channel and the Y-axis channel. Figure 3 compares the original image before and after coordinate normalization.
The image is stored in the computer as a matrix, so in Figure 3 the upper left corner of the image is the origin of the coordinate axes (0, 0), the horizontal axis represents the width of the image, and the vertical axis represents its height. After normalization, the center of the image becomes the origin, the upper left corner in Figure 3 has coordinates (−1, −1), and the lower right corner has coordinates (1, 1). The generated normalized image has two channels, and pixels at the same position on different channels share the same coordinates. After that, the offset feature map is superimposed on the corresponding area of the normalized image to correct the position of each pixel. The formula is expressed as:
T(channel, i, j) = G(channel, i, j) + offset(channel, i, j), channel ∈ {X, Y}, (1)

where channel is the channel index (the X- or Y-axis channel), T represents the feature map after position correction, offset represents the offset feature map, G represents the normalized image, (i, j) represents the coordinates of the normalized image, (i′, j′) represents the coordinates of the original image, (ii′, jj′) represents the revised offset coordinates, F′ represents the corrected image, and F represents the original image.
Adding the corresponding offset to the normalized image shifts every point of the normalized image in both the horizontal and vertical directions: the offset is (∆x, ∆y) and the revised offset coordinate is (ii, jj). The result is then up-sampled to the original size by bilinear interpolation to obtain the corrected image F′, whose corresponding coordinates are (ii′, jj′). The relation between the original image and the corrected image is shown in Formula (2):

F′(ii′, jj′) = F(i′, j′). (2)

The pixel values of the two points remain the same; only the position coordinates change.
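A sketch of the correction step described by Equations (1) and (2) is given below, using torch.nn.functional.grid_sample as the bilinear sampler; grid_sample is an assumed stand-in for the Sample module, and the (x, y) channel order of the offset map is also an assumption.

```python
import torch
import torch.nn.functional as F

def rectify(image, offsets):
    """Add predicted offsets to the normalized [-1, 1] coordinate grid,
    then resample the original image at the shifted coordinates with
    bilinear interpolation.

    image:   (N, C, 64, 200) original text image F
    offsets: (N, 2, 64, 200) per-pixel (dx, dy) offset map
    returns: (N, C, 64, 200) corrected image F'
    """
    n, _, h, w = offsets.shape
    # G: normalized coordinates, origin at the image center,
    # top-left (-1, -1), bottom-right (1, 1).
    ys = torch.linspace(-1.0, 1.0, h, device=image.device)
    xs = torch.linspace(-1.0, 1.0, w, device=image.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).expand(n, h, w, 2)  # (N, H, W, 2)
    # Equation (1): T = G + offset (offsets moved to channels-last).
    sample_at = grid + offsets.permute(0, 2, 3, 1)
    # Equation (2): F'(ii', jj') = F(i', j'), via bilinear sampling.
    return F.grid_sample(image, sample_at, mode="bilinear", align_corners=True)
```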
2.2. Encoder Network
The EN module encodes the spatial and sequential features extracted from text images into fixed feature vectors [24]. The feature extraction network [25] plays a key role in the EN module: a good feature extraction network determines the quality of the encoding and has a great impact on the recognition performance of the whole text recognition algorithm. In this paper, the EN module adopts a dense connection network and BLSTM to extract text features: the dense connection network extracts rich spatial features of text images, and, considering the contextual sequence nature of text, BLSTM learns the feature relations between different characters. The EN module designed in this paper is easy to train and works well. A brief introduction follows:
(1) The dense connection network is stacked from several dense blocks. Exploiting the advantages of DenseNet [26] in feature extraction, the dense connection network improves the flow of information during feature extraction: all layers within a dense block are connected by skip connections, so each convolution layer obtains feature information from all previous layers, reuses multi-layer features, and passes its feature information on to all subsequent layers. At the same time, the direct skip connections make gradients easier to obtain during back propagation, simplify the feature learning process, and alleviate the vanishing gradient problem.
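A minimal dense block illustrating the skip-connected design described above; the growth rate and layer count are illustrative values, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block in the DenseNet style the paper builds on:
    every layer receives the concatenated feature maps of all preceding
    layers (the skip connections), so features are reused and gradients
    flow directly to early layers."""
    def __init__(self, in_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of all earlier outputs.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)  # in_ch + num_layers * growth channels
```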
(2) The detailed structure of the two BLSTMs is shown in Figure 4. Each BLSTM has two hidden layers, recording two states at the current time t: a forward state running from front to back and a backward state running from back to front. The input of the first layer is the sequence of feature vectors extracted by the CNN, {x_0, x_1, …, x_i}; the output after one BLSTM layer is {h^1_0, h^1_1, …, h^1_i}. This is then taken as the input of the second layer, which finally yields the output sequence {h^2_0, h^2_1, …, h^2_i}. As can be seen from Figure 4, the output at each time t is determined by the hidden layer states in both directions. In this paper, two BLSTMs are stacked to learn the feature states of the four hidden layers, which not only stores more memory information but also better learns the relations between feature vectors.
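A sketch of the two stacked BLSTMs of Figure 4; PyTorch's num_layers=2 stacking is used as an equivalent of two chained BLSTM modules, the hidden size of 256 follows Table 2, and the input size is an assumption.

```python
import torch.nn as nn

class StackedBLSTM(nn.Module):
    """Two stacked bidirectional LSTMs: each layer keeps a forward and a
    backward hidden state, so the output at every step t is determined
    by context from both directions."""
    def __init__(self, input_size=512, hidden_size=256):
        super().__init__()
        # num_layers=2 stacks the second BLSTM on the first layer's output.
        self.blstm = nn.LSTM(input_size, hidden_size,
                             num_layers=2, bidirectional=True,
                             batch_first=True)

    def forward(self, seq):          # seq: (N, T, input_size)
        out, _ = self.blstm(seq)     # out: (N, T, 2 * hidden_size)
        return out                   # forward and backward states concatenated
```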
The dense block generates a two-dimensional feature map, while the input of BLSTM must be serialized. Therefore, it is necessary to convert the feature map into a sequence of feature vectors, and then learn the contextual feature relations between sequence elements through BLSTM. Figure 5 shows the process of transforming the feature map into a feature vector sequence: the feature map is sliced column by column at a fixed width, and each vertical slice is taken as one feature vector.
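A sketch of this map-to-sequence conversion with an assumed slice width of one column; the tensor layout is also an assumption.

```python
def map_to_sequence(feature_map):
    """Slice the 2-D feature map column by column and flatten each
    vertical slice into one feature vector, so BLSTM can consume the
    map as a left-to-right sequence.

    feature_map: (N, C, H, W) tensor from the dense blocks
    returns:     (N, W, C * H) sequence of W feature vectors
    """
    n, c, h, w = feature_map.shape
    return feature_map.permute(0, 3, 1, 2).reshape(n, w, c * h)
```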
As can be seen from Figure 5, the character “O” spans multiple feature vectors, and it is impossible to predict the character accurately from only one feature vector. Therefore, learning the correlations between feature vectors through BLSTM plays an important role in character recognition.
The EN module adopts four dense blocks followed by two convolution layers, with one max pooling and activation function layer between them, and then two BLSTM layers. The detailed parameters of the EN module are shown in Table 2.
As can be seen from Table 2, the EN module contains several convolution layers, pooling layers, and activation function layers. The parameters of a convolution layer are the kernel size, the number of kernels, the stride, and the padding, denoted by k, num, s, and p, respectively. All pooling layers use max pooling, with kernel size k and stride s as parameters. The activation function is the Swish function. The number of convolution kernels in the four dense blocks gradually increases; in each dense block, “×4” denotes four consecutive convolution layers, followed by two convolution operations. Finally, two BLSTMs are adopted, each with 256 hidden layer units.
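The pieces above can be assembled roughly as follows. This sketch reuses the DenseBlock, map_to_sequence, and StackedBLSTM sketches, shows only two of the four dense blocks, and uses placeholder channel widths and an adaptive pooling step that are not the exact Table 2 configuration; activation layers are omitted for brevity.

```python
import torch.nn as nn

class EncoderNetwork(nn.Module):
    """Sketch of the EN assembly: dense blocks with pooling, a final
    convolution stage, the map-to-sequence step, and two BLSTM layers."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(1, 64, kernel_size=3, padding=1)
        self.block1 = DenseBlock(64)                 # -> 192 channels
        self.block2 = DenseBlock(192)                # -> 320 channels
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(320, 256, kernel_size=3, padding=1)
        self.reduce = nn.AdaptiveAvgPool2d((4, 25))  # fix H so vectors share one size
        self.blstm = StackedBLSTM(input_size=256 * 4, hidden_size=256)

    def forward(self, x):                    # x: (N, 1, 64, 200)
        x = self.pool(self.block1(self.stem(x)))
        x = self.pool(self.block2(x))
        x = self.reduce(self.conv(x))        # (N, 256, 4, 25)
        seq = map_to_sequence(x)             # (N, 25, 1024)
        return self.blstm(seq)               # (N, 25, 512)
```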
2.3. Decoder Network
The DN module is the reverse process of the EN module: it decodes the encoded feature vectors into output sequences and makes the decoding state as close as possible to the original input state. The text area of a text image usually exists as a sequence of variable length, and its feature vectors are serialized. Therefore, this paper adopts the soft attention mechanism [27] to weight the serialized feature vectors, which effectively exploits the character features at different moments to predict the output value, and finally connects a layer of LSTM, which stores the past state and uses the output of the previous moment to determine the output of the current moment. The detailed structure of the DN is shown in Figure 6.
Figure 6 shows that the feature vector sequence generated by the EN is used directly as the input of the DN. The hidden layers of the BLSTM in the EN contain the contextual features of the text feature vector sequence; the feature vector set can be written as [h_1, h_2, …, h_i, …, h_T], in which the feature h_i generated at each moment i is the combination of the features from the two directions, h_i = [→h_i, ←h_i].
C_t is the semantic encoding vector of the attention model; it represents the weighted sum of the hidden layer features h_i at time t in the LSTM network and is expressed as Equation (3).
In Figure 6, T represents the attention range of the attention mechanism, and its length is 30. If T is too large, the hidden layer must remember too much information and the computation of the model increases rapidly, while a general text statement rarely exceeds 30 words. Too large a T value would also scatter the model's attention, so that the DN module could not focus on the key feature vectors and the decoding effect would be poor. In this paper, the designed DN module feeds the predicted output of the previous moment into the LSTM as the input of the current moment, which serves as a reference for the current prediction. In Figure 6, the output at the current moment can be accurately determined to be “P” based on the past output states. The detailed Formulas (3)–(7) of the whole decoding process are as follows:

C_t = Σ_{i=1}^{T} a_{t,i} h_i, (3)

a_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k}), (4)

e_{t,i} = f(s_{t−1}, h_i), (5)

s_t = f(s_{t−1}, y_{t−1}, C_t), (6)

y_t = g(y_{t−1}, s_t, C_t). (7)
In the above Equations (3)–(7), a_{t,i} represents the attention weight after normalization, e_{t,i} represents the unnormalized attention weight, s_{t−1} represents the hidden layer state of the DN module at time t − 1, s_t represents the hidden layer state of the DN module at time t, f and g represent nonlinear activation functions, and y_t represents the predicted output of the DN module at time t. y_t is determined by the predicted output y_{t−1} of the previous moment, the hidden layer state s_t of the DN module, and the attention semantic encoding C_t.
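One decoding step of Equations (3)–(7) can be sketched as follows; the layer sizes, the vocabulary size, and the additive (Bahdanau-style) form of the scoring function f in Equation (5) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One decoding step: score each encoder feature h_i against the
    previous decoder state s_{t-1}, normalize the scores with softmax,
    form the context vector C_t as the weighted sum, update the LSTM
    state, and predict y_t."""
    def __init__(self, enc_dim=512, hid_dim=256, vocab=37):  # vocab size is a placeholder
        super().__init__()
        self.score = nn.Sequential(                          # e_{t,i} = f(s_{t-1}, h_i)
            nn.Linear(enc_dim + hid_dim, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, 1),
        )
        self.cell = nn.LSTMCell(enc_dim + vocab, hid_dim)        # s_t update, Eq. (6)
        self.out = nn.Linear(hid_dim + enc_dim + vocab, vocab)   # y_t = g(...), Eq. (7)

    def forward(self, H, state, y_prev):
        # H: (N, T, enc_dim) encoder features; state: (s, c) LSTM state;
        # y_prev: (N, vocab) distribution over the previous output.
        s_prev = state[0]
        T = H.size(1)
        s_exp = s_prev.unsqueeze(1).expand(-1, T, -1)
        e = self.score(torch.cat([H, s_exp], dim=-1)).squeeze(-1)  # (N, T), Eq. (5)
        a = F.softmax(e, dim=-1)                                   # Eq. (4)
        C = torch.bmm(a.unsqueeze(1), H).squeeze(1)                # (N, enc_dim), Eq. (3)
        s, c = self.cell(torch.cat([C, y_prev], dim=-1), state)    # Eq. (6)
        y = self.out(torch.cat([s, C, y_prev], dim=-1))            # Eq. (7), logits
        return y, (s, c), a
```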