Article

Unsupervised Scene Image Text Segmentation Based on Improved CycleGAN

1 School of Cyberspace Security and Computer, Hebei University, Baoding 071000, China
2 Institute of Intelligence Image and Document Information Processing, Hebei University, Baoding 071000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(11), 4420; https://doi.org/10.3390/app14114420
Submission received: 28 March 2024 / Revised: 16 May 2024 / Accepted: 20 May 2024 / Published: 23 May 2024
(This article belongs to the Special Issue Advances in Neural Networks and Deep Learning)

Abstract

Scene image text segmentation is an important task in computer vision, but the complexity and diversity of backgrounds make it challenging. Supervised image segmentation requires paired semantic label data to ensure segmentation accuracy, but semantic labels are often difficult to obtain. To solve this problem, we propose an unsupervised scene image text segmentation model based on the image style transfer model CycleGAN (cycle-consistent generative adversarial network), which is trained with unpaired data. Text segmentation is achieved by converting a complex background into a simple one. Since the images generated by CycleGAN cannot retain the details of the text content, we also introduce the atrous spatial pyramid pooling (ASPP) module to capture text features at multiple scales, which improves the quality of the generated images. The proposed method is verified by experiments on a synthetic data set, the IIIT 5K-word data set and the MACT data set; it effectively segments the text and preserves the details of the text content.

1. Introduction

Scene text has been studied extensively over the last decade, and the development of convolutional neural networks has produced many solutions. Scene text segmentation focuses on precisely separating the text from the background and outlining the character blocks. Text segmentation is therefore an effective means of obtaining text information from images and a research hotspot in computer vision.
The difficulty of this task is that the font, color, and shape of scene text vary widely. Early image-processing-based methods are mainly threshold algorithms [1,2,3,4,5,6], but they are very sensitive to the choice of threshold in complex backgrounds, and their adaptability to different scenes is limited. Methods combining text segmentation with probabilistic models [7,8,9,10] face challenges such as data dependence and parameter estimation. Clustering methods [11,12,13] have also been used for text segmentation. With the development of neural networks, convolution-based methods can handle a wide variety of cases; however, their bottleneck is a strong dependence on large-scale data, especially manually labeled data. For this reason, unsupervised segmentation networks have been proposed, such as fully convolutional adaptation networks (FCANs) [14], CyCADA [15] and CrDoCo [16]. However, these methods address the segmentation of street scenes and are of little help for text segmentation.
To this end, we propose an unsupervised method for text segmentation against complex backgrounds. We treat scene text segmentation as a style transfer between images, which is a novel perspective. We choose CycleGAN [17], which performs unsupervised style transfer: its two GAN networks constrain each other to form a cyclic adversarial structure. A scene text image with a complex background is transformed into a text image with a simple background, so the model does not need to handle interference from the complex background, and CycleGAN learns this mapping without supervision. Scene image text segmentation is relatively simple in terms of semantic content, but the details of text strokes matter more. Therefore, to improve the edge detail of the generated image, we introduce the ASPP module [18]. When extracting image features, ASPP enlarges the receptive field, captures richer global information, improves the edge information of the segmentation images generated by CycleGAN, and reduces edge noise.
This paper makes the following contributions:
  • This paper regards scene image text segmentation as a style transfer problem between images: a scene text image with a complex background is transformed into a text image with a concise background, which facilitates subsequent processing. This transformation offers a novel perspective on the task.
  • This paper uses CycleGAN for the text segmentation of scene images. Because the edges of the segmentation images generated by CycleGAN lack detail and clarity, the ASPP module is introduced into CycleGAN’s generator to obtain multi-scale text features and retain edge details in the generated text images.
  • This paper conducts experiments on a synthetic data set, the IIIT 5K-word data set, the MACT data set and the MTWI data set. Experimental results show that our method can effectively segment text while retaining clear edge information.

2. Related Work

2.1. Scene Text Segmentation

Scene text with a complex background is the most challenging problem in scene text segmentation. Traditional text segmentation methods mainly include threshold algorithms, probabilistic model algorithms and clustering algorithms. Each of these methods has its own characteristics and plays an important role in different application scenarios. As a key technology in text segmentation, the threshold algorithm is mainly used to distinguish the foreground (such as text) from the background in an image. According to the thresholding strategy, it can be further divided into global threshold methods [1,2,3] and local threshold methods [4,5,6]. Probabilistic model approaches [7,8,9,10] have significant advantages in dealing with uncertainty, prediction and interpretation; the Gaussian mixture model and the Markov random field model are representative examples. Clustering methods [11,12,13] perform well in extracting degraded text from videos, mainly by mining the similarity between pixels to aggregate text regions. Convolutional neural networks can process all kinds of scene images, but their bottleneck is a strong dependence on large-scale data, especially manually labeled data. Bonechi et al. [19] generated COCO-TS and MLT-S by weakly supervised processing of the COCO-Text [20] and MLT [21] data sets to obtain pixel-level annotations; however, the annotation quality of the machine-generated COCO-TS data set is still far from that of manual annotation. Wang et al. [22] proposed a network in which two branches guide each other: one branch generates a polygon mask, and the other generates a pixel-level text mask. The whole network is trained in a semi-supervised mode, but the masks are generated during data preprocessing. The accuracy of supervised segmentation methods depends on paired data, yet the amount of paired data is limited and manual labeling is time consuming. Therefore, in recent years, segmentation algorithms built around unsupervised methods have attracted much attention, such as fully convolutional adaptation networks (FCANs) [14], CyCADA [15] and CrDoCo [16]. Transferring FCAN [14] from game video (GTA5) to urban street scenes achieved good results, but its effectiveness on more specific segmentation scenes remains to be verified. The CyCADA network [15] combines a cycle-consistent adversarial image-translation model with adversarial adaptation methods, mainly for urban street scenes and digits. The unsupervised domain-adaptive adversarial learning method of the CrDoCo network [16] also targets the semantic segmentation of street scenes.

2.2. Image-to-Image Style Transfer

Image-to-image translation transforms a source domain into a target domain and includes image style transfer. The idea originates from the image analogies proposed by Hertzmann et al. [23]. Recently, one-to-one image domain transformations have been scaled up. Isola et al. [24] proposed the Pix2Pix architecture, which uses an end-to-end generative adversarial network as its framework; to preserve the underlying structure of the images, skip connections are used in the generator. However, the network is built on a one-to-one mapping between the X and Y domains, which can lead to poor output diversity. Zhu et al. [25] designed BicycleGAN to address this problem by learning a bijective mapping between latent codes and outputs, achieving diverse results. The pix2pixHD model [26], proposed by Wang et al., builds on Pix2Pix and produces higher-resolution images; its multi-scale generator can also generate features at a fine-grained level. pix2pixHD is currently among the most advanced supervised methods. However, supervised approaches are complicated by their need for paired data, so approaches that do not require paired data have received more attention. Zhu et al. proposed CycleGAN [17], which cleverly combines GAN networks with a cycle-consistency loss and uses unpaired data to accomplish image style transfer. In this paper, we use CycleGAN to treat natural scene text segmentation as an image-to-image style transfer.

3. Scene Text Segmentation Image Generation

3.1. Overview of the Architecture

CycleGAN [17] has a ring-shaped structure, as shown in Figure 1a, mainly consisting of two generators G_A, G_B and two discriminators D_A, D_B. Here, a denotes images in domain A, and b denotes images in domain B. CycleGAN operates in two directions. As shown in Figure 1b, an original image from domain A is transferred to domain B by the generator G_A and then transferred back to domain A by the generator G_B. Figure 1c shows the process in the opposite direction. The discriminators D_A and D_B provide the adversarial signal that ensures the style transfer of the image.
We use CycleGAN to perform text segmentation of scene images. Images in domain A are scene text images with complex backgrounds; images in domain B are text images without backgrounds and can be regarded as images with semantic labels. However, the images in domains A and B are not paired. In this way, we convert the scene image text segmentation task into an unsupervised image style transfer task.
The generation process of the scene text segmentation network is shown in Figure 2. The scene text image a with a complex background is converted by the generator G_A into the background-free image b̂ and then reconstructed back into a complex-background image â by the generator G_B. The discriminator D_B determines whether the generated image is a text image with the background removed. The background-free image b is converted into an image ã with a complex background by the generator G_B and then restored to the background-free image b̃ by the generator G_A. The discriminator D_A determines whether the generated images are complex-background images. The text segmentation images generated by the original CycleGAN are weak in detail, so we modify the network structure by introducing the ASPP module into the generator to enhance the detail of the segmented images.
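To make the two generation directions concrete, the following sketch traces one forward pass of the cycle just described (an illustrative Python sketch; the actual implementation uses the MindSpore framework, and the function and variable names here are ours):

```python
def cycle_forward(g_a, g_b, d_a, d_b, real_a, real_b):
    """One forward pass of the two-direction cycle.
    real_a: complex-background scene text images (domain A)
    real_b: background-free text images (domain B)"""
    fake_b = g_a(real_a)  # a -> b_hat: background removed
    rec_a = g_b(fake_b)   # b_hat -> a_hat: complex background reconstructed
    fake_a = g_b(real_b)  # b -> a_tilde: complex background synthesized
    rec_b = g_a(fake_a)   # a_tilde -> b_tilde: background removed again
    # D_B judges whether an image looks like a background-free (domain B) image;
    # D_A judges whether an image looks like a complex-background (domain A) image.
    return fake_b, rec_a, fake_a, rec_b, d_b(fake_b), d_a(fake_a)
```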
After training, we use the generator G_A to segment the text in scene images: feeding a domain-A scene text image into G_A generates the corresponding background-free text image, completing scene text segmentation.

3.2. Network Architecture

3.2.1. ASPP

In segmentation tasks, preserving the image resolution and context information at different scales is very important for accurately segmenting the target. The atrous convolutions in the ASPP module [18] expand the receptive field without sacrificing feature-map resolution and capture multi-scale context information by applying atrous convolutions with different dilation rates in parallel.
Figure 3 shows the concrete structure. ASPP consists of a 1 × 1 convolution, parallel atrous convolutions forming a pooling pyramid, and ASPP pooling. ASPP can extract multi-scale features because the dilation rate of each pyramid branch can be customized.
Atrous convolution differs from ordinary convolution in that it has a dilation rate, which controls the padding and dilation used during convolution. Different padding and dilation settings yield receptive fields of different scales, from which multi-scale information can be extracted. Note that the convolution kernel size remains 3 × 3.
ASPP pooling starts with an AdaptiveAvgPool2d layer. Adaptive average pooling does not require a kernel size or stride; only the final output size (1 × 1 here) needs to be specified. Compressing the feature map of each channel to 1 × 1 extracts per-channel, global features. A 1 × 1 convolution then further processes these features and reduces their dimensionality. Note that ASPP pooling only extracts features; after these layers are executed, the feature map is upsampled from 1 × 1 back to its original size.
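As a concrete illustration, a minimal ASPP block along these lines can be written as follows. This is a PyTorch-style sketch with the dilation rates 6, 12 and 18 used in our experiments; the paper’s implementation is in MindSpore, and normalization and activation layers are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Simplified ASPP block: a 1x1 convolution, three 3x3 atrous convolutions
    (dilation rates 6, 12, 18), and a global-average-pooling branch,
    fused by a final 1x1 convolution."""

    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False)
             for r in rates])
        # ASPP pooling branch: compress each channel to 1x1, then a 1x1 conv.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))
        # Fuse all branches (1x1 conv + atrous convs + pooled branch).
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        # Upsample the 1x1 global feature back to the input resolution.
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))
```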

3.2.2. Generator

CycleGAN includes two generators, both with the same structure, composed of an encoder, a converter, an ASPP module and a decoder. The generator G_A realizes the conversion from a complex-background image to a background-free image, G_A: A → B, removing the complex background. The generator G_B, on the other hand, restores the background-free image to a complex-background image, G_B: B → A. Figure 4 shows the internal structure of the generator in detail.
Both the input and output of the generator are RGB images. To reduce the computational effort, the input image is mapped from a higher dimension to a lower dimension, i.e., it is downsampled, using one convolutional layer with stride 1 and two convolutional layers with stride 2. At comparable computational cost, this yields a larger receptive field and more detailed features, which helps the style transfer. The converter consists of six residual blocks that extract image features and convert the extracted feature vectors from domain A to domain B. The converted features then enter the newly introduced ASPP module, which processes them further to obtain global features, fusing features from multiple scales and improving segmentation details. The decoder reconstructs the converter’s output to the size of the original image; it is the inverse of the encoder, using deconvolution (transposed convolution) layers to map the features from low to high dimensions until the generated image matches the original image in size.
When CycleGAN alone is used for scene text segmentation, the text edges are not clear enough and contain a lot of noise. We attribute this to the convolution process ignoring global information, which leaves the edge contours indistinct. ASPP is well known for its ability to capture global information, so we introduce ASPP into the generator.
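Putting the pieces together, a simplified version of the modified generator (encoder, six residual blocks, ASPP, decoder) might look as follows. This is an illustrative PyTorch-style sketch that reuses the ASPP class from the earlier snippet; the normalization and padding choices shown are common CycleGAN defaults rather than details taken from our MindSpore implementation:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of the converter: two 3x3 convolutions around a skip connection."""

    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """Encoder (one 7x7 stride-1 conv and two 3x3 stride-2 convs with 64/128/256
    output channels), six residual blocks, an ASPP block, and a symmetric decoder."""

    def __init__(self, n_res=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(True))
        self.converter = nn.Sequential(*[ResidualBlock(256) for _ in range(n_res)])
        self.aspp = ASPP(256, 256)  # ASPP block from the earlier sketch
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 3, 7, stride=1, padding=3), nn.Tanh())

    def forward(self, x):
        return self.decoder(self.aspp(self.converter(self.encoder(x))))
```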

3.2.3. Discriminator

The discriminator has a simple structure. Its role is to determine whether the input image belongs to a particular domain. As shown in Figure 5, the discriminator first uses four convolutional layers with activation functions to extract and process image features, and finally obtains a patch output through a convolution with one output channel to perform the “discrimination”.
In a traditional GAN, the discriminator usually ends with a sigmoid function to reduce the output to a single value between 0 and 1, which expresses its judgment of the input image, with 1 representing real and 0 representing fake. The discriminator in CycleGAN, however, adopts the idea of PatchGAN [17]. “Patch” means that after a series of convolutional layers, the feature map is not fed into a fully connected layer or a simple activation function; instead, a convolution with one output channel maps the features to an N × N matrix, which plays the role of the traditional single evaluation value. Compared with outputting only one value, the N × N output has the advantage that each element scores a small region of the original image, which is the meaning of “patch”. Instead of measuring the whole image with a single value, the image is evaluated with an N × N matrix, and the labels are also set in N × N format so that the loss can be computed. The advantage of PatchGAN is that it uses larger receptive fields and can take more regions into account.
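For illustration, a PatchGAN-style discriminator of this kind can be sketched as follows (PyTorch-style; the layer widths and LeakyReLU slope are common defaults, not values specified in this paper):

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: four strided convolutions with LeakyReLU
    activations, then a one-channel convolution that outputs an N x N patch map
    instead of a single real/fake score."""

    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base
        for _ in range(3):
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        # One output channel: each element of the N x N map scores one image patch.
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```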

3.3. Loss Functions

3.3.1. Adversarial Loss

Image-to-image translation here is the translation between domain A and domain B. In the CycleGAN model, the translation from an image in domain A to an image in domain B is performed by the generator G_A, and the discriminator D_B then judges whether the result is real. This is the adversarial process of a GAN, and the loss it generates is defined by the following formula:
$$\mathcal{L}_{GAN}(G_A, D_B, A, B) = \mathbb{E}_{b \sim P_{\mathrm{data}}(b)}\big[\log D_B(b)\big] + \mathbb{E}_{a \sim P_{\mathrm{data}}(a)}\big[\log\big(1 - D_B(G_A(a))\big)\big]$$
The ultimate goal is shown below:
$$G_A^* = \arg\min_{G_A}\max_{D_B} \mathcal{L}_{GAN}(G_A, D_B, A, B)$$
Conversely, the other generator, G_B, transforms an image in domain B into an image in domain A, and the other discriminator, D_A, judges whether the generated data are real. The loss resulting from this adversarial process is as follows:
$$\mathcal{L}_{GAN}(G_B, D_A, B, A) = \mathbb{E}_{a \sim P_{\mathrm{data}}(a)}\big[\log D_A(a)\big] + \mathbb{E}_{b \sim P_{\mathrm{data}}(b)}\big[\log\big(1 - D_A(G_B(b))\big)\big]$$
The ultimate goal is shown below:
$$G_B^* = \arg\min_{G_B}\max_{D_A} \mathcal{L}_{GAN}(G_B, D_A, B, A)$$

3.3.2. Cycle-Consistent Loss

CycleGAN adds a cycle-consistency loss to learn the mappings of G_A and G_B. G_A transforms an image a in domain A into the domain-B space, and G_B then restores it to a. Without this constraint, all images in domain A could be mapped to the same image in domain B; to avoid this, the loss between a and G_B(G_A(a)) is computed. Similarly, all images in domain B could be mapped to the same image in domain A, so the loss between b and G_A(G_B(b)) is also computed. The cycle-consistency loss makes CycleGAN training more stable: it adds a constraint that preserves the characteristics of the input image and reduces the mapping space. The cycle-consistency loss is defined as follows:
$$\mathcal{L}_{cyc}(G_A, G_B) = \mathbb{E}_{a \sim P_{\mathrm{data}}(a)}\big[\lVert G_B(G_A(a)) - a \rVert_1\big] + \mathbb{E}_{b \sim P_{\mathrm{data}}(b)}\big[\lVert G_A(G_B(b)) - b \rVert_1\big]$$

3.3.3. Identity Loss

The generator G_A generates B-style images, so feeding b into G_A should also produce b; that is, G_A(b) and b should be as similar as possible. Without this loss, the generator may modify the tone of the image, causing the overall color to change.
$$\mathcal{L}_{Identity}(G_A, G_B) = \mathbb{E}_{b \sim P_{\mathrm{data}}(b)}\big[\lVert G_A(b) - b \rVert_1\big] + \mathbb{E}_{a \sim P_{\mathrm{data}}(a)}\big[\lVert G_B(a) - a \rVert_1\big]$$

3.3.4. Overall Loss

So, the total loss generated by the whole network training is shown below:
$$\mathcal{L}_{total} = \mathcal{L}_{GAN}(G_A, D_B, A, B) + \mathcal{L}_{GAN}(G_B, D_A, B, A) + \alpha\,\mathcal{L}_{cyc}(G_A, G_B) + \beta\,\mathcal{L}_{Identity}(G_A, G_B)$$
Here, α is the weight of the cycle-consistency loss, and β is the weight of the identity loss.
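The following sketch shows how the total generator-side objective can be assembled in code. It is illustrative only: the adversarial term is written in the least-squares form that many CycleGAN implementations use rather than the log form of the equations above, and the α and β values shown are placeholders, not values from this paper:

```python
import torch
import torch.nn as nn

adv_criterion = nn.MSELoss()  # least-squares adversarial loss (implementation choice)
l1_criterion = nn.L1Loss()

def generator_total_loss(g_a, g_b, d_a, d_b, real_a, real_b, alpha=10.0, beta=5.0):
    """Assemble the generator-side objective: two adversarial terms, the
    cycle-consistency term weighted by alpha, and the identity term weighted by beta."""
    fake_b, fake_a = g_a(real_a), g_b(real_b)

    # Adversarial terms: G_A tries to fool D_B, G_B tries to fool D_A.
    pred_b, pred_a = d_b(fake_b), d_a(fake_a)
    loss_gan = adv_criterion(pred_b, torch.ones_like(pred_b)) + \
               adv_criterion(pred_a, torch.ones_like(pred_a))

    # Cycle-consistency: a -> b_hat -> a_hat should recover a, and b likewise.
    loss_cyc = l1_criterion(g_b(fake_b), real_a) + l1_criterion(g_a(fake_a), real_b)

    # Identity: a domain-B image fed to G_A (and a domain-A image fed to G_B)
    # should come back unchanged, which keeps the overall color stable.
    loss_idt = l1_criterion(g_a(real_b), real_b) + l1_criterion(g_b(real_a), real_a)

    return loss_gan + alpha * loss_cyc + beta * loss_idt
```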

4. Experiments

4.1. Experimental Setting

4.1.1. Data Preparation

The training and test sets used in the experiments were generated with TextRecognitionDataGenerator [27], an open-source OCR data set generation project that can synthesize scene text images customized to a specific scenario. We synthesized 26,352 images using 10,000 images as backgrounds and the text set in the project as foregrounds, including 13,143 images in domain A (complex-background text images) and 13,209 images in domain B (background-free text images). The training set consists of 10,143 domain-A images and 10,009 domain-B images; the test set consists of 3000 domain-A images and 3200 domain-B images.

4.1.2. Experimental Parameter Settings

The experiments were performed on an Ascend 910 with the MindSpore framework. The input and output of the generators G_A and G_B are both 256 × 256 RGB images. The encoder in the generator comprises a 7 × 7 convolution with stride 1 and two 3 × 3 convolutions with stride 2, with 64, 128 and 256 output channels, respectively; each residual block consists of two 3 × 3 convolutional layers built around a skip connection. The ASPP dilation rates are set to 6, 12 and 18, and the decoder mirrors the encoder. The discriminators D_A and D_B follow the PatchGAN design. We set the batch size to 1 and train the network for 100 epochs: the learning rate is 0.0002 for the first 50 epochs and then decays linearly to 0 over the last 50 epochs. The Adam optimizer is used throughout.
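The optimizer and learning-rate schedule described above can be expressed, for example, as follows (a PyTorch-style sketch; the Adam betas shown are common GAN defaults and are not reported in this paper):

```python
import torch

def make_optimizer_and_scheduler(params, total_epochs=100, decay_start=50, lr=2e-4):
    """Adam with lr = 2e-4 for the first 50 epochs, then a linear decay to 0
    over the last 50 epochs (100 epochs total)."""
    optimizer = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        if epoch < decay_start:
            return 1.0  # keep the initial learning rate
        # Decays linearly, reaching 0 at epoch == total_epochs.
        return 1.0 - (epoch - decay_start) / float(total_epochs - decay_start)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```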

4.1.3. Evaluation Criteria

In the evaluation of experimental results, this paper adopts two evaluation methods: qualitative and quantitative. The qualitative evaluation mainly focuses on the visual performance of the images generated by the model, such as the thoroughness of background removal, the integrity of text retention, and the overall natural degree of the image. Through visual observation of the generated images, we can preliminarily judge whether the model can effectively extract text content from the complex background. Quantitative evaluation is more objective and quantifies the performance of the model by calculating a series of evaluation indicators, which can more accurately reflect the ability of the model to remove background and retain text. By combining the qualitative and quantitative evaluation results, we can comprehensively evaluate the performance of the improved CycleGAN network model in processing text images with complex backgrounds.
There are two criteria for quantitative evaluation, which are described in detail below.
IS, the inception score, is a common evaluation metric for GANs. It is based on two aspects of a GAN’s output: the quality (i.e., the clarity) of the generated images, and their diversity.
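For reference, the inception score is conventionally computed from the class posterior p(y|x) of a pretrained Inception network as

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim P_g}\,D_{\mathrm{KL}}\big(p(y \mid x)\,\big\|\,p(y)\big)\Big),$$

so it is high when individual samples yield confident (sharp) predictions while the marginal p(y) over all samples stays close to uniform (diverse). This standard definition is included here only for completeness.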
FID, the Fréchet inception distance, proposed by Heusel et al., is also used to evaluate GANs. It measures the quality of images generated by GAN networks and correlates with human judgments of visual quality. The closer the distribution of the generated images is to that of the real images, the lower the FID. We compute the FID between the test set and the corresponding target-domain images using the following formula:
$$\mathrm{FID}(P_r, P_g) = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\Big(C_r + C_g - 2\,(C_r C_g)^{\frac{1}{2}}\Big)$$
Here, μ_r and μ_g denote the sample means of the real and generated feature sets used in the FID calculation, and C_r and C_g denote their sample covariances.
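Given Inception feature matrices for the real and generated images, the FID above can be computed, for instance, as follows (an illustrative NumPy/SciPy sketch; feature extraction is assumed to have been done beforehand):

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_gen):
    """Compute FID from two feature matrices of shape (n_samples, dim),
    following ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    c_r = np.cov(feat_real, rowvar=False)
    c_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(c_r @ c_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(c_r + c_g - 2.0 * covmean))
```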
IS is positively correlated with image quality: the better the quality of the generated images, the higher the IS. FID is unbounded and inversely related to image quality: the lower the value, i.e., the closer the distributions of the generated and original images, the more visual properties they share.

4.2. Experimental Results

The CycleGAN generator is based on an encoder–decoder structure, including an encoder, a style converter and a decoder. UNet is a typical encoder–decoder architecture with good segmentation performance. The difference is that UNet adds skip connections that concatenate feature maps of the same size before and after the bottleneck, whereas an ordinary encoder–decoder downsamples to a low dimension and then upsamples back to the original resolution; this can preserve pixel-level details at different resolutions to a certain extent. We therefore replaced the generator with UNet and compared it against CycleGAN’s original generator. At the same time, the ASPP module was introduced into both generators for ablation experiments. The experimental data and detailed settings are described above.

4.2.1. Qualitative Evaluation

The two generators are applied to the text segmentation of scene images with complex backgrounds, and the results are compared. The qualitative results are shown in Figure 6, where the images generated by the different networks can be compared visually. The first row shows the input images, and the second row shows the segmentation results with UNet as the generator; the segmentation effect is poor, and the network cannot distinguish foreground from background. The third row shows the results after introducing the ASPP module into UNet; the improvement is not significant. Since our task is text segmentation in complex backgrounds, we believe the UNet network cannot separate background and text well when the background is overly complex. The fourth row shows CycleGAN’s original generator, which segments the complex background and foreground text well, but many noise shadows appear around the text. The fifth row shows the original generator with the ASPP module, which is our proposed approach; the noise shadows around the text are significantly reduced.
In terms of qualitative evaluation, we believe that after introducing the ASPP module, CycleGAN’s original generator handles text segmentation in complex backgrounds well. The introduction of ASPP improves both the overall quality and the details of the generated images.

4.2.2. Quantitative Evaluation

We quantitatively evaluate the generated results through the evaluation indicators IS and FID. The value of IS is positively correlated with the quality of the image. As shown in Table 1, the generation results of the two generators are compared. In our task, the original generator of CycleGAN has the highest IS value after the introduction of the ASPP module. The value of FID is negatively correlated with image quality. The results in the table show that the original generator of CycleGAN has the lowest FID value after the introduction of the ASPP module. It is noteworthy that with the introduction of the ASPP module, the value of IS increases by 45.6% and the value of FID decreases by 11.6% compared to the original generator. In other words, it improves the quality of the generated images to a great extent.
Combining the qualitative and quantitative results, the original generator with the ASPP module performs better in the task of text segmentation under complex background. At the same time, more text features and better visual effects are retained, which proves the validity of our experiment.

4.3. Experimental Effects on Other Data Sets

4.3.1. Data Preparation and Training Details

After determining the specific modules and architecture, we conduct the same experiments on the IIIT 5K-word data set [28] and the MACT data set [29].
The IIIT 5K-word data set [28] was harvested from Google image search. It consists of 5000 word images cropped from scene text and born-digital images. Of these 5000 images, 4000 are used as domain A of the training set and the remaining 1000 as domain A of the test set. In addition, 5100 synthetic images are used as domain B of the training set and 1140 as domain B of the test set.
MACT [29] is a multi-scene ancient Chinese text data set established by Kaili Wang et al., consisting of natural scene text in handwriting, calligraphy, and ancient fonts. It contains 138,935 synthetic images for training and 14,318 real scene images for testing. We select 13,000 synthetic images as domain A of the training set and 1400 real images as domain A of the test set. For the background-free text images in domain B, we collected a data set of background-free Chinese text images, including 5393 images in the Wang Xizhi font, 5405 in the Yan Zhenqing font and 2665 in the Li Shutong (Hongyi) font, for a total of 13,463 training-set domain-B images, plus 1588 images in the Zhu Da (Badashanren) font for the test-set domain B.
The detailed numbers of each data set are shown in Table 2.
The experiments are conducted on an Ascend 910 with the MindSpore framework. To facilitate training and evaluation, all images are resized to 256 × 256 pixels. The batch size is set to 1, and the network is trained for 100 epochs: the learning rate is 0.0002 for the first 50 epochs and then decays linearly to 0 over the last 50 epochs. The Adam optimizer is used throughout.

4.3.2. Qualitative Evaluation

(1) The IIIT 5K-word data set
The experimental results on the IIIT 5K-word data set are shown in Figure 7. We find that the U-Net network as a generator does not segment the background and text in scene images well, and there is no significant improvement after the ASPP module is introduced. The original CycleGAN generator can separate background and text, but the segmentation is poor, and many noise points appear around the characters, degrading the result. After the ASPP module is introduced into the original generator, the noise around the characters is effectively suppressed, and the segmentation of background and text meets our expectations.
(2) MACT data set
The experimental results on the MACT data set are shown in Figure 8. We find that, as a generator, the U-Net network does not achieve the expected results, and the segmentation remains poor even when the ASPP module is added. The original CycleGAN generator segments the background and text better, but blurred or redundant strokes appear in the details, which is unacceptable for Chinese character recognition. After the ASPP module is introduced, the blurred parts become clear and the redundant strokes are removed.
In terms of qualitative evaluation, we believe that the original generator of CycleGAN has good segmentation effects on both the IIIT 5K-word data set and the MACT data set with the addition of the ASPP module. Moreover, the generated images perform well in terms of details and overall effects.

4.3.3. Quantitative Evaluation

The quantitative evaluations on the two data sets are shown in Table 3. The table shows that both U-Net as a generator and the original generator achieve higher IS values after the ASPP module is introduced, and the original generator with the ASPP module has the highest IS value. For FID, on both data sets the original generator with the ASPP module achieves the smallest value.
Combining the qualitative and quantitative results, the original generator with the ASPP module yields good results for text segmentation in complex backgrounds, with clearer segmentation details. This indicates that our method applies broadly to scene image text segmentation tasks.

4.3.4. Synthetic to Real Adaptation

Figure 9 visualizes the translation between text images with complex backgrounds and text images without backgrounds. Figure 9a shows images from the MTWI data set, which covers dozens of font styles, font sizes ranging from a few pixels to a few hundred pixels, multiple layouts, and a variety of challenging interference backgrounds; these images come from the real Internet environment. Figure 9c shows background-free text images from the synthetic data set. Their image spaces are transformed into the opposite domains in Figure 9b,d. Our model achieves a highly practical domain transformation, which demonstrates the adaptability between the image spaces. The most obvious difference between the original and generated images is the presence or absence of a complex background; a text image without a background highlights the text information, which is exactly our goal.
The results in columns Figure 9a,b also show the adaptation of our model, trained on synthetic images, to real images: our method can convert a real text image with a complex background into a text image without a background.

5. Conclusions

In this paper, the text segmentation of scene images is transformed into an unsupervised image-to-image style transfer task. We use the style transfer model CycleGAN to segment the text in images, which eliminates the need for manually labeled, paired data sets. We introduce the ASPP module into the generator to obtain multi-scale features and thereby refine the details of the generated segmentation images. Our experiments were carried out on a synthetic data set, the IIIT 5K-word data set and the MACT data set. The improved CycleGAN network significantly improves the processing of text edge information, with the IS value increased by 45.6%, 47.5% and 3.57% and the FID value decreased by 11.6%, 47.7% and 10.7%, respectively. This validates not only the generality of our method but also that the introduction of ASPP can indeed improve the details of scene text segmentation. However, the scalability and real-time performance of our approach are not outstanding, and we will continue to study them in the future.

Author Contributions

Software, X.L. and W.G.; Investigation, X.L.; Writing—original draft, X.L. and F.Y.; Writing—review & editing, X.L. and F.Y.; Visualization, X.L. and W.G.; Supervision, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because the data set was produced by our entire research group.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jiao, S.; Li, X.; Lu, X. An improved Ostu method for image segmentation. In Proceedings of the 2006 8th international Conference on Signal Processing, Guilin, China, 16–20 November 2006; Volume 2. [Google Scholar]
  2. Jaynes, E.T. On the rationale of maximum-entropy methods. Proc. IEEE 1982, 70, 939–952. [Google Scholar] [CrossRef]
  3. Hageman, L.A.; Young, D.M. Applied Iterative Methods; Courier Corporation: North Chelmsford, MA, USA, 2012. [Google Scholar]
  4. Plenge, E.; Poot, D.H.; Bernsen, M.; Kotek, G.; Houston, G.; Wielopolski, P.; van der Weerd, L.; Niessen, W.J.; Meijering, E. Super-resolution methods in MRI: Can they improve the trade-off between resolution, signal-to-noise ratio, and acquisition time? Magn. Reson. Med. 2012, 68, 1983–1993. [Google Scholar] [CrossRef] [PubMed]
  5. Khurshid, K.; Siddiqi, I.; Faure, C.; Vincent, N. Comparison of Niblack inspired binarization methods for ancient documents. In Proceedings of the Document Recognition and Retrieval XVI; SPIE: Bellingham, WA, USA, 2009; Volume 7247, pp. 267–275. [Google Scholar]
  6. Kia, O.E.; Sanvola, J. Active multimedia documents for mobile services. In Proceedings of the 1998 IEEE Second Workshop on Multimedia Signal Processing (Cat. No. 98EX175), Redondo Beach, CA, USA, 7–9 December 1998; pp. 227–232. [Google Scholar]
  7. Ye, Q.; Gao, W.; Huang, Q. Automatic text segmentation from complex background. In Proceedings of the 2004 International Conference on Image Processing, ICIP’04, Singapore, 24–27 October 2004; Volume 5, pp. 2905–2908. [Google Scholar]
  8. Mishra, A.; Alahari, K.; Jawahar, C. An MRF model for binarization of natural scene text. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 11–16. [Google Scholar]
  9. Lee, S.; Kim, J.H. Integrating multiple character proposals for robust scene text extraction. Image Vis. Comput. 2013, 31, 823–840. [Google Scholar] [CrossRef]
  10. Mishra, A.; Alahari, K.; Jawahar, C. Top-down and bottom-up cues for scene text recognition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2687–2694. [Google Scholar]
  11. Mancas-Thillou, C.; Gosselin, B. Color text extraction with selective metric-based clustering. Comput. Vis. Image Underst. 2007, 107, 97–107. [Google Scholar] [CrossRef]
  12. Kita, K.; Wakahara, T. Binarization of color characters in scene images using k-means clustering and support vector machines. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3183–3186. [Google Scholar]
  13. Zhu, Y.; Sun, J.; Naoi, S. Recognizing natural scene characters by convolutional neural network and bimodal image enhancement. In Proceedings of the Camera-Based Document Analysis and Recognition: 4th International Workshop, CBDAR 2011, Beijing, China, 22 September 2011; pp. 69–82. [Google Scholar]
  14. Zhang, Y.; Qiu, Z.; Yao, T.; Liu, D.; Mei, T. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6810–6818. [Google Scholar]
  15. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998. [Google Scholar]
  16. Chen, Y.C.; Lin, Y.Y.; Yang, M.H.; Huang, J.B. Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1791–1800. [Google Scholar]
  17. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  18. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  19. Bonechi, S.; Bianchini, M.; Scarselli, F.; Andreini, P. Weak supervision for generating pixel–level annotations in scene text segmentation. Pattern Recognit. Lett. 2020, 138, 1–7. [Google Scholar] [CrossRef]
  20. Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv 2016, arXiv:1601.07140. [Google Scholar]
  21. Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454–1459. [Google Scholar]
  22. Wang, C.; Zhao, S.; Zhu, L.; Luo, K.; Guo, Y.; Wang, J.; Liu, S. Semi-supervised pixel-level scene text segmentation by mutually guided network. IEEE Trans. Image Process. 2021, 30, 8212–8221. [Google Scholar] [CrossRef] [PubMed]
  23. Hertzmann, A.; Jacobs, C.E.; Oliver, N.; Curless, B.; Salesin, D.H. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 12–17 August 2001; pp. 327–340. [Google Scholar]
  24. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  25. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  26. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8798–8807. [Google Scholar]
  27. Shaohui, L. A Synthetic Data Generator for Text Recognition. Available online: https://gitcode.net/mirrors/Belval/TextRecognitionDataGenerator.git (accessed on 19 May 2024).
  28. Mishra, A.; Alahari, K.; Jawahar, C. Scene text recognition using higher order language priors. In Proceedings of the BMVC-British Machine Vision Conference, BMVA, Surrey, UK, 3–7 September 2012. [Google Scholar]
  29. Wang, K.; Yi, Y.; Tang, Z.; Peng, J. Multi-scene ancient Chinese text recognition with deep coupled alignments. Appl. Soft Comput. 2021, 108, 107475. [Google Scholar] [CrossRef]
Figure 1. Bidirectional circulation of CycleGAN. A indicates the image in domain A, and B indicates the image in domain B. (a) shows the cyclic structure of CycleGAN; (b) shows the generation process of transforming an image in domain A into an image in domain B and then restoring it to an image in domain A; (c) shows the generation process in the opposite direction, starting from an image in domain B.
Figure 2. Scene text segmentation network framework.
Figure 3. ASPP module.
Figure 4. The structure of the new generator.
Figure 5. The structure of the discriminator.
Figure 6. Qualitative result presentation. We choose to change the generator into UNet for comparative experimental analysis with the original generator. The ASPP module is introduced in both the UNet and the original generator for ablation.
Figure 7. Qualitative results of the IIIT 5K-word data set.
Figure 8. Qualitative results of MACT data set.
Figure 9. Qualitative results on the MTWI data set. (a) shows data from the MTWI data set; (b) shows the data in (a) transformed into the opposite domain; (c) shows text images without backgrounds; (d) shows the data in (c) transformed into the opposite domain.
Table 1. Quantitative results.
| Method | IS | FID |
|---|---|---|
| UNet | 0.9985 | 248.1416 |
| UNet+ASPP | 2.3317 | 277.6030 |
| The original generator | 1.9735 | 176.2462 |
| Ours | 2.8725 | 155.7249 |
Table 2. Numbers of training and test sets of the data set for the experiments.
| Data Set | Training Set, Domain A | Training Set, Domain B | Test Set, Domain A | Test Set, Domain B |
|---|---|---|---|---|
| The IIIT 5K-word | 4000 | 5100 | 1000 | 1140 |
| MACT | 13,000 | 13,463 | 1400 | 1588 |
Table 3. Quantitative results of different data sets.
| Method | IIIT 5K-word: IS | IIIT 5K-word: FID | MACT: IS | MACT: FID |
|---|---|---|---|---|
| UNet | 3.0557 | 191.0428 | 1.7400 | 208.5827 |
| UNet+ASPP | 3.9968 | 198.7815 | 1.8932 | 192.9745 |
| The original generator | 2.7973 | 247.5222 | 1.9537 | 99.8899 |
| Ours | 4.1251 | 129.5635 | 2.0235 | 89.1945 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
