CapGAN: Text-to-Image Synthesis Using Capsule GANs
Abstract
1. Introduction
- Once images can be synthesized from text, modifications to a scene or an image can be made through a textual description instead of advanced photo-editing tools.
- Text-to-image synthesis can improve object classification, since a model that generates images from scratch must learn a rich internal representation of object features.
- It can streamline automated learning and art generation, for example, of animated images, clips, and movies.
- Images synthesized from text can also be used to generate labeled data for further research.
2. Background
3. Methodology
3.1. Input Sentence
3.2. Text Encoding
- Encoder Network: the model takes sentence i and generates a fixed-length representation z using a recurrent neural network (RNN).
- Previous-Decoder Network: the model takes the embedding z and attempts to reconstruct the previous sentence (i - 1) using an RNN.
- Next-Decoder Network: the model takes the embedding z and attempts to reconstruct the next sentence (i + 1) using an RNN. A minimal sketch of this encoder-decoder setup follows.
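This encoder/two-decoder arrangement is the skip-thought formulation of Kiros et al. The snippet below is an illustrative sketch of it, not the authors' implementation; the sizes (620-dimensional word embeddings, a 2400-dimensional sentence vector matching the caption-vector size listed in the experimental setup) are assumptions drawn from the skip-thought paper.

```python
import torch
import torch.nn as nn

class SkipThoughtSketch(nn.Module):
    """Sketch of a skip-thought-style text encoder (hypothetical sizes)."""

    def __init__(self, vocab_size, emb_dim=620, hid_dim=2400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.prev_decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.next_decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # shared word classifier

    def forward(self, s_i, s_prev, s_next):
        # Encoder network: sentence i -> fixed-length representation z.
        _, z = self.encoder(self.embed(s_i))
        # Previous/next decoder networks: reconstruct the neighbouring
        # sentences conditioned on z (teacher forcing on s_prev / s_next).
        h_prev, _ = self.prev_decoder(self.embed(s_prev), z)
        h_next, _ = self.next_decoder(self.embed(s_next), z)
        return z.squeeze(0), self.out(h_prev), self.out(h_next)
```

After training, only the encoder is kept: z serves as the caption embedding that conditions the generator.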
3.3. Image Generation
3.3.1. Generative Adversarial Networks
- G receives text as an input and synthesizes an image.
- D accepts a generated image, as well as sample images from the actual dataset, and returns the probability that the image is real, with 1 indicating a real image and 0 indicating a fake one. A sketch of the corresponding adversarial losses follows.
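As an illustration of this objective, the following is a minimal sketch of the two losses using sigmoid cross-entropy on logits (the loss named in the training-parameters table). `D`, `G`, and the tensor arguments are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, real_images, text_emb, z):
    """Standard conditional-GAN losses with sigmoid cross-entropy on logits."""
    fake_images = G(z, text_emb)

    # Discriminator: real (image, text) pairs -> 1, synthesized pairs -> 0.
    real_logits = D(real_images, text_emb)
    fake_logits = D(fake_images.detach(), text_emb)
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )

    # Generator: try to make D score the synthesized images as real.
    g_logits = D(fake_images, text_emb)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```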
3.3.2. Generator (G)
3.4. Image Discrimination
3.4.1. Capsule Network
- The length of the capsule's output vector represents the probability that the entity exists.
- The orientation of the output vector represents the instantiation parameters of that entity. The squash non-linearity that produces this encoding is sketched below.
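The squash function from Sabour et al. realizes exactly this encoding: it preserves the direction of the input vector while compressing its length into [0, 1). A direct implementation:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """v = (||s||^2 / (1 + ||s||^2)) * s / ||s|| (Sabour et al.).

    Keeps the direction of s (the instantiation parameters) while
    mapping its length into [0, 1) (the existence probability).
    """
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)
```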
3.4.2. Discriminator (D)
- Real images with real text.
- Synthesized (fake) images with random text.
4. Results
4.1. Experimental Setup
4.2. Evaluation Metric
4.2.1. Inception Score
- Saliency: Saliency indicates that objects in an image should be recognizable. Given an image x, a classifier's predicted label y should have high probability; that is, the conditional distribution p(y|x) should be sharply peaked, and its entropy correspondingly low.
- Diversity: Diversity indicates variety across the generated images. The marginal distribution p(y), obtained by averaging p(y|x) over all generated images, should be close to uniform over classes, and its entropy therefore high. A short computation sketch follows this list.
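Combining the two criteria gives IS = exp(E_x[KL(p(y|x) || p(y))]). The sketch below computes this from Inception-v3 softmax outputs; the split-into-folds procedure behind the mean ± standard deviation figures in the results tables is omitted for brevity.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ).

    probs: (N, num_classes) softmax outputs of Inception-v3 for N
    generated images. Peaked rows reward saliency; a near-uniform
    column mean (the marginal p(y)) rewards diversity.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```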
4.2.2. Fréchet Inception Distance
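FID compares the Gaussian statistics of Inception-v3 features extracted from real and generated images: FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2(S_r S_g)^(1/2)), lower being better. A minimal sketch, assuming (N, 2048) pool3 feature arrays as input:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_fake):
    """FID between two sets of Inception-v3 activations (lower is better)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```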
4.3. Statistical Results
4.4. Visual Results
4.5. Comparative Results
5. Discussion
5.1. Multimodality Preservance
5.2. Synthesis of Global Coherent Structures
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative Adversarial Text-to-Image Synthesis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Dash, A.; Gamboa, J.C.B.; Ahmed, S.; Liwicki, M.; Afzal, M.Z. TAC-GAN: Text conditioned auxiliary classifier generative adversarial network. arXiv 2017, arXiv:1703.06412. [Google Scholar]
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Huang, X.; Wang, X.; Metaxas, D. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Dong, H.; Zhang, J.; McIlwraith, D.; Guo, Y. I2T2I: Learning text to image synthesis with textual data augmentation. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017. [Google Scholar]
- Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. In Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Dai, A.M.; Le, Q.V. Semi-supervised sequence learning. In Proceedings of the 2015 Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–10 December 2015. [Google Scholar]
- Nilsback, M.E.; Zisserman, A. Automated flower classification over a large number of classes. In Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16–19 December 2008. [Google Scholar]
- Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; Perona, P. Caltech-UCSD Birds 200; Technical Report CNS-TR-2010-001; California Institute of Technology: Pasadena, CA, USA, 2010. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Zhu, X.; Goldberg, A.B.; Eldawy, M.; Dyer, C.R.; Strock, B. A text-to-picture synthesis system for augmenting communication. In Proceedings of the AAAI 2007, Vancouver, BC, Canada, 22–26 July 2007; Volume 7, pp. 1590–1595. [Google Scholar]
- Zhang, Z.; Xie, Y.; Yang, L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Chen, Q.; Koltun, V. Photographic Image Synthesis with Cascaded Refinement Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Sangkloy, P.; Lu, J.; Fang, C.; Yu, F.; Hays, J. Scribbler: Controlling Deep Image Synthesis With Sketch and Color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Nie, D.; Trullo, R.; Lian, J.; Petitjean, C.; Ruan, S.; Wang, Q.; Shen, D. Medical image synthesis with context-aware generative adversarial networks. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Quebec City, QC, Canada, 11–13 September 2017. [Google Scholar]
- Dong, H.; Yu, S.; Wu, C.; Guo, Y. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation With Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Liang, X.; Lee, L.; Dai, W.; Xing, E.P. Dual motion GAN for future-flow embedded video prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the CVPR 2017, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1947–1962. [Google Scholar] [CrossRef] [PubMed]
- Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv 2017, arXiv:1711.10485. [Google Scholar]
- Afshar, P.; Mohammadi, A.; Plataniotis, K.N. Brain Tumor Type Classification via Capsule Networks. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018. [Google Scholar]
- Lukic, V.; Brüggen, M.; Mingo, B.; Croston, J.H.; Kasieczka, G.; Best, P.N. Morphological classification of radio galaxies: Capsule networks versus convolutional neural networks. Mon. Not. R. Astron. Soc. 2019, 487, 1729–1744. [Google Scholar] [CrossRef]
- Hilton, C.; Parameswaran, S.; Dotter, M.; Ward, C.M.; Harguess, J. Classification of maritime vessels using capsule networks. In Geospatial Informatics IX; SPIE: Bellingham, WA, USA, 2019; Volume 10992, pp. 87–93. [Google Scholar]
- Bass, C.; Dai, T.; Billot, B.; Arulkumaran, K.; Creswell, A.; Clopath, C.; De Paola, V.; Bharath, A.A. Image synthesis with a convolutional capsule generative adversarial network. In Proceedings of the International Conference on Medical Imaging with Deep Learning, London, UK, 8–10 July 2019. [Google Scholar]
- Jaiswal, A.; AbdAlmageed, W.; Wu, Y.; Natarajan, P. CapsuleGAN: Generative adversarial capsule network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Upadhyay, Y.; Schrater, P. Generative adversarial network architectures for image synthesis using capsule networks. arXiv 2018, arXiv:1806.03796. [Google Scholar]
- Kiros, R.; Zhu, Y.; Salakhutdinov, R.R.; Zemel, R.; Urtasun, R.; Torralba, A.; Fidler, S. Skip-thought vectors. In Proceedings of the 2015 Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–10 December 2015. [Google Scholar]
- Xie, J.; Yu, F.R.; Huang, T.; Xie, R.; Liu, J.; Wang, C.; Liu, Y. A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges. IEEE Commun. Surv. Tutorials 2019, 21, 393–430. [Google Scholar] [CrossRef]
- Nguyen, T.; Vu, P.; Pham, H.; Nguyen, T. Deep learning UI design patterns of mobile apps. In Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering: New Ideas and Emerging Technologies Results (ICSE-NIER), Gothenburg, Sweden, 27 May–3 June 2018. [Google Scholar]
- Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming auto-encoders. In Proceedings of the International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011. [Google Scholar]
- Hinton, G.E.; Sabour, S.; Frosst, N. Matrix capsules with EM routing. In Proceedings of the 6th International Conference on Learning Representations ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Li, S.; Ren, X.; Yang, L. Fully CapsNet for semantic segmentation. In Proceedings of the First Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2018), Guangzhou, China, 23–26 November 2018. [Google Scholar]
- Nair, P.; Doshi, R.; Keselj, S. Pushing the limits of capsule networks. arXiv 2021, arXiv:2103.08074. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Barratt, S.; Sharma, R.K. A Note on the Inception Score. arXiv 2018, arXiv:1801.01973. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65. [Google Scholar] [CrossRef]
Text | Generated Images |
---|---|
This flower has long red petals with black center. | |
A water flower with light yellow petals and yellow pistils in the center. | |
This flower has purple petals and a long stigma. | |
This flower has rounded white petals which form a bright yellow shape in the center. | |
This bird is dark grey in color and has long wings and a black downward-curved beak. | |
The bird is a royal blue with black accents on the wings, tail and beak. | |
White smiling dog. |
Operation | Capsule | Neuron
---|---|---
Input | vector (u_i) | scalar (x_i)
Linear/Affine Transformation | û_j\|i = W_ij·u_i | a_ji = w_ij·x_i + b_j
Weighting/Summation | s_j = Σ_i c_ij·û_j\|i | z_j = Σ_i a_ji
Activation Function | v_j = squash(s_j) | h_w,b(x) = f(z_j)
Output | vector (v_j) | scalar (h)
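Read column-wise, the capsule side of this table is one routing-by-agreement pass from Sabour et al. The sketch below implements it directly, reusing the squash() helper from the Section 3.4.1 sketch; the shapes and the three routing iterations are the usual defaults, assumed here rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def capsule_forward(u, W, num_iters=3):
    """One capsule layer with dynamic routing (illustrative shapes).

    u: (batch, n_in, d_in) input capsule vectors u_i
    W: (n_in, n_out, d_out, d_in) transformation matrices W_ij
    Uses squash() as defined in the Section 3.4.1 sketch.
    """
    # Linear/affine transformation: û_j|i = W_ij · u_i
    u_hat = torch.einsum('iokd,bid->biok', W, u)

    # Routing logits b_ij, refined by agreement over num_iters rounds.
    b = torch.zeros(u.size(0), u.size(1), W.size(1), device=u.device)
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                          # coupling coefficients c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)         # weighting/summation: s_j
        v = squash(s)                                    # activation: v_j = squash(s_j)
        b = b + torch.einsum('biok,bok->bio', u_hat, v)  # agreement update
    return v                                             # output vectors v_j
```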
Layer Number | Layer Type | Input Size | Filters | Kernel Size | Strides | Activation Function | Output |
---|---|---|---|---|---|---|---|
1 | Convolutional Layer | 64 × 64 × 3 | 64 | [5, 5] | [2, 2] | LeakyReLU | 32 × 32 × 64 |
2 | Convolutional Layer | 32 × 32 × 64 | 128 | [5, 5] | [2, 2] | LeakyReLU | 16 × 16 × 128 |
3 | Convolutional Layer | 16 × 16 × 128 | 256 | [5, 5] | [2, 2] | LeakyReLU | 8 × 8 × 256 |
4 | Convolutional Layer | 8 × 8 × 256 | 512 | [5, 5] | [2, 2] | LeakyReLU | 4 × 4 × 512 |
5 | Capsule Layer | 4 × 4 × 768 | 512 | [1, 1] | [1, 1] | LeakyReLU + Squashing Function | 4 × 4 × 512 × 4 × 1 |
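Assembling the table into code, the following is one plausible PyTorch skeleton of the discriminator; it is an interpretation, not the released implementation. In particular, the jump from 512 to 768 input channels at the capsule layer is assumed to come from concatenating the spatially replicated 256-dimensional text embedding onto the 4 × 4 × 512 feature map, and squash() is again the helper from the Section 3.4.1 sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # Rows 1-4 of the table: 5x5 convolution, stride 2, LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=2),
        nn.LeakyReLU(0.2),
    )

class CapGANDiscriminatorSketch(nn.Module):
    def __init__(self, text_dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            conv_block(3, 64),     # 64x64x3   -> 32x32x64
            conv_block(64, 128),   # 32x32x64  -> 16x16x128
            conv_block(128, 256),  # 16x16x128 -> 8x8x256
            conv_block(256, 512),  # 8x8x256   -> 4x4x512
        )
        # Row 5: 1x1, stride-1 projection into 512 capsules of 4 dimensions.
        self.caps = nn.Conv2d(512 + text_dim, 512 * 4, kernel_size=1, stride=1)

    def forward(self, img, text_emb):
        h = self.convs(img)                                  # (B, 512, 4, 4)
        t = text_emb.view(-1, text_emb.size(1), 1, 1).expand(-1, -1, 4, 4)
        h = torch.cat([h, t], dim=1)                         # (B, 768, 4, 4)
        u = F.leaky_relu(self.caps(h), 0.2)                  # (B, 2048, 4, 4)
        u = u.view(-1, 512, 4, 4, 4).permute(0, 3, 4, 1, 2)  # (B, 4, 4, 512, 4)
        return squash(u)                                     # table: 4x4x512x4(x1)
```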
Dataset | Total Number of Images | Total Categories | Number of Captions per Image |
---|---|---|---|
Oxford-102—Flower [8] | 8189 | 102 | 10 |
Caltech-UCSD Birds 200 [9] | 6033 | 200 | 5 |
ImageNet Dogs [10] | 4002 | 25 | 1 |
Parameters | Value |
---|---|
Batch Size | 32 |
Epochs | 100 |
Input Image Size | 64 × 64 × 3 |
Generated Image Size | 64 × 64 × 3 |
Horizontal Resolution | 96 dpi |
Vertical Resolution | 96 dpi |
Bit Depth | 24 |
Noise Vector Dimension | 100
Text Embedding Dimension | 256
Caption Vector Dimension | 2400
Learning Rate | 0.0002
Momentum for Adam Update (β1) | 0.5
Capsule Vector | [4, 1] |
Generator Loss | Sigmoid Cross Entropy Given Logits |
Discriminator Loss | Sigmoid Cross Entropy Given Logits |
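For reference, a hedged sketch of the optimizer setup these parameters imply: Adam with learning rate 0.0002 and β1 = 0.5, the common DCGAN-style configuration. The networks below are stand-ins, not the CapGAN architecture.

```python
import torch
import torch.nn as nn

# Stand-in modules; the actual generator/discriminator are described in Section 3.
generator = nn.Linear(100 + 256, 64 * 64 * 3)   # noise (100) + text (256) -> image
discriminator = nn.Linear(64 * 64 * 3, 1)

opt_g = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))
```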
Dataset | IS ↑ | FID ↓
---|---|---
Oxford-102—Flower | 4.05 ± 0.050 | 47.38 |
Caltech-UCSD Birds 200 | 4.61 ± 0.1 | 14.98 |
ImageNet Dogs | 13.11 ± 0.407 | 38.18 |
Input Text | Epoch | Examples of Generated Images
---|---|---
This flower has petals that are yellow and has black stamen. | 25 |
 | 100 |
The pretty flower has a lot of short blue petals. | 25 |
 | 100 |
The flower has petals that are yellow with orange spots. | 25 |
 | 100 |
This flower is pink in color, with petals that are wavy and bunched together. | 25 |
 | 100 |
This flower has petals that are white with a yellow center. | 25 |
 | 100 |
This flower is red and tan in color, with petals that are spotted. | 25 |
 | 100 |
A flower with light yellow petals and yellow pistils in the center. | 25 |
 | 100 |
Model | Metric | Oxford-102—Flower | Caltech-UCSD Birds 200 | ImageNet Dogs
---|---|---|---|---
GAN [1] | IS ↑ | 2.66 ± 0.03 | 2.78 ± 0.1 | 6.81 ± 0.76
 | FID ↓ | 76.98 | 53.89 | 98.01
StackGAN [4] | IS ↑ | 3.20 ± 0.01 | 3.70 ± 0.04 | 8.84 ± 0.08
 | FID ↓ | 55.28 | 51.89 | 89.21
StackGAN++ [21] | IS ↑ | 3.26 ± 0.01 | 4.04 ± 0.5 | 9.55 ± 0.11
 | FID ↓ | 48.68 | 15.30 | 44.54
TAC-GAN [3] | IS ↑ | 3.45 ± 0.05 | - | -
 | FID ↓ | - | - | -
CapGAN | IS ↑ | 4.05 ± 0.050 | 4.12 ± 0.023 | 11.35 ± 0.11
 | FID ↓ | 44.38 | 11.89 | 34.36
Text | Model | Sample of Images Generated
---|---|---
This flower has long yellow petals that are curved down and a black center with black anthers on it. | GAN |
 | CapGAN |
This flower is white and yellow in color, and has petals that are yellow near the center. | GAN |
 | CapGAN |
This flower is pink in color, and has petals that are oddly shaped and vertically layered. | GAN |
 | CapGAN |
This is a bird with grey wings, a white neck and a black beak. | GAN |
 | CapGAN |
This bird is red in color, with black wings. | GAN |
 | CapGAN |
This particular bird has a belly that is gray and yellow. | GAN |
 | CapGAN |
Text | Ground Truth | Generated Images Using CapGAN |
---|---|---|
This flower has a white petal with a yellow center. | ||
This flower has red petals with white center. | ||
This flower has a yellow petal with orange spots. | ||
This flower has pink petals with a pink center. | ||
This bird is yellow and black in color, with a long black beak. | ||
This particular bird has a belly that is gray and white. | ||
This is a brown and beige bird with brown on the crown. | |
White Shih-Tzu |