#PraCegoVer: A Large Dataset for Image Captioning in Portuguese
Abstract
1. Summary
- We introduced the first dataset for image captioning with captions in Portuguese. We hope the #PraCegoVer dataset encourages further work on the automatic generation of descriptions in Portuguese, and we also intend to contribute to the community of blind Portuguese speakers.
- We developed an end-to-end framework for collecting, preprocessing, and analyzing data gathered from a hashtag on Instagram, which is helpful for social media studies (Section 4). In addition, we carried out a thorough exploratory analysis to identify the most significant image classes and topics within the captions (a minimal sketch of the description-extraction step appears after this list).
- We proposed an algorithm that clusters duplicate posts based on visual and textual information, so that instances with near-identical content can be removed.
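To illustrate the preprocessing step that recovers the caption from a raw post, here is a minimal Python sketch. It assumes the audio description is the text written after the #PraCegoVer tag, and the helper name `extract_description` and the regex-based splitting are our own illustrative choices rather than the released implementation.

```python
import re
from typing import Optional

def extract_description(post_text: str) -> Optional[str]:
    """Return the text that follows the #PraCegoVer tag in a post caption.

    Assumption: the audio description is appended right after the hashtag,
    which is the convention promoted by the #PraCegoVer movement.
    """
    match = re.search(r"#pracegover\b(.*)", post_text, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    # Strip leading punctuation such as ":" and surrounding whitespace.
    description = match.group(1).strip(" \n\t:.-")
    return description or None

post = "Fim de tarde! #PraCegoVer: foto de um pôr do sol alaranjado sobre o mar."
print(extract_description(post))  # foto de um pôr do sol alaranjado sobre o mar
```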
2. Related Work
3. Data Records
4. Method
4.1. Data Collection
4.2. Duplication Detection and Clustering
4.2.1. Duplications
Algorithm 1: Clustering duplications

Require: number of posts n, the visual and textual distance matrices, and the corresponding duplication thresholds.

1. For every pair of posts (i, j), mark the pair as a duplication whenever both its visual distance and its textual distance are below the respective thresholds.
2. Traverse the graph induced by the marked pairs, grouping posts that are connected, directly or transitively, into the same set.
3. Return a list with the sets of duplications clustered.
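As a concrete illustration of Algorithm 1, the following Python sketch marks a pair of posts as a duplication when both pairwise distances fall below their thresholds and then groups connected posts with a depth-first traversal. The function name, the nested-loop comparison, and the traversal strategy are assumptions made for this sketch, not the exact released implementation.

```python
def cluster_duplications(n, dist_visual, dist_textual, t_visual, t_textual):
    """Cluster duplicate posts from pairwise visual and textual distances.

    A pair (i, j) is considered a duplication when BOTH distances are below
    their thresholds; posts linked directly or transitively end up together.
    """
    # Adjacency lists of candidate duplications.
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist_visual[i][j] <= t_visual and dist_textual[i][j] <= t_textual:
                neighbors[i].append(j)
                neighbors[j].append(i)

    # Each connected component of the duplication graph is one cluster.
    visited = [False] * n
    clusters = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for adjacent in neighbors[node]:
                if not visited[adjacent]:
                    visited[adjacent] = True
                    stack.append(adjacent)
        clusters.append(component)
    return clusters  # a list with sets of duplications clustered
```

With this grouping in hand, one representative per cluster can be kept and the remaining near-duplicates discarded.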
4.2.2. Duplication Clustering
4.3. Preprocessing and Data Analysis
4.3.1. Audio Description Processing
4.3.2. Image Processing
4.3.3. Image Clustering
4.3.4. Dataset Split
5. Technical Validation
5.1. Dataset Statistics
5.2. Visualization
5.3. Comparative Analysis
5.4. Experiments
5.4.1. Results and Analysis
5.4.2. Qualitative Analysis
6. Usage Notes
6.1. Motivation
6.1.1. For What Purpose Was the Dataset Created?
6.1.2. Who Created the Dataset?
6.1.3. Who Funded the Creation of the Dataset?
6.2. Composition
6.2.1. What Do the Instances That Comprise the Dataset Represent?
6.2.2. How Many Instances Are There in Total?
6.2.3. What Data Does Each Instance Consist of?
6.2.4. Is There a Label or Target Associated with Each Instance?
6.2.5. Are There Recommended Data Splits?
6.2.6. Are There Any Errors, Sources of Noise, or Redundancies in the Dataset?
6.2.7. Does the Dataset Contain Data That Might Be Considered Confidential?
6.2.8. Does the Dataset Contain Data That, if Viewed Directly, Might Be Offensive, Insulting, Threatening, or Might Otherwise Cause Anxiety?
6.2.9. Is It Possible to Identify Individuals Either Directly or Indirectly?
6.2.10. Does the Dataset Contain Data That Might Be Considered Sensitive in Any Way?
6.3. Collection Process
6.3.1. How Was the Data Associated with Each Instance Acquired?
6.3.2. What Mechanisms or Procedures Were Used to Collect the Data?
6.3.3. Who Was Involved in the Data Collection Process and How Were They Compensated?
6.3.4. Over What Time-Frame Was the Data Collected? Does This Time-Frame Match the Creation Time-Frame of the Data Associated with the Instances?
6.3.5. Were Any Ethical Review Processes Conducted?
6.3.6. Did You Collect the Data from the Individuals in Question Directly, or Obtain It via Third Parties or Other Sources?
6.3.7. Were the Individuals in Question Notified about the Data Collection?
6.3.8. Did the Individuals in Question Consent to the Collection and Use of Their Data?
6.3.9. Has an Analysis of the Potential Impact of the Dataset and Its Use on Data Subjects Been Conducted?
6.4. Preprocessing
6.4.1. Was Any Preprocessing of the Data Done?
6.4.2. Was the “Raw” Data Saved in Addition to the Preprocessed Data?
6.4.3. Is the Software Used to Preprocess the Instances Available?
6.5. Uses
6.5.1. Has the Dataset Been Used for Any Tasks Already?
6.5.2. Is There a Repository That Links to Any or All Papers or Systems That Use the Dataset?
6.5.3. What Other Tasks Could the Dataset Be Used for?
6.5.4. Is There Anything about the Composition of the Dataset or the Way It Was Collected and Preprocessed That Might Impact Future Uses?
6.6. Distribution
6.6.1. Will the Dataset Be Distributed to Third Parties Outside of the Entity on Behalf of Which the Dataset Was Created?
6.6.2. How Will the Dataset Be Distributed?
6.6.3. When Will the Dataset Be Distributed?
6.6.4. Will the Dataset Be Distributed under a Copyright or Other Intellectual Property License, and/or under Applicable Terms of Use?
6.6.5. Do Any Export Controls or Other Regulatory Restrictions Apply to the Dataset or to Individual Instances?
6.7. Maintenance
6.7.1. Who Is Supporting/Maintaining the Dataset?
6.7.2. How Can the Curator of the Dataset Be Contacted?
6.7.3. Will the Dataset Be Updated?
6.7.4. Will Older Versions of the Dataset Continue to Be Maintained?
6.7.5. If Others Want to Contribute to the Dataset, Is There a Mechanism for Them to Do So?
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
1. Hereinafter, we use caption and description interchangeably.
2. https://github.com/larocs/PraCegoVer, accessed on 14 January 2022.
3.
4. The opinions expressed in this work do not necessarily reflect those of the funding agencies.
5. Resolution no. 510/2016 (in Portuguese): http://conselho.saude.gov.br/resolucoes/2016/Reso510.pdf, accessed on 14 January 2022.
| Dataset | Dataset Size | Train Size | Validation Size | Test Size | Vocabulary Size | Avg. Sent. Length | Std. Sent. Length |
|---|---|---|---|---|---|---|---|
| MS COCO | 123,287 | 113,287 | 5000 | 5000 | 13,508 | 10.6 | 2.2 |
| #PraCegoVer-63K | 62,935 | 37,881 | 12,442 | 12,612 | 55,029 | 37.8 | 26.8 |
| #PraCegoVer-173K | 173,337 | 104,004 | 34,452 | 34,882 | 93,085 | 39.3 | 29.7 |
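For reference, the vocabulary size and sentence-length columns above can be reproduced, at least approximately, with a few lines of Python. The whitespace tokenization and the function name below are assumptions of this sketch, not necessarily the procedure used to build the table.

```python
from statistics import mean, pstdev

def caption_statistics(captions):
    """Vocabulary size and sentence-length statistics for a list of captions."""
    vocabulary = set()
    lengths = []
    for caption in captions:
        tokens = caption.lower().split()  # naive whitespace tokenization
        vocabulary.update(tokens)
        lengths.append(len(tokens))
    return {
        "vocabulary_size": len(vocabulary),
        "avg_sentence_length": mean(lengths),
        "std_sentence_length": pstdev(lengths),
    }
```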
| Dataset | CIDEr-D | ROUGE-L | METEOR | BLEU-4 |
|---|---|---|---|---|
| MS COCO Captions | 120.5 ± 0.3 | 57.5 ± 0.2 | 27.7 ± 0.0 | 36.5 ± 0.1 |
| #PraCegoVer-63K | 4.7 ± 0.7 | 14.5 ± 0.4 | 7.1 ± 0.1 | 1.6 ± 0.2 |
| #PraCegoVer-173K | 3.0 ± 0.2 | 12.9 ± 0.2 | 5.3 ± 0.1 | 0.9 ± 0.0 |
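The scores above are CIDEr-D, ROUGE-L, METEOR, and BLEU-4. As a hedged sketch of how such metrics are commonly computed, the snippet below uses the pycocoevalcap package (assuming it is installed, that a Java runtime is available for METEOR, and that its Cider scorer corresponds to the CIDEr-D variant reported in the table); the evaluation script actually used for these results may differ.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def evaluate_captions(references, hypotheses):
    """Score generated captions against reference captions.

    Both arguments map an image id to a list of caption strings
    (hypotheses usually contain a single generated caption per image).
    """
    scores = {}
    bleu, _ = Bleu(4).compute_score(references, hypotheses)
    scores["BLEU-4"] = bleu[3]  # fourth element is the 4-gram BLEU score
    scores["ROUGE-L"], _ = Rouge().compute_score(references, hypotheses)
    scores["METEOR"], _ = Meteor().compute_score(references, hypotheses)  # needs Java
    scores["CIDEr-D"], _ = Cider().compute_score(references, hypotheses)
    return scores
```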
| Words | #Occurrences | Ranking |
|---|---|---|
| idiota | 143 | 28,520 |
| puta | 93 | 30,246 |
| trouxa | 62 | 39,845 |
| viado | 61 | 40,780 |
| caralho | 29 | 56,239 |
| retardado(s)/retardada(s) | 40 | 57,707 |
| imbecil | 49 | 68,800 |
| quenga | 14 | 82,439 |
| escroto(s)/escrota(s) | 25 | 82,865 |
| mulato(s)/mulata(s) | 51 | 85,614 |
| sapatona | 11 | 92,921 |
| xana | 10 | 96,428 |
| vadia | 10 | 107,233 |
| Words | #Occurrences | Ranking |
|---|---|---|
| gordo | 467 | 18,145 |
| gordos | 26 | 62,002 |
| gorda | 806 | 8579 |
| gordas | 176 | 27,263 |
| magro | 286 | 20,161 |
| magros | 20 | 68,425 |
| magra | 410 | 17,853 |
| magras | 96 | 30,195 |
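The two tables above list raw occurrence counts and frequency ranks for selected words found in the captions. A small sketch of how such counts and ranks could be obtained from the corpus follows; the whitespace tokenization and the helper name are illustrative assumptions, and the tie-breaking of ranks may differ from the procedure behind the tables.

```python
from collections import Counter

def occurrences_and_ranking(captions, target_words):
    """Occurrence count and frequency rank (1 = most frequent) for target words."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    # Rank of each vocabulary word when sorted by descending frequency.
    ranking = {word: pos + 1 for pos, (word, _) in enumerate(counts.most_common())}
    return {word: (counts.get(word, 0), ranking.get(word)) for word in target_words}
```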
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).