Transformative Fusion: Vision Transformers and GPT-2 Unleashing New Frontiers in Image Captioning within Image Processing

Authors

  • Indrani Vasireddy, Associate Professor, Department of Computer Science and Engineering, Geethanjali College of Engineering, Hyderabad, India
  • G HimaBindu, Assistant Professor, Department of Computer Science and Engineering, Geethanjali College of Engineering, Hyderabad, India
  • B Ratnamala, Assistant Professor, Department of Computer Science and Engineering, Geethanjali College of Engineering, Hyderabad, India

Keywords:

Image Caption Generator, Vision Transformers (ViT), GPT-2, Computer Vision, Natural Language Processing

Abstract

In the ever-evolving digital landscape, this paper presents an innovative Image Caption Generator that seamlessly merges Vision Transformers (ViT) and GPT-2. Combining the strengths of computer vision and natural language processing (NLP), the system extracts salient image features using ViT and generates contextual, human-like descriptions through GPT-2. The resulting system offers an intuitive interface that lets users effortlessly obtain coherent captions for uploaded images. This groundbreaking technology holds immense potential for the visually impaired community, enhancing the accessibility of image-based content and the overall user experience.

The primary objective of this work is to develop a system that automates the generation of descriptive, coherent textual captions for images. This involves integrating computer vision and NLP techniques so that the system can analyze the content of an image and produce relevant, meaningful textual descriptions. The broader goals are to improve the accessibility of visual content, enhance image search capabilities, and facilitate applications such as automated content tagging. The work also addresses the needs of visually impaired individuals by providing assistive technology that interprets and communicates image content effectively.

This paper exemplifies the symbiotic relationship between computer vision and NLP, illustrating how their integration can pave the way for transformative AI applications. The resulting synergy not only contributes to the development of advanced image captioning systems but also opens avenues for innovative applications across diverse domains. The conference presentation will delve into the technical aspects of our approach, showcasing the significance of this integration and its potential impact on the future of AI applications.
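
As a concrete illustration of the ViT + GPT-2 pipeline summarized above, the Python sketch below wires a pretrained ViT encoder to a GPT-2 decoder via the Hugging Face Transformers library. The checkpoint name (nlpconnect/vit-gpt2-image-captioning), the generation parameters, and the caption() helper are illustrative assumptions, not the exact configuration reported in the paper.

    # Minimal ViT + GPT-2 captioning sketch; the checkpoint below is an assumed
    # public model, not necessarily the one trained by the authors.
    from PIL import Image
    from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

    model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
    processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
    tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

    def caption(image_path: str) -> str:
        # Preprocess: resize and normalize the image into the tensor format ViT expects.
        image = Image.open(image_path).convert("RGB")
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        # Generate: the ViT encoder embeds image patches, and the GPT-2 decoder
        # autoregressively produces caption tokens conditioned on those features.
        output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    print(caption("example.jpg"))  # e.g. "a dog running through a grassy field"

Architecturally, the ViT encoder emits a sequence of patch embeddings and the GPT-2 decoder attends to them through cross-attention, so captioning reduces to ordinary autoregressive text generation conditioned on the image.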

Published

2023-12-30

How to Cite

Transformative Fusion: Vision Transformers and GPT-2 Unleashing New Frontiers in Image Captioning within Image Processing. (2023). International Journal of Innovative Research in Engineering & Management, 10(6), 55–59. Retrieved from https://acspublisher.com/journals/index.php/ijirem/article/view/12290