Caption Generation of Images Using CNN and LSTM

Authors

  • Ummar Yousuf, M.Tech Student, Department of Electronics and Communication Engineering, RIMT University, Mandi Gobindgarh, Punjab, India
  • Ravinder Pal Singh, Technical Head, Department of Research, Innovation and Incubation, RIMT University, Mandi Gobindgarh, Punjab, India
  • Monika Mehra, Head of Department, Department of Electronics and Communication Engineering, RIMT University, Punjab, India

Keywords:

Long Short-Term Memory (LSTM), Deep Learning, Neural Network, Image, Caption, Description

Abstract

Automatically describing the contents of an image is a task in Artificial Intelligence (AI) that combines computer vision and Natural Language Processing (NLP). This work develops a generative neural model, drawing on techniques from computer vision and machine translation, to produce natural-sounding sentences that describe a picture. The model combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN): the CNN extracts features from the image, and the RNN generates the sentence. The model is trained so that, given an input image, it creates captions that describe the image almost accurately. The model's accuracy, fluency, and command of the language learned from image descriptions are assessed on various datasets. These experiments show that the model typically produces correct descriptions of the input image.
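For concreteness, the sketch below shows a minimal CNN-encoder / LSTM-decoder of the kind the abstract describes, written in Python with TensorFlow/Keras. The layer sizes, the 2048-dimensional image-feature input (typical of a pretrained CNN's pooling layer), and the merge-by-addition step are illustrative assumptions, not the paper's exact configuration.

from tensorflow.keras import layers, Model

vocab_size, max_len, embed_dim = 5000, 34, 256  # assumed hyperparameters

# Image branch: a precomputed feature vector from a pretrained CNN
# (2048-d is an assumption), projected into the embedding space.
img_in = layers.Input(shape=(2048,))
img_vec = layers.Dense(embed_dim, activation="relu")(layers.Dropout(0.5)(img_in))

# Text branch: the partial caption so far, as word indices, embedded
# and summarized by an LSTM.
cap_in = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(cap_in)
txt_vec = layers.LSTM(embed_dim)(layers.Dropout(0.5)(emb))

# Fuse the two modalities and predict a distribution over the next word.
merged = layers.add([img_vec, txt_vec])
hidden = layers.Dense(embed_dim, activation="relu")(merged)
out = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_in, cap_in], outputs=out)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

At inference time, a caption is generated one word at a time: starting from a start token, the predicted word is appended to the input sequence and the model is run again, until an end token is produced or max_len is reached.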

Published

2022-02-03

How to Cite

Yousuf, U., Singh, R. P., & Mehra, M. (2022). Caption Generation of Images Using CNN and LSTM. International Journal of Innovative Research in Engineering & Management, 9(1), 1–5. Retrieved from https://acspublisher.com/journals/index.php/ijirem/article/view/11228