Caption Generation of Images Using CNN and LSTM
Keywords:
Long Short Term Memory (LSTM), Deep Learning, Neural Network, Image, Caption, Description

Abstract
Automatically describing the contents of an image is a task in Artificial Intelligence (AI) that combines computer vision and natural language processing (NLP). We develop a generative neural model, drawing on techniques from computer vision and machine translation, that produces natural-sounding sentences describing an image. The model combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN): the CNN extracts features from the image, and the RNN generates the sentence. The model is trained so that, given an input image, it produces captions that describe the image almost exactly. The model's accuracy, and the fluency and command of language it learns from image descriptions, are evaluated on several datasets. These experiments show that the model typically produces accurate descriptions of the input image.
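To make the encoder-decoder design concrete, the following is a minimal sketch of a CNN-plus-LSTM captioning model in Keras/TensorFlow. It assumes pre-extracted 2048-dimensional image features from a pretrained CNN and an LSTM as the RNN; the vocabulary size, caption length, and layer widths are illustrative assumptions, not values taken from the paper.

import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumed vocabulary size
MAX_LEN = 34        # assumed maximum caption length
EMBED_DIM = 256     # assumed embedding / hidden size

# Image branch: pre-extracted CNN features -> dense embedding
img_input = layers.Input(shape=(2048,))
img_emb = layers.Dense(EMBED_DIM, activation="relu")(
    layers.Dropout(0.5)(img_input))

# Text branch: partial caption (token ids) -> LSTM summary vector
txt_input = layers.Input(shape=(MAX_LEN,))
txt_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_input)
txt_feat = layers.LSTM(256)(layers.Dropout(0.5)(txt_emb))

# Merge the two branches and predict the next word of the caption
merged = layers.add([img_emb, txt_feat])
hidden = layers.Dense(256, activation="relu")(merged)
output = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time, such a model is run word by word: the caption is seeded with a start token, the predicted next word is appended, and the loop repeats until an end token or MAX_LEN is reached.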