Int J Performability Eng ›› 2024, Vol. 20 ›› Issue (4): 214-223.doi: 10.23940/ijpe.24.04.p3.214223


Video Captioning Based on Graph Neural Network Made from Action Knowledge and Object Features

Prashant Kaushik*, Vikas Saxena, and Amarjeet Prajapati   

  1. Department of CSE&IT, Jaypee Institute of Information Technology, Uttar Pradesh, India
  • Contact: * E-mail address: jiitprashant@gmail.com

Abstract: Encoder-decoder-based video captioning produces a holistic description of a video as learned from the training data, but such captions miss the motion-specific features of individual objects. Knowledge of object motion in a video enables object-oriented captions; similarly, action-knowledge-based models enable action-based features. The traditional encoder-decoder method relies on frame-level scene features, while more advanced methods extract spatial and temporal features to build context vectors. The lack of methods that combine action knowledge with an object's motion features prevents existing models from producing action-object-oriented captions, and the presence of multiple moving objects yields disoriented captions in state-of-the-art methods. We propose a partial grid-based method for extracting action-object-oriented features, which captures an object's motion, its interactions with other objects, and its movement within the scene. From these features the method constructs a graph neural network, which is then processed with graph-based filters. For training, validation, and evaluation, 75 videos from the MSVD dataset were re-annotated based on object activity and interaction. The proposed model produces object-action-based video captions covering both object-action and object-background interaction. BLEU- and METEOR-based evaluation results demonstrate the workability of graph neural network-based methods and the superiority of the proposed approach.
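The abstract's pipeline (object features → graph construction → graph-based filtering) can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation; the graph layout, the mean-aggregation filter, and all names here are assumptions chosen only to show one plausible reading of "constructs a graph neural network, which is then used with graph-based filters": objects become nodes, observed interactions become edges, and one propagation step smooths each object's feature vector over its neighbours.

```python
# Hypothetical sketch (not the paper's code): objects as graph nodes,
# pairwise interactions as edges, and one mean-aggregation graph filter.

def graph_conv(features, edges):
    """One propagation step: each node averages its own feature vector
    with those of its neighbours (a basic mean-aggregation GNN filter)."""
    n = len(features)
    neighbours = {i: {i} for i in range(n)}  # self-loops keep own features
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
         for d in range(dim)]
        for i in range(n)
    ]

# Toy example: three detected objects; object 0 interacts with 1 and 2.
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # per-object motion features
edges = [(0, 1), (0, 2)]                      # observed interactions
smoothed = graph_conv(feats, edges)
```

After one step, each object's representation mixes in the features of the objects it interacts with, which is the property a downstream caption decoder would exploit to describe object-object and object-background interactions.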

Key words: video understanding, graph neural network, video captioning, object-level analysis, object-action video captions