Performance Analysis of Video–Audio Action Recognition Using a Cross-Attention-Based Multimodal Fusion Architecture 


Vol. 15, No. 2, pp. 113-120, Feb. 2026
https://doi.org/10.3745/TKIPS.2026.15.2.113


  Abstract

This study analyzes the performance of various fusion strategies for audio-visual action recognition, with cross-attention as the core mechanism. Visual and audio features are extracted by Swin-Transformer-based encoders, and four fusion architectures are built by incrementally combining simple operations (concatenation, summation, and multiplication) with cross-attention, channel-wise gating, and self-attention. All models are evaluated on the Kinetics-Sound dataset under identical training conditions. Experimental results show that multimodal fusion improves accuracy by up to 13 percentage points over the single-modality baselines. In particular, cross-attention effectively learns the semantic alignment between the visual and audio modalities, driving much of the accuracy gain. The final model, which adds self-attention, achieves a Top-1 accuracy of 87.20% and an F1-score of 87.02%. The study offers practical guidance for designing efficient multimodal fusion architectures that capture the complex interactions between visual and auditory modalities.
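
Since the abstract describes the fusion pipeline only at a high level, the PyTorch sketch below illustrates how cross-attention, channel-wise gating, and self-attention could be chained over pre-extracted encoder features. It is a minimal illustration under assumed settings: the embedding dimension, head count, class count, and the mean-pooling used to align sequence lengths are all assumptions, not the paper's reported configuration.

    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        # Illustrative fusion head: bidirectional cross-attention between
        # modalities, a channel-wise gate to mix them, self-attention over
        # the fused sequence, and a linear classifier on the pooled output.
        def __init__(self, dim=768, heads=8, num_classes=32):  # assumed sizes
            super().__init__()
            self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, vis, aud):
            # vis: (B, Tv, dim) visual tokens; aud: (B, Ta, dim) audio tokens.
            v, _ = self.v2a(vis, aud, aud)  # visual queries attend to audio
            a, _ = self.a2v(aud, vis, vis)  # audio queries attend to visual
            a = a.mean(dim=1, keepdim=True).expand_as(v)  # match visual length
            g = self.gate(torch.cat([v, a], dim=-1))      # channel-wise gate in [0, 1]
            fused = g * v + (1 - g) * a                   # gated modality mixture
            fused, _ = self.self_attn(fused, fused, fused)
            return self.classifier(fused.mean(dim=1))     # (B, num_classes) logits

    # Hypothetical usage with random stand-ins for encoder outputs:
    vis = torch.randn(2, 49, 768)   # e.g. tokens from a video Swin encoder
    aud = torch.randn(2, 16, 768)   # e.g. tokens from an audio Swin encoder
    logits = CrossAttentionFusion()(vis, aud)
    print(logits.shape)             # torch.Size([2, 32])

The simpler baselines described in the abstract would replace the attention calls with the corresponding elementwise operation (concatenation, summation, or multiplication) on pooled per-modality features.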



  Cite this article

[IEEE Style]

J. H. Kim, "Performance Analysis of Video–Audio Action Recognition Using a Cross-Attention-Based Multimodal Fusion Architecture," The Transactions of the Korea Information Processing Society, vol. 15, no. 2, pp. 113-120, 2026. DOI: https://doi.org/10.3745/TKIPS.2026.15.2.113.

[ACM Style]

Jun Hwa Kim. 2026. Performance Analysis of Video–Audio Action Recognition Using a Cross-Attention-Based Multimodal Fusion Architecture. The Transactions of the Korea Information Processing Society, 15, 2, (2026), 113-120. DOI: https://doi.org/10.3745/TKIPS.2026.15.2.113.