Mamba Fusion: Learning Actions Through Questioning

Georgia Institute of Technology, Stony Brook University, Google Research
ICASSP 2025

*Indicates Equal Contribution

Abstract

Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture a more comprehensive understanding of the actions in the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation. The code is available at https://github.com/Dongzhikang/MambaVL
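The core fusion idea above — running a selective state space scan over concatenated video and language tokens while sharing the state transition matrix between the two modalities — can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (single head, no hardware-aware parallel scan, random projections standing in for learned ones); the function names and shapes are hypothetical and do not reproduce the released MambaVL implementation.

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    """Simplified selective SSM recurrence (sequential, for clarity).

    x:     (T, D) input token features
    A:     (D, N) state transition matrix -- SHARED across modalities
    B, C:  (T, N) input-dependent projections (the "selective" part)
    delta: (T, D) per-token step sizes used for discretization
    Returns y: (T, D) outputs.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))          # hidden state carried across the sequence
    y = np.zeros((T, D))
    for t in range(T):
        dA = np.exp(delta[t][:, None] * A)       # discretized transition (D, N)
        dB = delta[t][:, None] * B[t][None, :]   # discretized input matrix (D, N)
        h = dA * h + dB * x[t][:, None]          # state update
        y[t] = h @ C[t]                          # readout per channel
    return y

rng = np.random.default_rng(0)
D, N = 8, 4
# Negative entries keep the discretized transition exp(delta * A) stable (< 1).
A = -np.exp(rng.standard_normal((D, N)))

# Hypothetical token features: 6 video tokens followed by 3 question tokens.
video = rng.standard_normal((6, D))
text = rng.standard_normal((3, D))
x = np.concatenate([video, text])  # joint sequence scanned with ONE shared A,
T = x.shape[0]                     # so state flows across modality boundaries

B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
delta = 0.1 * np.abs(rng.standard_normal((T, D)))

y = selective_scan(x, A, B, C, delta)
print(y.shape)  # one output per token, video and language alike: (9, 8)
```

Because the transition matrix A is shared, information accumulated from video tokens is still in the hidden state when the question tokens are processed, which is the mechanism the abstract credits for the joint understanding of actions in the scene.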

Model Overview

Video

Poster

BibTeX

@inproceedings{beedu2025mamba,
  title={Mamba Fusion: Learning Actions Through Questioning},
  author={Beedu, Apoorva and Dong, Zhikang and Sheinkopf, Jason and Essa, Irfan},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}