Text Descriptions of Actions and Objects Improve Action Anticipation

Apoorva Beedu, Harish Haresamudram, Irfan Essa
Georgia Institute of Technology, Google Research
ICASSP 2025

Abstract

Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet additional, complementary information from different modalities helps narrow down plausible action choices. Going beyond typical sources such as video and audio, we primarily explore how text descriptions of actions and objects lead to more accurate action anticipation, as they provide additional contextual cues, e.g., about the environment and its contents. We propose the Multi-modal Contrastive Anticipative Transformer (M-CAT), which is trained in two stages: the model first learns to align video and other modalities with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Through extensive experimental evaluation, we demonstrate that M-CAT outperforms baselines on the EpicKitchens datasets, and show that explicitly incorporating object and action information via their text descriptions leads to more effective action anticipation. Code is available at https://github.com/ApoorvaBeedu/M-CAT.
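To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch sketch of stage 1 (CLIP-style contrastive alignment between observed video and text describing the future action) and stage 2 (fine-tuning a classification head to predict the future action). All module names, the `embed_dim` and `num_actions` values, and the loss formulation are illustrative assumptions for exposition, not the released M-CAT implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch of the two-stage training described in the abstract.
# VideoEncoder/TextEncoder are assumed to be user-supplied modules that map
# their inputs to (B, embed_dim) feature tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAnticipator(nn.Module):
    def __init__(self, video_encoder, text_encoder, embed_dim=512, num_actions=3806):
        super().__init__()
        self.video_encoder = video_encoder   # e.g., a video transformer backbone
        self.text_encoder = text_encoder     # e.g., a pretrained language model
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style
        self.head = nn.Linear(embed_dim, num_actions)         # stage-2 anticipation head

    def align_loss(self, video, text_tokens):
        """Stage 1: align observed-clip embeddings with descriptions of the future action."""
        v = F.normalize(self.video_encoder(video), dim=-1)       # (B, D)
        t = F.normalize(self.text_encoder(text_tokens), dim=-1)  # (B, D)
        logits = self.logit_scale.exp() * v @ t.t()              # (B, B) pairwise similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: match each clip to its description, and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def anticipate(self, video):
        """Stage 2: classify the future action from the observed clip."""
        return self.head(self.video_encoder(video))
```

In one plausible fine-tuning setup, the stage-1 alignment loss is dropped (or kept as an auxiliary term) and `anticipate` is trained with cross-entropy against the ground-truth future action label.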

Model Overview

Results

Qualitative Results

Video

Poster

BibTeX

@INPROCEEDINGS{10888324,
  author={Beedu, Apoorva and Haresamudram, Harish and Essa, Irfan},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Text Descriptions of Actions and Objects Improve Action Anticipation},
  year={2025},
  pages={1-5},
  keywords={Accuracy;Codes;Predictive models;Signal processing;Transformers;Acoustics;Speech processing;Action Anticipation;Multi-Modal;VLM},
  doi={10.1109/ICASSP49660.2025.10888324}}