Apoorva Beedu

I am a Sr. AI Research Engineer at Rivian VW Tech. I received my PhD from the Georgia Institute of Technology, where I worked on video and language understanding and was advised by Dr. Irfan Essa and Dr. Justin Romberg.

My research focuses on understanding video through language cues for tasks such as summarization and action anticipation. I have previously worked on 6D object pose estimation and human activity recognition. Broadly, my research is in the union of

Video Understanding + Multi-Modal Learning + Foundation Models

During the summer of 2021, I was an intern at Facebook Reality Labs, collaborating with Chengde Wan and Robert Wang. In 2020, I interned at Microsoft Research with Dr. Amol Ambadekar. In the summers of 2017 and 2018, I interned at a startup called NodeIn, working with Dr. Suresh Kannan. Prior to joining the Georgia Institute of Technology, I spent a year at Asteria Aerospace Limited as a Software Engineer.

Email: apoorvabeedu [at] gmail [dot] com

GitHub / Google Scholar / LinkedIn / CV

News

June 2026: Our paper Audio2Tool: Speak, Call, Act – A Dataset for Benchmarking Speech Tool Use has been accepted to Interspeech 2026! Check out the project page and dataset.
June 2026: Our paper HierSum: A Global and Local Attention Mechanism for Video Summarization has been accepted to the AI4RWC workshop at CVPR 2026!
July 2025: I successfully defended my PhD thesis, titled "Learning Vision and Language Cues for Video Understanding in Egocentric and Instructional Videos".
June 2025: I am starting as a Sr. AI Research Engineer at Rivian VW Tech!
April 2025: Presenting our papers on Mamba Fusion and Action Anticipation at ICASSP on April 8th. Visit our posters in room 2H if you are at the venue!
Dec 2024: Our papers on Mamba Fusion and Action Anticipation have been accepted to ICASSP 2025!
Dec 2024: Our paper Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition--And Ways to Overcome Them has been accepted to AAAI 2025.

Oct 2023: Presented our paper on Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition at PerDream@ICCV!
Dec 2022: Presented our papers Video based Object 6D Pose Estimation using Transformers and End-to-End Multimodal Representation Learning for Video Dialog at VTTA2022@NeurIPS.
Oct 2022: Served as a reviewer for BMVC 2022 and the NeurIPS VTTA workshop.
Aug 2021: Served as a reviewer for BMVC 2021.
May 2021: Starting summer internship at Facebook Reality Labs as a Research Intern - Interaction Tracking (PhD).
Sept 2020: Our paper on Masked Reconstruction Based Self-Supervision for Human Activity Recognition was accepted to ISWC 2020.

Research

I'm interested in computer vision, machine learning, video analysis and multi-modal representations.

HierSum: A Global and Local Attention Mechanism for Video Summarization
Apoorva Beedu, Irfan Essa
CVPR 2026 workshop on Vision Intelligence for Real-world Challenges
pdf / bibtex
A hierarchical attention mechanism for instructional video summarization that combines fine-grained local cues from subtitles with global context from video-level instructions, using the "most replayed" statistic as supervision to identify the most important segments.

Audio2Tool: Speak, Call, Act – A Dataset for Benchmarking Speech Tool Use
Ramit Pahwa, Apoorva Beedu*, Parivesh Priye, Rutu Gandhi, Saloni Takawale, Aruna Baijal, Zengli Yan
Interspeech 2026*
pdf / bibtex / project page
A dataset for benchmarking speech tool use, spanning eight tiers of spoken commands from direct calls to multi-turn conversations and multi-speaker intent blending across smart car, smart home, and wearable domains. Dataset available on Hugging Face.

Mamba Fusion: Learning Actions Through Questioning
Apoorva Beedu, Zhikang Dong, Jason Sheinkopf, Irfan Essa
2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
pdf / code / bibtex / project page
We introduce MambaVL, an efficient Mamba-based fusion method for vision-language models.

Text Descriptions of Actions and Objects Improve Action Anticipation
Apoorva Beedu, Harish Haresamudram, Irfan Essa
2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
pdf / bibtex / project page / supp
Introduces contrastive learning for action anticipation and evaluates the use of text data as an additional modality. A detailed technical report can be found here.

Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition--And Ways to Overcome Them
Harish Haresamudram, Apoorva Beedu, Mashfiqui Rabbi, Sankalita Saha, Irfan Essa, Thomas Ploetz
Proceedings of the AAAI Conference on Artificial Intelligence 2025
pdf / bibtex / project page
We investigate whether natural language supervision can be used for wearable-sensor-based Human Activity Recognition (HAR) and find that, surprisingly, it performs substantially worse than standard end-to-end training and self-supervision.

Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition
Hyeongju Choi, Apoorva Beedu, Irfan Essa
ICCV 2023 workshop on PerDream: PERception, Decision making and REAsoning through Multimodal foundational modeling
pdf / bibtex
Introduces contrastive learning with hard negative sampling for Human Activity Recognition.

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition
Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa
ArXiv preprint, 2022
pdf / bibtex
We present a multi-modal framework that learns to effectively combine features from RGB video and IMU sensors for Human Activity Recognition.

End-to-End Multimodal Representation Learning for Video Dialog
Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa
NeurIPS 2022 workshop on Vision Transformers: Theory and Applications
pdf / bibtex
We present a framework for the video-based dialog task.

Video based Object 6D Pose Estimation using Transformers
Apoorva Beedu, Huda Alamri, Irfan Essa
NeurIPS 2022 workshop on Vision Transformers: Theory and Applications
pdf / code / bibtex
We introduce VideoPose, a Transformer-based 6D object pose estimation framework with an end-to-end attention-based architecture that attends to previous frames in order to estimate accurate 6D object poses in videos.

Masked reconstruction based self-supervision for human activity recognition
Harish Haresamudram, Apoorva Beedu, Varun Agrawal, Patrick L Grady, Irfan Essa, Judy Hoffman, Thomas Plötz
Proceedings of the 2020 ACM International Symposium on Wearable Computers
pdf / bibtex
We introduce masked reconstruction as a viable self-supervised pre-training objective for human activity recognition and explore its effectiveness in comparison to state-of-the-art unsupervised learning techniques.

Location based payload imaging
J Apoorva, Brinda Mohan, Apoorva Beedu, Mahendra M Nayak, Divya Rao, VK Agrawal
2015 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT)
pdf / bibtex
We present Location-Based Payload Imaging, a method that adds the capability of imaging at any commanded latitude and longitude by estimating the time required to reach that location.

Teaching

Aug 2018 - May 2025: Head TA for OMSCS 6476 Introduction to Computer Vision
Aug 2017 - Aug 2018: GTA for OMSCS 6476 Introduction to Computer Vision

Service

I have reviewed for BMVC (2021-2024), PerDream 2023, and VTTA 2022.

Mentoring

Kara Bethany Liu (now a Machine Learning Engineer at Block)
Zhikang Dong (now a Machine Learning Engineer, Generative AI at Snap Inc.), Jason Sheinkopf (now a Machine Learning Engineer at Bosch) - Work led to the submission Mamba Fusion: Learning Actions Through Questioning (ICASSP 2025)
Hyeongju John Choi (now at Amazon Lab126) - Work led to the submission Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition (PerDream@ICCV 2023)
Hrishikesh Kale (now a Machine Learning Accelerator Modelling Engineer at Google)

Design and source code from Jon Barron's website
