Soumyabrata Chaudhuri (Soumya)
I’m currently an MS in Computer Science student at UT Austin. Before that, I completed my B.Tech (Honors) degree in Computer Science and Engineering at the Indian Institute of Technology (IIT), Bhubaneswar. During my undergraduate studies, I conducted research across interrelated domains, including Machine Learning, Computer Vision, Natural Language Processing, and Multi-Modal Learning.
For my B.Tech thesis, I collaborated with Microsoft India to improve the travel-planning process using large language models, under the supervision of Dr. Shreya Ghosh, Dr. Abhik Jana, and Dr. Manish Gupta. I spent a wonderful summer interning at the University of Alberta’s Vision and Learning Lab (MITACS GRI), where I worked with Prof. Li Cheng on motion imitation learning and text-to-3D generation. I also had an enriching research internship at IIT Kharagpur, collaborating with Dr. Saumik Bhattacharya on multi-modal learning and action recognition in videos.
Email /
GitHub /
Google Scholar /
Resume /
LinkedIn
Research
'*' indicates equal contribution.
TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning
Soumyabrata Chaudhuri, Pranav Purkar, Ritwik Raghav, Shubhojit Mallick, Manish Gupta, Abhik Jana, Shreya Ghosh
ACL (Main), 2025
paper /
code
We introduce TripCraft, a spatiotemporally coherent travel planning dataset that integrates real-world constraints, including public transit schedules, event availability, diverse attraction categories, and user personas for enhanced personalization.
Simba: Mamba augmented U-ShiftGCN for Skeletal Action Recognition in Videos
Soumyabrata Chaudhuri, Saumik Bhattacharya
arXiv, 2024
paper /
code
Simba is the first attempt to leverage the Mamba model for the skeleton action recognition task in videos.
ViLP: Knowledge Exploration using Vision, Language and Pose Embeddings for Video Action Recognition
Soumyabrata Chaudhuri, Saumik Bhattacharya
ACM ICVGIP (Oral), 2023
paper /
code
ViLP explores cross-modal knowledge from a pre-trained vision-language model (e.g., CLIP) to introduce a novel combination of pose, visual information, and text attributes.