TREC Task on Medical Video Question Answering

MedVidQA 2024


The recent surge in the availability of online videos has changed the way people acquire information and knowledge. Many people prefer instructional videos, which teach how to accomplish a particular task effectively and efficiently through a series of step-by-step procedures. Similarly, medical instructional videos are well suited to answering consumers' healthcare questions that demand instruction, delivering key information through both visual and verbal communication. We aim to extract visual information from a video corpus to answer consumers' first aid, medical emergency, and medical education questions. Extracting the relevant information from a video corpus requires relevant video retrieval, moment localization, video summarization, and captioning skills.

Toward this goal, the TREC Medical Video Question Answering track focuses on developing systems capable of understanding medical videos and providing visual answers (from single and multiple videos) and instructional step captions in response to natural language questions. Emphasizing the importance of multimodal capabilities, the track requires systems to generate instructional questions and captions based on medical video content. Following MedVidQA 2023, TREC 2024 expands the tasks to cover language-video understanding and generation. The track comprises two main tasks: Video Corpus Visual Answer Localization (VCVAL) and Query-Focused Instructional Step Captioning (QFISC).
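To make the two tasks concrete, the sketch below models their outputs as simple Python records: a VCVAL system returns a temporal segment of a retrieved video, while a QFISC system returns captioned instructional steps. The class and field names here are illustrative assumptions; the actual submission format is defined by the track guidelines.

```python
from dataclasses import dataclass

# Illustrative record types only; field names are assumptions,
# not the official track submission schema.

@dataclass
class VisualAnswer:
    """One VCVAL result: a localized answer segment in a retrieved video."""
    question_id: str
    video_id: str
    start_sec: float  # answer segment start, in seconds
    end_sec: float    # answer segment end, in seconds

@dataclass
class StepCaption:
    """One QFISC result: a captioned instructional step within a video."""
    question_id: str
    video_id: str
    start_sec: float
    end_sec: float
    caption: str  # natural-language description of the step

# Example: one localized answer and one captioned step for the same question.
answer = VisualAnswer("q1", "vid42", 12.5, 47.0)
step = StepCaption("q1", "vid42", 12.5, 20.0, "Apply firm pressure to the wound.")
```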


Important Dates

Join our Google Group for important updates! If you have any questions, ask in our Google Group or email us.

Registration and Submission



Evaluation Metrics
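Earlier MedVidQA evaluations scored visual answer localization with temporal Intersection over Union (IoU) between predicted and gold segments, and captioning tasks are commonly scored with metrics such as CIDEr, SPICE, and BERTScore [4, 5, 6]. A minimal sketch of temporal IoU, assuming segments are given as (start, end) pairs in seconds:

```python
def temporal_iou(pred, gold):
    """Temporal Intersection over Union between two (start, end) segments.

    Both arguments are (start_sec, end_sec) pairs; returns a value in [0, 1].
    """
    (ps, pe), (gs, ge) = pred, gold
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return intersection / union if union > 0 else 0.0

# A prediction overlapping the gold segment for 20 of 40 total seconds scores 0.5.
print(temporal_iou((10.0, 40.0), (20.0, 50.0)))  # → 0.5
```

System-level scores are then typically reported as the mean IoU over all questions, or as the fraction of questions whose IoU exceeds a threshold (e.g. 0.3, 0.5, 0.7); the exact thresholds used are defined by the track guidelines.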


Organizers

Deepak Gupta, NLM, NIH
Dina Demner-Fushman, NLM, NIH


[1] Deepak Gupta, Kush Attal, and Dina Demner-Fushman. A dataset for medical instructional video classification and question answering. Scientific Data, 10:158, 2023.
[2] Zhong Ji, Yaru Ma, Yanwei Pang, and Xuelong Li. Query-aware sparse coding for web multi-video summarization. Information Sciences, 478:152–166, 2019.
[3] Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oguz, Yashar Mehdad, and Mohit Bansal. Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23056–23065, 2023.
[4] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
[5] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In Computer Vision – ECCV 2016, pages 382–398. Springer International Publishing, 2016.
[6] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.