Shared Task on Medical Video Question Answering

MedVidQA 2022


One of the key goals of artificial intelligence (AI) is the development of a multimodal system that facilitates communication with the visual world (image, video) using a natural language query. In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. With increasing interest in AI to support clinical decision-making and improve patient engagement, there is a need to explore such challenges and develop efficient algorithms for medical language-video understanding. This shared task introduces a new challenge to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions.

Consider the medical question “How do you place a tourniquet in case of fingertip avulsions?” A textual answer to this question would be hard to understand and act upon without visual aid. To provide visual aid, we first need to identify a relevant video that is medical and instructional in nature. Once we find a relevant video, it is often the case that the entire video cannot be considered the answer to the question. Instead, we want to refer to a particular temporal segment, or moment, of the video where the answer is shown or the explanation is illustrated [1]. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions.


Important Dates

Join our Google Group for important updates! If you have any questions, ask in our Google Group or email us.

Registration and Submission


Task 1: Medical Video Classification (MVC)

Given an input video, the task is to categorize the video into one of the following classes:
Medical Instructional
Medical Non-Instructional
Non-medical
Sample videos from each video category

Task 2: Medical Visual Answer Localization (MVAL)

Given a medical or health-related question and a video, the task is to locate the temporal segment (start and end timestamps) in the video where the answer to the question is shown or the explanation is illustrated. In other words, the task seeks a video segment that serves as a visual answer to the natural language question. The MVAL task can be viewed as medical instructional activity-based frame localization: a potential solution first searches for all medical instructional activities in an untrimmed medical instructional video and then localizes the particular activity that is aligned with the given medical or health-related question.

Sample question and its visual answer from the video.
For more details, please see our data description paper.


MVC Dataset
MVAL Dataset
The datasets can be downloaded from CodaLab.

Evaluation Metrics

MVC Evaluation
The evaluation metric will be (a) F1 Score on Medical Instructional Class, and (b) Average macro-level F1 score across all the classes.
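Both metrics can be computed with standard per-class F1 arithmetic. Below is a minimal sketch, assuming gold and predicted labels are given as parallel lists of class names (the label strings here are illustrative, not the official submission format):

```python
def f1_per_class(gold, pred, cls):
    """F1 score for a single class, computed from parallel label lists."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, classes):
    """Macro-level F1: unweighted mean of per-class F1 across all classes."""
    return sum(f1_per_class(gold, pred, c) for c in classes) / len(classes)
```

Metric (a) would then be `f1_per_class(gold, pred, "medical_instructional")`, and metric (b) would be `macro_f1` over the three classes.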
MVAL Evaluation

We will evaluate the results using (a) Intersection over Union (IoU) and (b) mIoU, the average IoU over all testing samples. Following [4], we will use “R@n, IoU = μ”, which denotes the percentage of questions for which at least one of the top-n predicted temporal segments overlaps the ground-truth temporal segment with IoU larger than μ. We will use n = 1 and μ ∈ {0.3, 0.5, 0.7} to evaluate the results.
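The metrics above can be sketched as follows, assuming each segment is a (start, end) pair in seconds and one predicted segment per question (n = 1); the code treats IoU ≥ μ as a hit, which is the common convention:

```python
def temporal_iou(pred, gold):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, golds):
    """mIoU: average IoU over all testing samples."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)

def recall_at_1(preds, golds, mu):
    """R@1, IoU = mu: fraction of questions whose top-1 prediction
    reaches an IoU of at least mu with the ground-truth segment."""
    return sum(temporal_iou(p, g) >= mu for p, g in zip(preds, golds)) / len(preds)
```

For example, a prediction of (10, 20) against a ground truth of (15, 25) has an intersection of 5 s and a union of 15 s, i.e. IoU = 1/3, so it counts as a hit at μ = 0.3 but not at μ = 0.5 or 0.7.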


Deepak Gupta NLM, NIH
Dina Demner-Fushman NLM, NIH


[1] Deepak Gupta, Kush Attal, and Dina Demner-Fushman. A Dataset for Medical Instructional Video Classification and Question Answering. arXiv preprint arXiv:2201.12888, 2022.
[2] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[3] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
[4] Yitian Yuan, Tao Mei, and Wenwu Zhu. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9159–9166, 2019.