Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis


Video-MME applies to both image MLLMs (generalized to multi-image input) and video MLLMs. 🌟


🔥 News

  • 2024.06.15 🌟 We have refreshed our evaluation: 1) we replaced broken and potentially broken video links and re-annotated the affected videos; 2) GPT-4o now samples 384 frames (previously 10 from the website) at 512x512 resolution, boosting overall accuracy to 71.9%.
  • 2024.06.03 🌟 We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis!

👀 Video-MME Overview

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements, but their potential in processing sequential visual data is still insufficiently explored. We introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. It is designed to comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities. Video-MME comprises 900 videos totaling 254 hours and 2,700 human-annotated question-answer pairs. Our work distinguishes itself from existing benchmarks through four key features:

  • Duration in temporal dimension. Encompassing short- (< 2min), medium- (4min~15min), and long-term (30min~60min) videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics;
  • Diversity in video types. Spanning 6 primary visual domains, i.e., Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, and Multilingual, with 30 subfields to ensure broad scenario generalizability;
  • Breadth in data modalities. Integrating multi-modal inputs besides video frames, including subtitles and audio, to assess the all-round capabilities of MLLMs;
  • Quality in annotations. All data are newly collected and annotated by humans, not from any existing video dataset, ensuring diversity and quality.

📐 Dataset Examples


🔍 Dataset

License:

Video-MME may be used only for academic research. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.
If there is any infringement in Video-MME, please email [email protected] and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify Video-MME in whole or in part. 
You must strictly comply with the above restrictions.

To request the dataset, please send an email to [email protected]. 🌟

🔮 Evaluation Pipeline

📍 Extract Frames and Subtitles:

Video-MME contains 900 videos and 744 subtitle files in total; every long video has subtitles.

In the with-subtitles setting, you should use only the subtitles that correspond to the sampled video frames. For example, if you extract 10 frames per video for evaluation, take the 10 subtitles that correspond to the timestamps of those 10 frames.

If you have already prepared the video and subtitle files, you can refer to this script to extract the frames and the corresponding subtitles.
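
The frame-to-subtitle matching described above can be sketched as follows. This is a minimal illustration, not the repository's extraction script: the `Subtitle` record, the function names, and the choice of uniform sampling at segment centres are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    """Hypothetical subtitle record; times are in seconds."""
    start: float
    end: float
    text: str


def sample_timestamps(duration: float, num_frames: int) -> list[float]:
    """Return num_frames timestamps at the centres of equal-length segments.

    Uniform sampling is an assumption; use whatever sampler your model uses.
    """
    step = duration / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def subtitles_for_frames(subs: list[Subtitle], timestamps: list[float]) -> list[str]:
    """Collect the subtitle shown at each sampled timestamp.

    Timestamps that fall in a gap contribute nothing, and a subtitle that
    covers several sampled frames is kept only once.
    """
    picked: list[str] = []
    seen: set[int] = set()
    for t in timestamps:
        for idx, sub in enumerate(subs):
            if sub.start <= t <= sub.end:
                if idx not in seen:
                    seen.add(idx)
                    picked.append(sub.text)
                break
    return picked


if __name__ == "__main__":
    subs = [
        Subtitle(0.0, 8.0, "Hi guys, I'm going to show you ..."),
        Subtitle(12.0, 18.0, "First, wash the vegetables."),
        Subtitle(40.0, 50.0, "Now heat the pan."),
    ]
    stamps = sample_timestamps(duration=100.0, num_frames=10)
    print(" ".join(subtitles_for_frames(subs, stamps)))
```

The selected lines can then be joined and used as the subtitle content of the prompt.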

📍 Prompt:

The common prompt used in our evaluation follows this format:

This video's subtitles are listed below:
[Subtitles] 
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option. 
[Question]
The best answer is:

For the subtitles-free setting, you should remove the subtitle content.

Prompt examples:
  • With subtitles:
This video's subtitles are listed below:
Hi guys, I'm going to show you how to perfectly prepare a ...
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
What is the color of the clothing worn by the persons in the video?
A. Black.
B. Gray.
C. Green.
D. Brown.
The best answer is:
  • Without subtitles:
Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
What is the color of the clothing worn by the persons in the video?
A. Black.
B. Gray.
C. Green.
D. Brown.
The best answer is:
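
The two prompt variants above can be assembled programmatically. The helper below is a small sketch; `build_prompt` and its parameters are hypothetical names chosen for illustration, while the fixed strings are copied from the prompt format shown in this section.

```python
from typing import Optional, Sequence

SUBTITLE_HEADER = "This video's subtitles are listed below:"
INSTRUCTION = (
    "Select the best answer to the following multiple-choice question based on "
    "the video. Respond with only the letter (A, B, C, or D) of the correct option."
)
ANSWER_CUE = "The best answer is:"


def build_prompt(
    question: str,
    options: Sequence[str],
    subtitles: Optional[str] = None,
) -> str:
    """Build the evaluation prompt; omit the subtitle part when none are given."""
    lines = []
    if subtitles:
        lines += [SUBTITLE_HEADER, subtitles]
    lines += [INSTRUCTION, question, *options, ANSWER_CUE]
    return "\n".join(lines)


if __name__ == "__main__":
    question = "What is the color of the clothing worn by the persons in the video?"
    options = ["A. Black.", "B. Gray.", "C. Green.", "D. Brown."]
    print(build_prompt(question, options, subtitles="Hi guys, I'm going to show you ..."))
    print()
    print(build_prompt(question, options))
```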

📍 Evaluation:

To extract the answers and calculate the scores, we save the model responses in a JSON file. We provide an example template, output_test_template.json. Once you have prepared the model responses in this format, run the evaluation script eval_your_results.py to obtain accuracy scores across video durations, video domains, video subcategories, and task types. The evaluation does not rely on any third-party models, such as ChatGPT.

python eval_your_results.py \
    --results_file $YOUR_RESULTS_FILE \
    --video_duration_type $VIDEO_DURATION_TYPE \
    --return_categories_accuracy \
    --return_sub_categories_accuracy \
    --return_task_types_accuracy

Please ensure that the results_file follows the specified JSON format stated above, and video_duration_type is specified as either short, medium, or long. If you wish to assess results across various duration types, you can specify multiple types separated by commas or organize them in a list, for example: short,medium,long or ["short","medium","long"].
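
For intuition, rule-based scoring of this kind can be sketched in a few lines. The code below is not eval_your_results.py: the record fields (`duration`, `answer`, `response`) and all function names are assumptions made for the example, and the real template in output_test_template.json may be structured differently.

```python
import json
import re
from collections import defaultdict


def parse_duration_types(value: str) -> list[str]:
    """Accept either 'short,medium,long' or '["short","medium","long"]'."""
    value = value.strip()
    items = json.loads(value) if value.startswith("[") else value.split(",")
    return [item.strip() for item in items if item.strip()]


def extract_letter(response: str) -> str:
    """Return the first standalone A-D letter in the response, or '' if none.

    Deliberately simple: a free-form reply that begins with the article "A"
    would be misread, so responses should follow the letter-only instruction.
    """
    cleaned = response.replace("The best answer is:", " ")
    match = re.search(r"\b([ABCD])\b", cleaned)
    return match.group(1) if match else ""


def accuracy_by_duration(records: list[dict], duration_types: list[str]) -> dict[str, float]:
    """Compute accuracy per duration type over hypothetical flat records."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for rec in records:
        duration = rec["duration"]
        if duration not in duration_types:
            continue
        total[duration] += 1
        if extract_letter(rec["response"]) == rec["answer"]:
            correct[duration] += 1
    return {d: correct[d] / total[d] for d in duration_types if total[d]}


if __name__ == "__main__":
    records = [
        {"duration": "short", "answer": "B", "response": "B"},
        {"duration": "short", "answer": "A", "response": "C. Green."},
        {"duration": "long", "answer": "D", "response": "The best answer is: D"},
    ]
    print(accuracy_by_duration(records, parse_duration_types("short,long")))
```

Because the answer is recovered by string matching alone, no third-party model is needed for scoring.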

📍 Leaderboard:

If you want to add your model to our leaderboard, please send your model responses to [email protected] in the format of output_test_template.json.

📈 Experimental Results

  • Evaluation results of different MLLMs.

  • Evaluation results of different MLLMs across different task types.

  • Evaluation results of Gemini 1.5 Pro across different video duration types.

  • Evaluation results of Gemini 1.5 Pro across different video sub-types.

✒️ Citation

If you find our work helpful for your research, please consider citing our work.

@article{fu2024video,
  title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
  author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
  journal={arXiv preprint arXiv:2405.21075},
  year={2024}
}

📜 Related Works

Explore our related research: