From Multimodal LLM to Human-level AI:

Evaluations and Benchmarks

About

MLLM Tutorial

Welcome to the MLLM Tutorial series at CVPR 2025!

Although a variety of benchmarks for evaluating Multimodal Large Language Models (MLLMs) have emerged, the validity and effectiveness of these evaluations remain open to discussion. This tutorial addresses the need for comprehensive and scientifically valid benchmarks in MLLM development. It offers a systematic overview of current MLLM benchmarks and discusses the performance improvements needed to reach human-level AGI. We will introduce recent developments in MLLMs, survey existing benchmarks, and explore evaluation methods. Detailed discussions will cover vision-language capabilities, video-modality evaluation, and expert-level skills across multiple disciplines. We will further identify gaps in benchmarking multimodal generalists and introduce methods for comprehensively evaluating MLLMs on the path toward multimodal AGI. Finally, a special focus will be placed on addressing and mitigating the frequent hallucination phenomena in MLLMs to enhance model reliability.



🔔News

🔥[2025-05-10]: Our tutorial will be held at CVPR'25 — see you at Music City Center, Nashville, TN (Wed June 11th - Sun June 15th)!


Organizers

Presenters

Hao Fei

National University of Singapore

Xiang Yue

Carnegie Mellon University

Kaipeng Zhang

Shanghai AI Lab

Long Chen

HKUST

Jian Li

Tencent YoutuLab

Xinya Du

University of Texas at Dallas

Schedule

PROGRAM

The exact program date and timeline are not yet finalized and will be updated later.



Time (TBD) Section Presenter
13:30-13:35 Part 1: Background and Introduction [Slides] Hao Fei
13:35-14:05 Part 2: Existing MLLM Benchmark Overall Survey [Slides] Jian Li
14:05-14:35 Part 3: Vision-Language Capability Evaluation [Slides] Kaipeng Zhang
14:35-15:05 Part 4: Video Capability Evaluation [Slides] Long Chen
15:05-16:00 Coffee Break, Q&A Session
16:00-16:30 Part 5: Expert-level Discipline Capability Evaluation [Slides] Xiang Yue
16:30-17:00 Part 6: Comprehensive Evaluation: Path to Multimodal Generalist [Slides] Hao Fei
17:00-17:30 Part 7: MLLM Hallucination Evaluation [Slides] Xinya Du
17:30-18:00 Part 8: Panel Discussion - From MM Generalist to Human-level AI: from evaluation and benchmark perspective All

Tutorial Record

Video

TBD

Literature

Reading List

Part I: Survey

  1. Li, et al., 2024, A Survey on Benchmarks of Multimodal Large Language Models
  2. Li, et al., 2024, A Survey on Multimodal Benchmarks: In the Era of Large AI Models
  3. Huang, et al., 2024, A Survey on Evaluation of Multimodal Large Language Models
  4. Fu, et al., 2024, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Part II: Benchmarks

  1. Yue, et al., 2024, MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  2. Fei, et al., 2025, On Path to Multimodal Generalist: General-Level and General-Bench
  3. Li, et al., 2023, MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
  4. Fu, et al., 2023, MME: A comprehensive evaluation benchmark for multimodal large language models
  5. Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
  6. Yu, et al., 2023, MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
  7. Xia, et al., 2024, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
  8. Wu, et al., 2023, Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
  9. Liu, et al., 2023, MMBench: Is Your Multi-modal Model an All-around Player?
  10. Meng, et al., 2024, MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
  11. Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
  12. Chen, et al., 2024, MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
  13. Li, et al., 2023, SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
  14. Li, et al., 2023, SEED-Bench-2: Benchmarking Multimodal Large Language Models

Citation

Citation

@inproceedings{fei2024multimodal,
  title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond},
  author={Fei, Hao and Li, Xiangtai and Liu, Haotian and Liu, Fuxiao and Zhang, Zhuosheng and Zhang, Hanwang and Yan, Shuicheng},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={11289--11291},
  year={2024}
}


@inproceedings{fei2024multimodallrec,
  title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning, Efficiency and Beyond},
  author={Fei, Hao and Yao, Yuan and Zhang, Zhuosheng and Liu, Fuxiao and Zhang, Ao and Chua, Tat-Seng},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries},
  pages={1--8},
  year={2024}
}


@inproceedings{fei2025pathmultimodalgeneralistgenerallevel,
  title={On Path to Multimodal Generalist: General-Level and General-Bench},
  author={Hao Fei and Yuan Zhou and Juncheng Li and Xiangtai Li and Qingshan Xu and Bobo Li and Shengqiong Wu and Yaoting Wang and Junbao Zhou and Jiahao Meng and Qingyu Shi and Zhiyuan Zhou and Liangtao Shi and Minghe Gao and Daoan Zhang and Zhiqi Ge and Weiming Wu and Siliang Tang and Kaihang Pan and Yaobo Ye and Haobo Yuan and Tao Zhang and Tianjie Ju and Zixiang Meng and Shilin Xu and Liyu Jia and Wentao Hu and Meng Luo and Jiebo Luo and Tat-Seng Chua and Shuicheng Yan and Hanwang Zhang},
  booktitle={Proceedings of ICML},
  year={2025},
}


@inproceedings{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
  booktitle={Proceedings of CVPR},
  year={2024},
}


@article{li2024survey,
  title={A survey on benchmarks of multimodal large language models},
  author={Li, Jian and Lu, Weiheng and Fei, Hao and Luo, Meng and Dai, Ming and Xia, Min and Jin, Yizhang and Gan, Zhenye and Qi, Ding and Fu, Chaoyou and others},
  journal={arXiv preprint arXiv:2408.08632},
  year={2024}
}


@article{li2024surveymultimodal,
  title={A Survey on Multimodal Benchmarks: In The Era of Large AI Models},
  author={Li, Lin and Chen, Guikun and Shi, Hanrong and Xiao, Jun and Chen, Long},
  journal={arXiv preprint arXiv:2409.18142},
  year={2024}
}


Contact

Contact us

Join and post at our Google Group!
Email the organizers at mllm24@googlegroups.com.