From Multimodal LLM to Human-level AI:

Modality, Instruction, Reasoning, Efficiency and Beyond


MLLM Tutorial

Welcome to the MLLM Tutorial series at CVPR 2024!

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. Multimodal large language models (MLLMs), a multidisciplinary research area, have recently attracted growing interest in both academia and industry, and there is an increasing push to pursue human-level AI through them. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on four key areas: MLLM architecture design, instruction learning and hallucination, multimodal reasoning, and efficient learning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.

Seattle local time (UTC-7): Tuesday, June 18, 1:30 PM - 6:00 PM

Beijing time (UTC+8): Wednesday, June 19, 4:30 AM - 9:00 AM


🔥[2024-06-19]: You can now watch the video recording of the tutorial on YouTube!
🔥[2024-06-19]: We have released all the slides!
🔥[2024-06-18]: Our tutorial is about to start in room Summit 446 for in-person attendees!
🔥[2024-06-18]: You can also join the tutorial online via this Zoom link!



Hao Fei

National University of Singapore

Yuan Yao

National University of Singapore

Ao Zhang

National University of Singapore

Haotian Liu

University of Wisconsin-Madison

Fuxiao Liu

University of Maryland, College Park

Zhuosheng Zhang

Shanghai Jiao Tong University

Hanwang Zhang

Nanyang Technological University

Shuicheng Yan

Kunlun 2050 Research, Skywork AI



Our tutorial will be held on Tuesday, June 18, 2024 (all times are in Seattle local time, UTC-7).

| Time | Section | Presenter |
|---|---|---|
| 13:30-13:35 | Part 1: Background and Introduction [Slides] | Hao Fei |
| 13:35-14:05 | Part 2: MLLM Architecture [Slides] | Yuan Yao |
| 14:05-14:35 | Part 3: MLLM Modality & Functionality [Slides] | Hao Fei |
| 14:35-15:05 | Part 4: MLLM Instruction Tuning [Slides] | Haotian Liu |
| | Coffee Break, Q&A Session | |
| 16:00-16:30 | Part 5: MLLM Hallucination [Slides] | Fuxiao Liu |
| 16:30-17:00 | Part 6: MM Reasoning [Slides] | Zhuosheng Zhang |
| 17:00-17:30 | Part 7: MLLM Efficiency [Slides] | Ao Zhang |
| 17:30-18:00 | Part 8: Panel Discussion - From MM Generalist to Human-level AI | All + Hanwang Zhang + Shuicheng Yan |

Tutorial Recording



Reading List

Section I: LLMs and MLLMs

  1. OpenAI, 2022, Introducing ChatGPT
  2. OpenAI, 2023, GPT-4 Technical Report
  3. Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
  4. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  5. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  6. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  7. Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  8. Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
  9. Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
  10. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
  11. Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  12. Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
  13. Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  14. Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
  15. Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
  16. Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
  17. Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  18. Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
  19. Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
  20. Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
  21. Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
  22. Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
  23. Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
  24. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  25. Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs
  26. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  27. Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  28. Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
  29. Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
  30. Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
  31. Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
  32. Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
  33. Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  34. Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
  35. Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
  36. Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
  37. Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
  38. Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
  39. Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
  40. Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  41. Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
  42. Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
  43. Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
  44. Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
  45. Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
  46. Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
  47. Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
  48. Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
  49. Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
  50. Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
  51. Frey, et al., 2023, Neural Scaling of Deep Chemical Models
  52. Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model
  53. Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
  54. Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
  55. Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
  56. Koh, et al., 2023, Generating Images with Multimodal Language Models
  57. Sun, et al., 2023, Generative Pretraining in Multimodality
  58. Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
  59. Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
  60. Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  61. Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
  62. Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
  63. Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
  64. Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
  65. Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
  66. Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
  67. Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
  68. Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
  69. Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
  70. Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
  71. Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
  72. Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
  73. Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
  74. Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
  75. Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
  76. Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
  77. Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  78. Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
  79. Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
  80. Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
  81. Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence
  82. Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

Section II: Instruction Tuning & Hallucination

  1. Liu, et al., 2023, Visual Instruction Tuning
  2. Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
  3. Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  4. Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
  5. Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
  6. Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
  7. Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  9. Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
  10. Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
  11. Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
  12. Yin, et al., 2023, A Survey on Multimodal Large Language Models
  13. Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models

Section III: Reasoning with LLM

  1. Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
  2. Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
  3. Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  4. Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
  5. Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
  6. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  7. Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
  8. Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
  9. Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
  10. Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
  11. Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
  12. Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
  13. Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

Section IV: Efficient Learning

  1. Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models
  2. Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
  3. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  4. Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
  5. Yao, et al., 2024, MiniCPM-V
  6. DeepSpeed Team, 2020, DeepSpeed Blog
  7. Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  9. Chen, et al., 2023, MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
  10. Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
  11. Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  12. Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
  13. Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
  14. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
  15. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  16. Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation


Contact us

Join and post at our Google Group!
Email the organizers at .