About
MLLM Tutorial
Welcome to the MLLM Tutorial series at ACM MM 2024!
Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, reflecting an unprecedented push toward human-level AI. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial delivers a comprehensive review of cutting-edge research on MLLMs, focusing on the following key areas: MLLM architecture, modality, functionality, instruction tuning, multimodal hallucination, MLLM evaluation, and multimodal reasoning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.
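As a small taste of the architecture topics covered in Part 2, the sketch below illustrates the common "frozen vision encoder + lightweight projector + LLM" recipe used by several models in the reading list (e.g., BLIP-2, MiniGPT-4, LLaVA). It is a minimal, hypothetical PyTorch example: all module names, dimensions, and the linear stub standing in for a real vision encoder are illustrative assumptions, not any specific model's implementation.

```python
# Minimal sketch of a projector-style MLLM connector (hypothetical, for illustration only):
# visual patch features are projected into the LLM token space and prepended to the text prompt.
import torch
import torch.nn as nn

class ToyMLLMConnector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Stand-in for a frozen vision encoder (a real system would use a ViT such as CLIP/EVA).
        self.vision_encoder = nn.Linear(3 * 14 * 14, vision_dim)
        # Two-layer MLP projector mapping visual features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, 3*14*14) flattened image patches
        # text_embeds: (batch, seq_len, llm_dim) embeddings of the text prompt
        visual_feats = self.vision_encoder(patches)   # (B, P, vision_dim)
        visual_tokens = self.projector(visual_feats)  # (B, P, llm_dim)
        # Prepend projected visual tokens to the text sequence; the combined sequence
        # would then be fed to a (frozen or fine-tuned) LLM decoder.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Usage: fuse 256 image-patch tokens with a 16-token text prompt.
connector = ToyMLLMConnector()
patches = torch.randn(1, 256, 3 * 14 * 14)
text = torch.randn(1, 16, 4096)
print(connector(patches, text).shape)  # torch.Size([1, 272, 4096])
```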
News
[2024-11-03]: You can now watch the video recording of the tutorial on YouTube!
[2024-11-02]: We have released all the slides!
[2024-10-22]: You can also join the tutorial online via this Zoom link!
[2024-10-20]: For in-person attendance, please come to Meeting Room 210 at the Melbourne Convention and Exhibition Centre.
[2024-10-10]: This tutorial will be held on Monday, 28 October 2024.
Organizer
Presenters
Hao Fei, National University of Singapore
Xiangtai Li, ByteDance/TikTok
Haotian Liu, xAI
Fuxiao Liu, University of Maryland, College Park
Zhuosheng Zhang, Shanghai Jiao Tong University
Hanwang Zhang, Nanyang Technological University
Kaipeng Zhang, Shanghai AI Lab
Shuicheng Yan, Kunlun 2050 Research, Skywork AI
Schedule
PROGRAM
The tutorial will be held on Monday, 28 October 2024 (all times are in UTC/GMT+11, i.e., Melbourne VIC local time).
You can also join online via the Zoom Meeting.
Time | Section | Presenter |
---|---|---|
09:00-09:05 | Part 1: Background and Introduction [Slides] | Hao Fei |
09:05-09:35 | Part 2: MLLM Architecture & Modality [Slides] | Hao Fei |
09:35-10:00 | Part 3: MLLM Functionality & Advances [Slides] | Xiangtai Li |
10:00-10:30 | Part 4: MLLM Instruction Tuning [Slides] | Haotian Liu |
10:30-11:00 | Coffee Break, Q&A Session | |
11:00-11:25 | Part 5: MLLM Hallucination [Slides] | Fuxiao Liu |
11:25-11:50 | Part 6: MLLM Evaluation & Generalist [Slides] | Hanwang Zhang |
11:50-12:10 | Part 7: MM Reasoning [Slides] | Zhuosheng Zhang |
12:10-12:30 | Part 8: Panel Discussion - From MM Generalist to Human-level AI | All + Kaipeng Zhang + Shuicheng Yan |
Tutorial Record
Video
Literature
Reading List
Architecture and Modality of LLMs and MLLMs
- OpenAI, 2023, Introducing ChatGPT
- OpenAI, 2023, GPT-4 Technical Report
- Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
- Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
- Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
- Yao, et al., 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone
- Chen, et al., 2023, MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
- Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
- Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation
- Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
- Zhang, et al., 2024, OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
- Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
- Dong, et al., 2023, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
- Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
- Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
- Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
- Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
- Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
- Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
- Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs
- Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
- Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
- Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
- Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
- Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
- Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
- Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
- Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
- Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
- Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
- Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
- Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
- Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
- Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
- Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
- Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
- Frey, et al., 2023, Neural Scaling of Deep Chemical Models
- Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model
- Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
- Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
- Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
- Koh, et al., 2023, Generating Images with Multimodal Language Models
- Sun, et al., 2023, Generative Pretraining in Multimodality
- Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
- Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
- Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
- Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
- Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
- Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
- Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents
Functionality and Recent Advances in MLLMs
- Li, et al., 2022, GLIP: Grounded Language-Image Pre-training
- Kamath, et al., 2021, MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
- Liu, et al., 2023, Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
- Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
- Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
- Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
- Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
- Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
- Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
- Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
- Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
- You, et al., 2023, Ferret: Refer and Ground Anything Anywhere at Any Granularity
- Zhang, et al., 2024, Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
- Ma, et al., 2024, Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
- Pramanick, et al., 2024, Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
- Yan, et al., 2024, VISA: Reasoning Video Object Segmentation via Large Language Models
- Huang, et al., 2024, Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model
- Chen, et al., 2024, Grounded 3D-LLM with Referent Tokens
- Zhang, et al., 2024, OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
- Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- Xu, et al., 2024, SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
- Shen, et al., 2024, LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
- Lin, et al., 2024, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
- Zong, et al., 2024, MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Instruction Tuning & Hallucination
- Liu, et al., 2023, Visual Instruction Tuning
- Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
- Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
- Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
- Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
- Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
- Yin, et al., 2023, A Survey on Multimodal Large Language Models
- Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models
MLLM Evaluation and Benchmarks
- Fei, et al., 2024, Path to Multimodal Generalist: Levels and Benchmarks
- Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
- Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
- Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
- Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence
Multimodal Reasoning and Agent
- Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
- Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
- Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
- Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
- Sun, et al., 2023, Generative multimodal models are in-context learners
- Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
- Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
- Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
- Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
- Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
- Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
- Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
Citation
@inproceedings{fei2024multimodal,
  title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond},
  author={Fei, Hao and Li, Xiangtai and Liu, Haotian and Liu, Fuxiao and Zhang, Zhuosheng and Zhang, Hanwang and Yan, Shuicheng},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={11289--11291},
  year={2024}
}
Contact us