About
MLLM Tutorial
Welcome to the MLLM Tutorial series at LREC-COLING 2024!
Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. Multimodal large language models (MLLMs), a multidisciplinary research field, have recently attracted growing interest in both academia and industry, with an unprecedented push to pursue human-level AI through them. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research on MLLMs, focusing on four key areas: MLLM architecture design, instructional learning, multimodal reasoning, and MLLM efficiency. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.
Organizer
Presenters
Hao Fei
National University of Singapore
Yuan Yao
National University of Singapore
Zhuosheng Zhang
Shanghai Jiao Tong University
Fuxiao Liu
University of Maryland, College Park
Ao Zhang
National University of Singapore
Tat-Seng Chua
National University of Singapore
Schedule
PROGRAM
Our tutorial was held on Tuesday, May 21, 2024 (all times are in UTC+2, Torino local time).
A video recording of the tutorial (Parts 1-3 only, due to a technical issue) is available on YouTube.
Time | Section | Presenter |
---|---|---|
14:00-14:10 | Part 1: Background and Introduction [Slides] | Hao Fei |
14:10-15:40 | Part 2: MLLM Design: Architecture and Modality [Slides] | Hao Fei & Yuan Yao |
15:40-16:00 | Part 3: Multimodal Instruction Tuning in MLLMs [Slides] | Fuxiao Liu |
16:00-16:30 | Coffee Break, Q&A Session | |
16:30-16:50 | Part 3 (Cont'd): Multimodal Instruction Tuning in MLLMs | Fuxiao Liu |
16:50-17:30 | Part 4: Multimodal Reasoning in MLLMs [Slides] | Zhuosheng Zhang |
17:30-18:00 | Part 5: MLLM Efficiency [Slides] | Ao Zhang |
Literature
Reading List
Section 1: LLMs and MLLMs
- OpenAI, 2023, Introducing ChatGPT
- OpenAI, 2023, GPT-4 Technical Report
- Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
- Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
- Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
- Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
- Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
- Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
- Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
- Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
- Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
- Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
- Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
- Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
- Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
- Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
- Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
- Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
- Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
- Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
- Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
- Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
- Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
- Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
- Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
- Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
- Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
- Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
- Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
- Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
- Frey, et al., 2023, Neural Scaling of Deep Chemical Models
- Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model
- Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
- Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
- Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
- Koh, et al., 2023, Generating Images with Multimodal Language Models
- Sun, et al., 2023, Generative Pretraining in Multimodality
- Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
- Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
- Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
- Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
- Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
- Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
- Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
- Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
- Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
- Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
- Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
- Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
- Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
- Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
- Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
- Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
- Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
- Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence
- Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents
Section 2: Instruction Tuning
- Liu, et al., 2023, Visual Instruction Tuning
- Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
- Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
- Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
- Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
- Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
- Yin, et al., 2023, A Survey on Multimodal Large Language Models
- Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models
Section 3: Reasoning with LLM
- Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
- Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
- Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
- Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
- Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
- Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
- Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
- Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
- Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
- Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
- Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Section 4: Efficient Learning
- Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
- Yao, et al., 2024, MiniCPM-V
- DeepSpeed Team, 2020, DeepSpeed Blog
- Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Chen, et al., 2023, MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
- Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
- Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation
Citation
@inproceedings{fei2024multimodal,
title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning, Efficiency and Beyond},
author={Fei, Hao and Yao, Yuan and Zhang, Zhuosheng and Liu, Fuxiao and Zhang, Ao and Chua, Tat-Seng},
booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries},
pages={1--8},
year={2024}
}
Contact
Contact us