From Multimodal LLM to Human-level AI:

Architecture, Modality, Function, Instruction, Hallucination, Evaluation, Reasoning and Beyond

28 October - 1 November 2024, Melbourne, Australia

About

MLLM Tutorial

Welcome to the MLLM Tutorial series at ACM MM 2024!

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, showing an unprecedented trend toward achieving human-level AI. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including linguistic, visual, auditory, and other sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research on MLLMs, focusing on the following key areas: MLLM architecture, modality, functionality, instruction tuning, multimodal hallucination, MLLM evaluation, and multimodal reasoning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.

🔔 News

🔥[2024-11-03]: The video recording of the tutorial is now available on YouTube!
🔥[2024-11-02]: We have released all the slides!
🔥[2024-10-22]: You may also join the tutorial online via this Zoom link!
🔥[2024-10-20]: For in-person attendance, please come to Meeting Room 210 at the Melbourne Convention and Exhibition Centre.
🔥[2024-10-10]: This tutorial will be held on Monday, 28 October 2024.


Organizer

Presenters

https://haofei.vip/

Hao Fei

National University of Singapore

Xiangtai Li

ByteDance/TikTok

Fuxiao Liu

University of Maryland, College Park

Zhuosheng Zhang

Shanghai Jiao Tong University

Hanwang Zhang

Nanyang Technological University

Kaipeng Zhang

Shanghai AI Lab

Shuicheng Yan

Kunlun 2050 Research, Skywork AI

Schedule

PROGRAM

The tutorial will be held on Monday, 28 October 2024 (all times are in UTC/GMT+11, Melbourne, VIC local time).

You can also join online via Zoom Meeting.



Time Section Presenter
09:00-09:05 Part 1: Background and Introduction [Slides] Hao Fei
09:05-09:35 Part 2: MLLM Architecture & Modality [Slides] Hao Fei
09:35-10:00 Part 3: MLLM Functionality & Advances [Slides] Xiangtai Li
10:00-10:30 Part 4: MLLM Instruction Tuning [Slides] Haotian Liu
10:30-11:00 Coffee Break, Q&A Session
11:00-11:25 Part 5: MLLM Hallucination [Slides] Fuxiao Liu
11:25-11:50 Part 6: MLLM Evaluation & Generalist [Slides] Hanwang Zhang
11:50-12:10 Part 7: MM Reasoning [Slides] Zhuosheng Zhang
12:10-12:30 Part 8: Panel Discussion - From MM Generalist to Human-level AI All presenters + Kaipeng Zhang + Shuicheng Yan

Tutorial Record

Video

Literature

Reading List

Architecture and Modality of LLMs and MLLMs

  1. OpenAI, 2022, Introducing ChatGPT
  2. OpenAI, 2023, GPT-4 Technical Report
  3. Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
  4. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  5. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  6. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  7. Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  8. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
  9. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  10. Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
  11. Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
  12. Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  13. Yao, et al., 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  14. Chen, et al., 2023, MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
  15. Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
  16. Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation
  17. Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
  18. Zhang, et al., 2024, OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
  19. Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  20. Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
  21. Dong, et al., 2023, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
  22. Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  23. Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
  24. Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
  25. Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
  26. Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
  27. Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
  28. Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
  29. Wu, et al., 2023, Visual-ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  30. Wang, et al., 2023, ModaVerse: Efficiently Transforming Modalities with LLMs
  31. Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  32. Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
  33. Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
  34. Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
  35. Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
  36. Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
  37. Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  38. Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
  39. Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
  40. Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
  41. Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
  42. Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
  43. Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
  44. Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  45. Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
  46. Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
  47. Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
  48. Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
  49. Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
  50. Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
  51. Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
  52. Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
  53. Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
  54. Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
  55. Frey, et al., 2023, Neural Scaling of Deep Chemical Models
  56. Zhang, et al., 2023, ChemLLM: A Chemical Large Language Model
  57. Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
  58. Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
  59. Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
  60. Koh, et al., 2023, Generating Images with Multimodal Language Models
  61. Sun, et al., 2023, Generative Pretraining in Multimodality
  62. Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
  63. Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
  64. Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  65. Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
  66. Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
  67. Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
  68. Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
  69. Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
  70. Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
  71. Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
  72. Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
  73. Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

Functionality and Recent Advances in MLLMs

  1. Li, et al., 2022, GLIP: Grounded Language-Image Pre-training
  2. Kamath, et al., 2021, MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
  3. Liu, et al., 2023, Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
  4. Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
  5. Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
  6. Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
  7. Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
  8. Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
  9. Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
  10. Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
  11. Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
  12. Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
  13. You, et al., 2023, Ferret: Refer and Ground Anything Anywhere at Any Granularity
  14. Zhang, et al., 2024, Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
  15. Ma, et al., 2024, Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  16. Pramanick, et al., 2024, Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
  17. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  18. Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
  19. Yan, et al., 2024, VISA: Reasoning Video Object Segmentation via Large Language Models
  20. Huang, et al., 2024, Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model
  21. Chen, et al., 2024, Grounded 3D-LLM with Referent Tokens
  22. Zhang, et al., 2024, OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
  23. Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
  24. Xu, et al., 2024, SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
  25. Shen, et al., 2024, LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
  26. Lin, et al., 2024, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  27. Zong, et al., 2024, MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Instruction Tuning & Hallucination

  1. Liu, et al., 2023, Visual Instruction Tuning
  2. Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
  3. Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  4. Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
  5. Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
  6. Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
  7. Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  9. Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
  10. Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
  11. Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
  12. Yin, et al., 2023, A Survey on Multimodal Large Language Models
  13. Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models

MLLM Evaluation and Benchmarks

  1. Fei, et al., 2024, Path to Multimodal Generalist: Levels and Benchmarks
  2. Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  3. Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
  4. Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
  5. Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
  6. Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence

Multimodal Reasoning and Agent

  1. Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
  2. Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
  3. Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  4. Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
  5. Sun, et al., 2023, Generative multimodal models are in-context learners
  6. Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
  7. Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
  8. Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
  9. Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
  10. Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
  11. Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
  12. Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents

Citation

Citation

@inproceedings{fei2024multimodal,
  title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond},
  author={Fei, Hao and Li, Xiangtai and Liu, Haotian and Liu, Fuxiao and Zhang, Zhuosheng and Zhang, Hanwang and Yan, Shuicheng},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={11289--11291},
  year={2024}
}


Contact

Contact us

Join and post at our Google Group!
Email the organizers at mllm24@googlegroups.com.