From Multimodal LLM to Human-level AI:

Architecture, Modality, Function, Instruction, Hallucination, Evaluation, Reasoning and Beyond

28 October - 1 November 2024, Melbourne, Australia

About

MLLM Tutorial

Welcome to the MLLM Tutorial series at ACM MM 2024!

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently attracted growing interest in both academia and industry, and are increasingly seen as a path toward human-level AI. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on the following key areas: MLLM architecture, modality, functionality, instructional learning, multimodal hallucination, MLLM evaluation, and multimodal reasoning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.



Organizer

Presenters

https://haofei.vip/

Hao Fei

National University of Singapore

Xiangtai Li

ByteDance/TikTok

Fuxiao Liu

University of Maryland, College Park

Zhuosheng Zhang

Shanghai Jiao Tong University

Hanwang Zhang

Nanyang Technological University

Shuicheng Yan

Kunlun 2050 Research, Skywork AI

Schedule

PROGRAM

The tutorial has the following tentative schedule, which will be updated later.



Time | Section | Presenter
13:30-13:35 | Part 1: Background and Introduction [Slides] | Hao Fei
13:35-14:05 | Part 2: MLLM Architecture & Modality [Slides] | Hao Fei
14:05-14:35 | Part 3: MLLM Functionality & Advances [Slides] | Xiangtai Li
14:35-15:05 | Part 4: MLLM Instruction Tuning [Slides] | Haotian Liu
Coffee Break, Q&A Session
16:00-16:30 | Part 5: MLLM Hallucination [Slides] | Fuxiao Liu
16:30-17:00 | Part 6: MLLM Evaluation [Slides] | Hanwang Zhang
17:00-17:30 | Part 7: MM Reasoning [Slides] | Zhuosheng Zhang
17:30-18:00 | Part 8: Panel Discussion - From MM Generalist to Human-level AI | All + Shuicheng Yan

Tutorial Record

Video

TBD

Literature

Reading List

Architecture and Modality of LLMs and MLLMs

  1. OpenAI, 2022, Introducing ChatGPT
  2. OpenAI, 2023, GPT-4 Technical Report
  3. Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
  4. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  5. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  6. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  7. Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  8. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
  9. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  10. Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
  11. Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
  12. Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  13. Yao, et al., 2024, MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  14. Chen, et al., 2023, MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
  15. Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
  16. Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation
  17. Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
  18. Zhang, et al., 2024, OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
  19. Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  20. Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
  21. Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
  22. Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
  23. Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  24. Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
  25. Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
  26. Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
  27. Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
  28. Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
  29. Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
  30. Wu, et al., 2023, Visual-ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  31. Wang, et al., 2024, ModaVerse: Efficiently Transforming Modalities with LLMs
  32. Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  33. Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
  34. Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
  35. Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
  36. Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
  37. Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
  38. Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  39. Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
  40. Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
  41. Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
  42. Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
  43. Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
  44. Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
  45. Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  46. Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
  47. Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
  48. Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
  49. Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
  50. Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
  51. Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
  52. Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
  53. Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
  54. Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
  55. Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
  56. Frey, et al., 2023, Neural Scaling of Deep Chemical Models
  57. Zhang, et al., 2024, ChemLLM: A Chemical Large Language Model
  58. Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
  59. Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
  60. Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
  61. Koh, et al., 2023, Generating Images with Multimodal Language Models
  62. Sun, et al., 2023, Generative Pretraining in Multimodality
  63. Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
  64. Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
  65. Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  66. Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
  67. Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
  68. Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
  69. Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
  70. Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
  71. Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
  72. Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
  73. Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
  74. Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
  75. Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
  76. Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
  77. Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
  78. Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
  79. Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
  80. Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
  81. Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
  82. Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  83. Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
  84. Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
  85. Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
  86. Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence
  87. Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

Functionality and Recent Advances in MLLMs

  1. OpenAI, 2022, Introducing ChatGPT

Instruction Tuning & Hallucination

  1. Liu, et al., 2023, Visual Instruction Tuning
  2. Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
  3. Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  4. Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
  5. Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
  6. Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
  7. Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
  8. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  9. Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
  10. Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
  11. Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
  12. Yin, et al., 2023, A Survey on Multimodal Large Language Models
  13. Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models

MLLM Evaluation and Benchmarks

  1. OpenAI, 2022, Introducing ChatGPT

Multimodal Reasoning and Agent

  1. Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
  2. Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
  3. Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  4. Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
  5. Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
  6. Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
  7. Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
  8. Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
  9. Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
  10. Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
  11. Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
  12. Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
  13. Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
  14. Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents

Citation

@inproceedings{mllm2024acmmm,
title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond},
author={Fei, Hao and Li, Xiangtai and Liu, Haotian and Liu, Fuxiao and Zhang, Zhuosheng and Zhang, Hanwang and Yan, Shuicheng},
booktitle={Proceedings of the 32nd ACM International Conference on Multimedia: Tutorial Summaries},
pages={1--3},
year={2024}
}


Contact

Contact us

Join and post at our Google Group!
Email the organizers at mllm24@googlegroups.com.