MLLM Tutorial @ LREC-COLING 2024

From Multimodal LLM to Human-level AI:

Modality, Instruction, Reasoning, Efficiency and Beyond

About

MLLM Tutorial

Welcome to the MLLM Tutorial series at LREC-COLING 2024!

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, reflecting an unprecedented push toward human-level AI via MLLMs. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on four key areas: MLLM architecture design, instructional learning, multimodal reasoning, and the efficiency of MLLMs. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.

Organizer

Presenters

https://haofei.vip/

Hao Fei

National University of Singapore

Yuan Yao

National University of Singapore

Zhuosheng Zhang

Shanghai Jiao Tong University

Fuxiao Liu

University of Maryland, College Park

Ao Zhang

National University of Singapore

Tat-Seng Chua

National University of Singapore

Schedule

PROGRAM

Our tutorial will be held on Tuesday, May 21, 2024 (all times are in UTC+2, Torino local time).

The video recording of our tutorial (Parts 1-3 only, due to a technical issue) is posted on YouTube; visit here to watch.

Time	Section	Presenter
14:00-14:10	Part 1: Background and Introduction [Slides]	Hao Fei
14:10-15:40	Part 2: MLLM Design: Architecture and Modality [Slides]	Hao Fei & Yuan Yao
15:40-16:00	Part 3: Multimodal Instruction Tuning in MLLMs [Slides]	Fuxiao Liu
16:00-16:30	Coffee Break, Q&A Session
16:30-16:50	Part 3 (Cont'd): Multimodal Instruction Tuning in MLLMs	Fuxiao Liu
16:50-17:30	Part 4: Multimodal Reasoning in MLLMs [Slides]	Zhuosheng Zhang
17:30-18:00	Part 5: MLLM Efficiency [Slides]	Ao Zhang

Literature

Reading List

Section 1: LLMs and MLLMs

OpenAI, 2022, Introducing ChatGPT
OpenAI, 2023, GPT-4 Technical Report
Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
Wang, et al., 2024, ModaVerse: Efficiently Transforming Modalities with LLMs
Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
Frey, et al., 2023, Neural Scaling of Deep Chemical Models
Zhang, et al., 2024, ChemLLM: A Chemical Large Language Model
Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
Koh, et al., 2023, Generating Images with Multimodal Language Models
Sun, et al., 2023, Generative Pretraining in Multimodality
Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
Thagard, et al., 1997, Abductive Reasoning: Logic, Visual Thinking, and Coherence
Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

Section 2: Instruction Tuning

Liu, et al., 2023, Visual Instruction Tuning
Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Yin, et al., 2023, A Survey on Multimodal Large Language Models
Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models

Section 3: Reasoning with LLM

Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Prystawski, et al., 2023, Why Think Step by Step? Reasoning Emerges from the Locality of Experience
Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

Section 4: Efficient Learning

Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models
Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Yao, et al., 2024, MiniCPM-V
DeepSpeed Team, 2020, DeepSpeed Blog
Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Chen, et al., 2023, MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation

Citation

Citation

@inproceedings{fei2024multimodal,
title={From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning, Efficiency and Beyond},
author={Fei, Hao and Yao, Yuan and Zhang, Zhuosheng and Liu, Fuxiao and Zhang, Ao and Chua, Tat-Seng},
booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries},
pages={1--8},
year={2024}
}

Contact

Contact us

Join and post at our Google Group!

Email the organizers at mllm24@googlegroups.com.