From Multimodal LLM to Human-level AI:

Modality, Instruction, Reasoning, Efficiency and Beyond

About

MLLM Tutorial

Welcome to the MLLM Tutorial series at LREC-COLING 2024!

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. Multimodal large language models (MLLMs), a multidisciplinary research area, have recently garnered growing interest in both academia and industry, marking an unprecedented push toward human-level AI. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial delivers a comprehensive review of cutting-edge MLLM research, focusing on four key areas: MLLM architecture design, instruction tuning, multimodal reasoning, and MLLM efficiency. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.

Organizers

Presenters

Hao Fei (https://haofei.vip/)

National University of Singapore

Yuan Yao

National University of Singapore

Zhuosheng Zhang

Shanghai Jiao Tong University

Fuxiao Liu

University of Maryland, College Park

Ao Zhang

National University of Singapore

Tat-seng Chua

National University of Singapore

Schedule

PROGRAM

TBD

Literature

Reading List

Section 1: LLMs and MLLMs

  1. OpenAI, 2023, Introducing ChatGPT
  2. OpenAI, 2023, GPT-4 Technical Report
  3. Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
  4. Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  5. Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  6. Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  7. Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  8. Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
  9. Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
  10. Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
  11. Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  12. Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
  13. Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  14. Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
  15. Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
  16. Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Section 2: Instruction Tuning

  1. Liu, et al., 2023, Visual Instruction Tuning
  2. Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
  3. Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  4. Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
  5. Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
  6. Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
  7. Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

Section 3: Reasoning with LLMs

  1. Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
  2. Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
  3. Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  4. Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents

Section 4: Efficient Learning

  1. Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models (see the brief sketch after this list)
  2. Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
  3. Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
  4. Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
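
For readers less familiar with the parameter-efficient tuning papers above, the following is a minimal, illustrative sketch of the low-rank adaptation idea from LoRA (item 1): the pretrained weight is kept frozen and only a small pair of low-rank factors is trained. The class and parameter names below are our own assumptions for illustration, not code from the paper or from any system listed here.

    # Illustrative LoRA sketch; names are assumptions, not the paper's reference code.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer augmented with a small trainable low-rank update."""

        def __init__(self, in_features, out_features, rank=8, alpha=16.0):
            super().__init__()
            # Pretrained weight: kept frozen during fine-tuning.
            self.base = nn.Linear(in_features, out_features, bias=False)
            self.base.weight.requires_grad = False
            # Trainable low-rank factors: A is Gaussian-initialized, B starts at zero,
            # so the adapted layer initially matches the frozen one.
            self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
            self.scaling = alpha / rank

        def forward(self, x):
            # Effective weight is W + scaling * (B @ A), computed without materializing it.
            return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

    # Usage: wrap a 768-dim projection; only the ~2 * 768 * 8 low-rank parameters train.
    layer = LoRALinear(768, 768, rank=8)
    out = layer(torch.randn(2, 768))

QLoRA (item 2) follows the same recipe but additionally quantizes the frozen base weights to 4 bits to reduce fine-tuning memory.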

Contact

Contact us

Join and post at our Google Group!
Email the organizers at mllm24@googlegroups.com.