Multimodal & Large Language Models
Note: This paper list only records papers I read in the daily arXiv feed for personal reference. I subscribe to and cover only the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find that I missed important and exciting work, it would be super helpful to let me know. Thanks!
Table of Contents
- Survey
- Position Paper
- Structure
- Planning
- Reasoning
- Generation
- Representation Learning
- LLM Analysis
- LLM Safety
- LLM Evaluation
- LLM Reasoning
- LLM Application
- LLM with Memory
- LLM with Human
- LLM Foundation
- RAG
- Scaling Law
- LLM Data Engineering
- VLM Data Engineering
- Alignment
- Scalable Oversight & SuperAlignment
- RL Foundation
- Beyond Bandit
- Agent
- Interaction
- Critique Modeling
- MoE/Specialized
- Vision-Language Foundation Model
- Vision-Language Model Analysis & Evaluation
- Vision-Language Model Application
- Multimodal Foundation Model
- Image Generation
- Diffusion
- Document Understanding
- Tool Learning
- Instruction Tuning
- In-context Learning
- Learning from Feedback
- Video Foundation Model
- Key Frame Detection
- Pretraining
- Vision Model
- Adaptation of Foundation Model
- Prompting
- Efficiency
- Analysis
- Grounding
- VQA Task
- VQA Dataset
- Social Good
- Application
- Benchmark & Evaluation
- Dataset
- Robustness
- Hallucination & Factuality
- Cognitive Neuroscience & Machine Learning
- Theory of Mind
- Cognitive Neuroscience
- World Model
- Resource
Survey
- Multimodal Learning with Transformers: A Survey; Peng Xu, Xiatian Zhu, David A. Clifton
- Multimodal Machine Learning: A Survey and Taxonomy; Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency; Introduces five core challenges for multimodal learning: representation, translation, alignment, fusion, and co-learning.
- Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions; Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
- Multimodal research in vision and language: A review of current and emerging trends; Shagun Uppal et al
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods; Aditya Mogadala et al
- Challenges and Prospects in Vision and Language Research; Kushal Kafle et al
- A Survey of Current Datasets for Vision and Language Research; Francis Ferraro et al
- VLP: A Survey on Vision-Language Pre-training; Feilong Chen et al
- A Survey on Multimodal Disinformation Detection; Firoj Alam et al
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends; Zhe Gan et al
- Deep Multimodal Representation Learning: A Survey; Wenzhong Guo et al
- The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges; Maria Lymperaiou et al
- Augmented Language Models: a Survey; Grégoire Mialon et al
- Multimodal Deep Learning; Matthias Aßenmacher et al
- Sparks of Artificial General Intelligence: Early experiments with GPT-4; Sebastien Bubeck et al
- Retrieving Multimodal Information for Augmented Generation: A Survey; Ruochen Zhao et al
- Is Prompt All You Need? No. A Comprehensive and Broader View of Instruction Learning; Renze Lou et al
- A Survey of Large Language Models; Wayne Xin Zhao et al
- Tool Learning with Foundation Models; Yujia Qin et al
- A Cookbook of Self-Supervised Learning; Randall Balestriero et al
- Foundation Models for Decision Making: Problems, Methods, and Opportunities; Sherry Yang et al
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation; Patrick Fernandes et al
- Reasoning with Language Model Prompting: A Survey; Shuofei Qiao et al
- Towards Reasoning in Large Language Models: A Survey; Jie Huang et al
- Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models; Chen Ling et al
- Unifying Large Language Models and Knowledge Graphs: A Roadmap; Shirui Pan et al
- Interactive Natural Language Processing; Zekun Wang et al
- A Survey on Multimodal Large Language Models; Shukang Yin et al
- Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment; Yang Liu et al
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback; Stephen Casper et al
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies; Liangming Pan et al
- Challenges and Applications of Large Language Models; Jean Kaddour et al
- Aligning Large Language Models with Human: A Survey; Yufei Wang et al
- Instruction Tuning for Large Language Models: A Survey; Shengyu Zhang et al
- From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models; Jing Yao et al
- A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation; Xiaowei Huang et al
- Explainability for Large Language Models: A Survey; Haiyan Zhao et al
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models; Yue Zhang et al
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity; Cunxiang Wang et al
- ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?; Hailin Chen et al
- Vision-Language Instruction Tuning: A Review and Analysis; Chen Li et al
- The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities; Yuxiang Zhou et al
- Efficient Large Language Models: A Survey; Zhongwei Wan et al
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision); Zhengyuan Yang et al
- Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents; Zhuosheng Zhang et al
- Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis; Yafei Hu et al
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants; Chunyuan Li et al
- A Survey on Large Language Model based Autonomous Agents; Lei Wang et al
- Video Understanding with Large Language Models: A Survey; Yunlong Tang et al
- A Survey of Preference-Based Reinforcement Learning Methods; Christian Wirth et al
- AI Alignment: A Comprehensive Survey; Jiaming Ji et al
- A Survey of Reinforcement Learning from Human Feedback; Timo Kaufmann et al
- TrustLLM: Trustworthiness in Large Language Models; Lichao Sun et al
- Agent AI: Surveying the Horizons of Multimodal Interaction; Zane Durante et al
- Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey; Cedric Colas et al
- Safety of Multimodal Large Language Models on Images and Text; Xin Liu et al
- MM-LLMs: Recent Advances in MultiModal Large Language Models; Duzhen Zhang et al
- Rethinking Interpretability in the Era of Large Language Models; Chandan Singh et al
- Large Multimodal Agents: A Survey; Junlin Xie et al
- A Survey on Data Selection for Language Models; Alon Albalak et al
- What Are Tools Anyway? A Survey from the Language Model Perspective; Zora Zhiruo Wang et al
- Best Practices and Lessons Learned on Synthetic Data for Language Models; Ruibo Liu et al
- A Survey on the Memory Mechanism of Large Language Model based Agents; Zeyu Zhang et al
- A Survey on Self-Evolution of Large Language Models; Zhengwei Tao et al
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models; Xianzheng Ma et al
- An Introduction to Vision-Language Modeling; Florian Bordes et al
- Towards Scalable Automated Alignment of LLMs: A Survey; Boxi Cao et al
- A Survey on Mixture of Experts; Weilin Cai et al
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective; Zhen Qin et al
- Retrieval-Augmented Generation for Large Language Models: A Survey; Yunfan Gao et al
Position Paper
- Eight Things to Know about Large Language Models; Samuel R. Bowman et al
- A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models; Oana Ignat et al
- Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models; Yuxi Ma et al
- Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models; Lingxi Xie et al
- A Path Towards Autonomous Machine Intelligence; Yann LeCun et al
- GPT-4 Can’t Reason; Konstantine Arkoudas et al
- Cognitive Architectures for Language Agents; Theodore Sumers et al
- Large Search Model: Redefining Search Stack in the Era of LLMs; Liang Wang et al
- ProAgent: From Robotic Process Automation to Agentic Process Automation; Yining Ye et al
- Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning; Zhiting Hu et al
- A Roadmap to Pluralistic Alignment; Taylor Sorensen et al
- Towards Unified Alignment Between Agents, Humans, and Environment; Zonghan Yang et al
- Video as the New Language for Real-World Decision Making; Sherry Yang et al
- A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI; Seliem El-Sayed et al
Structure
- Finding Structural Knowledge in Multimodal-BERT; Victor Milewski et al
- Going Beyond Nouns With Vision & Language Models Using Synthetic Data; Paola Cascante-Bonilla et al
- Measuring Progress in Fine-grained Vision-and-Language Understanding; Emanuele Bugliarello et al
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction; Hejie Cui et al
Event Extraction
- Cross-media Structured Common Space for Multimedia Event Extraction; Manling Li et al; Focuses on image-text event extraction; a new benchmark and baseline are proposed (the shared output structure is sketched after this list).
- Visual Semantic Role Labeling for Video Understanding; Arka Sadhu et al; A new benchmark is proposed.
- GAIA: A Fine-grained Multimedia Knowledge Extraction System; Manling Li et al; Demo paper; extracts knowledge (relations, events) from multimedia data.
- MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities; Yubo Ma et al
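
The papers above all target a similar output structure: an event record with a type, a text trigger, and role-typed arguments that may be grounded in either modality. Below is a minimal sketch of such a record; the field names, event type, and role labels are illustrative assumptions, not taken from any of these papers' code:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                                      # e.g., "Attacker", "Place" (illustrative)
    text_span: Optional[str] = None                # surface form in the article, if textual
    image_bbox: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in the image, if visual

@dataclass
class MultimediaEvent:
    event_type: str                                # e.g., "Conflict.Attack" (illustrative)
    trigger: str                                   # triggering word in the text
    arguments: List[Argument] = field(default_factory=list)

# One event whose arguments are grounded in different modalities:
# the attacker is mentioned in text, the place is only visible in the image.
event = MultimediaEvent(
    event_type="Conflict.Attack",
    trigger="struck",
    arguments=[
        Argument(role="Attacker", text_span="the militia"),
        Argument(role="Place", image_bbox=(40, 60, 420, 310)),
    ],
)
```
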
Situation Recognition
- Situation Recognition: Visual Semantic Role Labeling for Image Understanding; Mark Yatskar et al; Focuses on image understanding: given an image, perform semantic role labeling without any accompanying text. A new benchmark and baseline are proposed (the frame structure is sketched after this list).
- Commonly Uncommon: Semantic Sparsity in Situation Recognition; Mark Yatskar et al; Addresses the long-tail semantic sparsity problem.
- Grounded Situation Recognition; Sarah Pratt et al
- Rethinking the Two-Stage Framework for Grounded Situation Recognition; Meng Wei et al
- Collaborative Transformers for Grounded Situation Recognition; Junhyeong Cho et al
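
For orientation, a situation frame pairs one activity verb with a fixed set of semantic roles, and the grounded variants above additionally localize each role filler with a box. A minimal sketch follows; the verb, role, and noun names are illustrative (real role inventories are derived from resources such as FrameNet, as in the imSitu benchmark):

```python
from typing import Dict, List, Optional, Tuple

Box = Optional[Tuple[int, int, int, int]]  # (x1, y1, x2, y2); None if not localized

# A grounded situation frame: one activity verb plus role -> (noun, box) fillers.
frame: Dict = {
    "verb": "riding",
    "roles": {
        "agent": ("man", (12, 30, 180, 400)),
        "vehicle": ("horse", (90, 120, 380, 460)),
        "place": ("beach", None),  # role is filled but not localized as a box
    },
}

def visible_roles(situation: Dict) -> List[str]:
    """Return the roles whose fillers are localized in the image."""
    return [role for role, (_, box) in situation["roles"].items() if box is not None]

print(visible_roles(frame))  # ['agent', 'vehicle']
```
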
Scene Graph
- Action Genome: Actions as Composition of Spatio-temporal Scene Graphs; Jingwei Ji et al; Spatio-temporal scene graphs for video (a minimal graph structure is sketched after this list).
- Unbiased Scene Graph Generation from Biased Training; Kaihua Tang et al
- Visual Distant Supervision for Scene Graph Generation; Yuan Yao et al
- Learning to Generate Scene Graph from Natural Language Supervision; Yiwu Zhong et al
- Weakly Supervised Visual Semantic Parsing; Alireza Zareian, Svebor Karaman, Shih-Fu Chang
- Scene Graph Prediction with Limited Labels; Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, Li Fei-Fei
- Neural Motifs: Scene Graph Parsing with Global Context; Rowan Zellers et al
- Fine-Grained Scene Graph Generation with Data Transfer; Ao Zhang et al
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning; Tao He et al
- Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection; Kaifeng Gao et al; Video.
- LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation; Xiaoguang Chang et al
- Transformer-based Image Generation from Scene Graphs; Renato Sortino et al
- The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation; Lin Li et al
- Knowledge-augmented Few-shot Visual Relation Detection; Tianyu Yu et al
- Prototype-based Embedding Network for Scene Graph Generation; Chaofan Zheng et al
- Unified Visual Relationship Detection with Vision and Language Models; Long Zhao et al
- Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge; Yufeng Huang et al
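
Common to most of these papers, a scene graph is a set of detected objects plus subject-predicate-object triples over them; the spatio-temporal variant (Action Genome) additionally indexes relations by frame. A minimal sketch, assuming illustrative object and predicate labels and a hypothetical `frame` field for the video case:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SceneObject:
    label: str                           # e.g., "person" (illustrative)
    bbox: Tuple[int, int, int, int]      # (x1, y1, x2, y2)

@dataclass(frozen=True)
class Relation:
    subject: SceneObject
    predicate: str                       # e.g., "holding" (illustrative)
    obj: SceneObject
    frame: int = 0                       # frame index; stays 0 for still images

person = SceneObject("person", (10, 20, 200, 450))
cup = SceneObject("cup", (150, 300, 190, 360))

# A tiny spatio-temporal scene graph: the same relation tracked over two frames.
graph = [
    Relation(person, "holding", cup, frame=0),
    Relation(person, "holding", cup, frame=1),
]
```
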
Attribute
- COCO Attributes: Attributes for People, Animals, and Objects; Genevieve Patterson et al
- Human Attribute Recognition by Deep Hierarchical Contexts; Yining Li et al; Attribute prediction in specific domains.
- Emotion Recognition in Context; Ronak Kosti et al; Attribute prediction in specific domains.
- The iMaterialist Fashion Attribute Dataset; Sheng Guo et al; Attribute prediction in specific domains.
- Learning to Predict Visual Attributes in the Wild; Khoi Pham et al
- Open-vocabulary Attribute Detection; María A. Bravo et al
- OvarNet: Towards Open-vocabulary Object Attribute Recognition; Keyan Chen et al
Compositionality
- CREPE: Can Vision-Language Foundation Models Reason Compositionally?; Zixian Ma et al
- Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Tristan Thrush et al
- When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?; Mert Yuksekgonul et al
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering; Drew A. Hudson et al
- COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images; Ben Bogin et al
- Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension; Zhenfang Chen et al
- **Do