Multimodal & Large Language Models
Note: This paper list only records papers I read in the daily arXiv feed for personal reference. I subscribe to and cover only the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find that I missed important and exciting work, it would be super helpful to let me know. Thanks!
Table of Contents
- Survey
- Position Paper
- Structure
- Planning
- Reasoning
- Generation
- Representation Learning
- LLM Analysis
- LLM Safety
- LLM Evaluation
- LLM Reasoning
- LLM Application
- LLM with Memory
- LLM with Human
- LLM Foundation
- RAG
- Scaling Law
- LLM Data Engineering
- VLM Data Engineering
- Alignment
- Scalable Oversight & SuperAlignment
- RL Foundation
- Beyond Bandit
- Agent
- Interaction
- Critique Modeling
- MoE/Specialized
- Vision-Language Foundation Model
- Vision-Language Model Analysis & Evaluation
- Vision-Language Model Application
- Multimodal Foundation Model
- Image Generation
- Diffusion
- Document Understanding
- Tool Learning
- Instruction Tuning
- In-context Learning
- Learning from Feedback
- Video Foundation Model
- Key Frame Detection
- Pretraining
- Vision Model
- Adaptation of Foundation Model
- Prompting
- Efficiency
- Analysis
- Grounding
- VQA Task
- VQA Dataset
- Social Good
- Application
- Benchmark & Evaluation
- Dataset
- Robustness
- Hallucination & Factuality
- Cognitive Neuroscience & Machine Learning
- Theory of Mind
- Cognitive Neuroscience
- World Model
- Resource
Survey
- Multimodal Learning with Transformers: A Survey; Peng Xu, Xiatian Zhu, David A. Clifton
- Multimodal Machine Learning: A Survey and Taxonomy; Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency; Introduces five core challenges for multimodal learning: representation, translation, alignment, fusion, and co-learning.
- Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions; Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
- Multimodal research in vision and language: A review of current and emerging trends; Shagun Uppal et al
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods; Aditya Mogadala et al
- Challenges and Prospects in Vision and Language Research; Kushal Kafle et al
- A Survey of Current Datasets for Vision and Language Research; Francis Ferraro et al
- VLP: A Survey on Vision-Language Pre-training; Feilong Chen et al
- A Survey on Multimodal Disinformation Detection; Firoj Alam et al
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends; Zhe Gan et al
- Deep Multimodal Representation Learning: A Survey; Wenzhong Guo et al
- The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges; Maria Lymperaiou et al
- Augmented Language Models: a Survey; Grégoire Mialon et al
- Multimodal Deep Learning; Matthias Aßenmacher et al
- Sparks of Artificial General Intelligence: Early experiments with GPT-4; Sebastien Bubeck et al
- Retrieving Multimodal Information for Augmented Generation: A Survey; Ruochen Zhao et al
- Is Prompt All You Need? No. A Comprehensive and Broader View of Instruction Learning; Renze Lou et al
- A Survey of Large Language Models; Wayne Xin Zhao et al
- Tool Learning with Foundation Models; Yujia Qin et al
- A Cookbook of Self-Supervised Learning; Randall Balestriero et al
- Foundation Models for Decision Making: Problems, Methods, and Opportunities; Sherry Yang et al
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation; Patrick Fernandes et al
- Reasoning with Language Model Prompting: A Survey; Shuofei Qiao et al
- Towards Reasoning in Large Language Models: A Survey; Jie Huang et al
- Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models; Chen Ling et al
- Unifying Large Language Models and Knowledge Graphs: A Roadmap; Shirui Pan et al
- Interactive Natural Language Processing; Zekun Wang et al
- A Survey on Multimodal Large Language Models; Shukang Yin et al
- Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment; Yang Liu et al
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback; Stephen Casper et al
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies; Liangming Pan et al
- Challenges and Applications of Large Language Models; Jean Kaddour et al
- Aligning Large Language Models with Human: A Survey; Yufei Wang et al
- Instruction Tuning for Large Language Models: A Survey; Shengyu Zhang et al
- From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models; Jing Yao et al
- A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation; Xiaowei Huang et al
- Explainability for Large Language Models: A Survey; Haiyan Zhao et al
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models; Yue Zhang et al
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity; Cunxiang Wang et al
- ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?; Hailin Chen et al
- Vision-Language Instruction Tuning: A Review and Analysis; Chen Li et al
- The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities; Yuxiang Zhou et al
- Efficient Large Language Models: A Survey; Zhongwei Wan et al
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision); Zhengyuan Yang et al
- Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents; Zhuosheng Zhang et al
- Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis; Yafei Hu et al
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants; Chunyuan Li et al
- A Survey on Large Language Model based Autonomous Agents; Lei Wang et al
- Video Understanding with Large Language Models: A Survey; Yunlong Tang et al
- A Survey of Preference-Based Reinforcement Learning Methods; Christian Wirth et al
- AI Alignment: A Comprehensive Survey; Jiaming Ji et al
- A Survey of Reinforcement Learning from Human Feedback; Timo Kaufmann et al
- TrustLLM: Trustworthiness in Large Language Models; Lichao Sun et al
- Agent AI: Surveying the Horizons of Multimodal Interaction; Zane Durante et al
- Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey; Cedric Colas et al
- Safety of Multimodal Large Language Models on Images and Text; Xin Liu et al
- MM-LLMs: Recent Advances in MultiModal Large Language Models; Duzhen Zhang et al
- Rethinking Interpretability in the Era of Large Language Models; Chandan Singh et al
- Large Multimodal Agents: A Survey; Junlin Xie et al
- A Survey on Data Selection for Language Models; Alon Albalak et al
- What Are Tools Anyway? A Survey from the Language Model Perspective; Zora Zhiruo Wang et al
- Best Practices and Lessons Learned on Synthetic Data for Language Models; Ruibo Liu et al
- A Survey on the Memory Mechanism of Large Language Model based Agents; Zeyu Zhang et al
- A Survey on Self-Evolution of Large Language Models; Zhengwei Tao et al
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models; Xianzheng Ma et al
- An Introduction to Vision-Language Modeling; Florian Bordes et al
- Towards Scalable Automated Alignment of LLMs: A Survey; Boxi Cao et al
- A Survey on Mixture of Experts; Weilin Cai et al
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective; Zhen Qin et al
- Retrieval-Augmented Generation for Large Language Models: A Survey; Yunfan Gao et al
Position Paper
- Eight Things to Know about Large Language Models; Samuel R. Bowman et al
- A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models; Oana Ignat et al
- Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models; Yuxi Ma et al
- Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models; Lingxi Xie et al
- A Path Towards Autonomous Machine Intelligence; Yann LeCun et al
- GPT-4 Can’t Reason; Konstantine Arkoudas et al
- Cognitive Architectures for Language Agents; Theodore Sumers et al
- Large Search Model: Redefining Search Stack in the Era of LLMs; Liang Wang et al
- ProAgent: From Robotic Process Automation to Agentic Process Automation; Yining Ye et al
- Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning; Zhiting Hu et al
- A Roadmap to Pluralistic Alignment; Taylor Sorensen et al
- Towards Unified Alignment Between Agents, Humans, and Environment; Zonghan Yang et al
- Video as the New Language for Real-World Decision Making; Sherry Yang et al
- A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI; Seliem El-Sayed et al
Structure
- Finding Structural Knowledge in Multimodal-BERT; Victor Milewski et al
- Going Beyond Nouns With Vision & Language Models Using Synthetic Data; Paola Cascante-Bonilla et al
- Measuring Progress in Fine-grained Vision-and-Language Understanding; Emanuele Bugliarello et al
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction; Hejie Cui et al
Event Extraction
- Cross-media Structured Common Space for Multimedia Event Extraction; Manling Li et al; Focuses on image-text event extraction; a new benchmark and baseline are proposed (the shared output structure is sketched after this list).
- Visual Semantic Role Labeling for Video Understanding; Arka Sadhu et al; A new benchmark is proposed.
- GAIA: A Fine-grained Multimedia Knowledge Extraction System; Manling Li et al; Demo paper; extracts knowledge (relations, events) from multimedia data.
- MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities; Yubo Ma et al
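
The papers above all target a similar output structure: an event record with a type, a text trigger, and role-typed arguments that may be grounded in either modality. Below is a minimal sketch of such a record; the field names, event type, and role labels are illustrative assumptions, not taken from any of these papers' code:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                                      # e.g., "Attacker", "Place" (illustrative)
    text_span: Optional[str] = None                # surface form in the article, if textual
    image_bbox: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in the image, if visual

@dataclass
class MultimediaEvent:
    event_type: str                                # e.g., "Conflict.Attack" (illustrative)
    trigger: str                                   # triggering word in the text
    arguments: List[Argument] = field(default_factory=list)

# One event whose arguments are grounded in different modalities:
# the attacker is mentioned in text, the place is only visible in the image.
event = MultimediaEvent(
    event_type="Conflict.Attack",
    trigger="struck",
    arguments=[
        Argument(role="Attacker", text_span="the militia"),
        Argument(role="Place", image_bbox=(40, 60, 420, 310)),
    ],
)
```
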
Situation Recognition
- Situation Recognition: Visual Semantic Role Labeling for Image Understanding; Mark Yatskar et al; Focuses on image understanding: given an image, perform semantic role labeling without any accompanying text. A new benchmark and baseline are proposed (the frame structure is sketched after this list).
- Commonly Uncommon: Semantic Sparsity in Situation Recognition; Mark Yatskar et al; Addresses the long-tail semantic sparsity problem.
- Grounded Situation Recognition; Sarah Pratt et al
- Rethinking the Two-Stage Framework for Grounded Situation Recognition; Meng Wei et al
- Collaborative Transformers for Grounded Situation Recognition; Junhyeong Cho et al
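
For orientation, a situation frame pairs one activity verb with a fixed set of semantic roles, and the grounded variants above additionally localize each role filler with a box. A minimal sketch follows; the verb, role, and noun names are illustrative (real role inventories are derived from resources such as FrameNet, as in the imSitu benchmark):

```python
from typing import Dict, List, Optional, Tuple

Box = Optional[Tuple[int, int, int, int]]  # (x1, y1, x2, y2); None if not localized

# A grounded situation frame: one activity verb plus role -> (noun, box) fillers.
frame: Dict = {
    "verb": "riding",
    "roles": {
        "agent": ("man", (12, 30, 180, 400)),
        "vehicle": ("horse", (90, 120, 380, 460)),
        "place": ("beach", None),  # role is filled but not localized as a box
    },
}

def visible_roles(situation: Dict) -> List[str]:
    """Return the roles whose fillers are localized in the image."""
    return [role for role, (_, box) in situation["roles"].items() if box is not None]

print(visible_roles(frame))  # ['agent', 'vehicle']
```
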
Scene Graph
- Action Genome: Actions as Composition of Spatio-temporal Scene Graphs; Jingwei Ji et al; Spatio-temporal scene graphs for video (a minimal graph structure is sketched after this list).
- Unbiased Scene Graph Generation from Biased Training; Kaihua Tang et al
- Visual Distant Supervision for Scene Graph Generation; Yuan Yao et al
- Learning to Generate Scene Graph from Natural Language Supervision; Yiwu Zhong et al
- Weakly Supervised Visual Semantic Parsing; Alireza Zareian, Svebor Karaman, Shih-Fu Chang
- Scene Graph Prediction with Limited Labels; Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, Li Fei-Fei
- Neural Motifs: Scene Graph Parsing with Global Context; Rowan Zellers et al
- Fine-Grained Scene Graph Generation with Data Transfer; Ao Zhang et al
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning; Tao He et al
- Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection; Kaifeng Gao et al; Video.
- LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation; Xiaoguang Chang et al
- Transformer-based Image Generation from Scene Graphs; Renato Sortino et al
- The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation; Lin Li et al
- Knowledge-augmented Few-shot Visual Relation Detection; Tianyu Yu et al
- Prototype-based Embedding Network for Scene Graph Generation; Chaofan Zheng et al
- Unified Visual Relationship Detection with Vision and Language Models; Long Zhao et al
- Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge; Yufeng Huang et al
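
Common to most of these papers, a scene graph is a set of detected objects plus subject-predicate-object triples over them; the spatio-temporal variant (Action Genome) additionally indexes relations by frame. A minimal sketch, assuming illustrative object and predicate labels and a hypothetical `frame` field for the video case:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SceneObject:
    label: str                           # e.g., "person" (illustrative)
    bbox: Tuple[int, int, int, int]      # (x1, y1, x2, y2)

@dataclass(frozen=True)
class Relation:
    subject: SceneObject
    predicate: str                       # e.g., "holding" (illustrative)
    obj: SceneObject
    frame: int = 0                       # frame index; stays 0 for still images

person = SceneObject("person", (10, 20, 200, 450))
cup = SceneObject("cup", (150, 300, 190, 360))

# A tiny spatio-temporal scene graph: the same relation tracked over two frames.
graph = [
    Relation(person, "holding", cup, frame=0),
    Relation(person, "holding", cup, frame=1),
]
```
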
Attribute
- COCO Attributes: Attributes for People, Animals, and Objects; Genevieve Patterson et al
- Human Attribute Recognition by Deep Hierarchical Contexts; Yining Li et al; Attribute prediction in specific domains.
- Emotion Recognition in Context; Ronak Kosti et al; Attribute prediction in specific domains.
- The iMaterialist Fashion Attribute Dataset; Sheng Guo et al; Attribute prediction in specific domains.
- Learning to Predict Visual Attributes in the Wild; Khoi Pham et al
- Open-vocabulary Attribute Detection; María A. Bravo et al
- OvarNet: Towards Open-vocabulary Object Attribute Recognition; Keyan Chen et al
Compositionality
- CREPE: Can Vision-Language Foundation Models Reason Compositionally?; Zixian Ma et al
- Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Tristan Thrush et al
- When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?; Mert Yuksekgonul et al
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering; Drew A. Hudson et al
- COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images; Ben Bogin et al
- Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension; Zhenfang Chen et al
- **Do