
Multimodal-AND-Large-Language-Models

A Survey of Frontier Research in Multimodal and Large Language Models

This project collects recent research in multimodal and large language models, covering core techniques such as structured knowledge extraction, event extraction, scene graph generation, and attribute recognition. It also examines frontier questions for vision-language models, including reasoning, compositionality, and open-vocabulary recognition, and curates a large collection of surveys and position papers to give researchers a broad overview of the field and pointers to future directions.

Multimodal & Large Language Models

Note: This paper list records papers I read in the daily arXiv digest for personal use. I subscribe to and cover only the following subjects: Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), and Machine Learning (cs.LG). If you find that I missed important and exciting work, it would be very helpful to let me know. Thanks!

Table of Contents

Survey

  • Multimodal Learning with Transformers: A Survey; Peng Xu, Xiatian Zhu, David A. Clifton
  • Multimodal Machine Learning: A Survey and Taxonomy; Tadas Baltrusaitis, Chaitanya Ahuja, Louis-Philippe Morency; Introduces five challenges for multimodal learning: representation, translation, alignment, fusion, and co-learning.
  • Foundations & Recent Trends in Multimodal Machine Learning: Principles, Challenges, & Open Questions; Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
  • Multimodal research in vision and language: A review of current and emerging trends; Shagun Uppal et al
  • Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods; Aditya Mogadala et al
  • Challenges and Prospects in Vision and Language Research; Kushal Kafle et al
  • A Survey of Current Datasets for Vision and Language Research; Francis Ferraro et al
  • VLP: A Survey on Vision-Language Pre-training; Feilong Chen et al
  • A Survey on Multimodal Disinformation Detection; Firoj Alam et al
  • Vision-Language Pre-training: Basics, Recent Advances, and Future Trends; Zhe Gan et al
  • Deep Multimodal Representation Learning: A Survey; Wenzhong Guo et al
  • The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges; Maria Lymperaiou et al
  • Augmented Language Models: a Survey; Grégoire Mialon et al
  • Multimodal Deep Learning; Matthias Aßenmacher et al
  • Sparks of Artificial General Intelligence: Early experiments with GPT-4; Sebastien Bubeck et al
  • Retrieving Multimodal Information for Augmented Generation: A Survey; Ruochen Zhao et al
  • Is Prompt All You Need? No. A Comprehensive and Broader View of Instruction Learning; Renze Lou et al
  • A Survey of Large Language Models; Wayne Xin Zhao et al
  • Tool Learning with Foundation Models; Yujia Qin et al
  • A Cookbook of Self-Supervised Learning; Randall Balestriero et al
  • Foundation Models for Decision Making: Problems, Methods, and Opportunities; Sherry Yang et al
  • Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation; Patrick Fernandes et al
  • Reasoning with Language Model Prompting: A Survey; Shuofei Qiao et al
  • Towards Reasoning in Large Language Models: A Survey; Jie Huang et al
  • Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models; Chen Ling et al
  • Unifying Large Language Models and Knowledge Graphs: A Roadmap; Shirui Pan et al
  • Interactive Natural Language Processing; Zekun Wang et al
  • A Survey on Multimodal Large Language Models; Shukang Yin et al
  • Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment; Yang Liu et al
  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback; Stephen Casper et al
  • Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies; Liangming Pan et al
  • Challenges and Applications of Large Language Models; Jean Kaddour et al
  • Aligning Large Language Models with Human: A Survey; Yufei Wang et al
  • Instruction Tuning for Large Language Models: A Survey; Shengyu Zhang et al
  • From Instructions to Intrinsic Human Values: A Survey of Alignment Goals for Big Models; Jing Yao et al
  • A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation; Xiaowei Huang et al
  • Explainability for Large Language Models: A Survey; Haiyan Zhao et al
  • Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models; Yue Zhang et al
  • Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity; Cunxiang Wang et al
  • ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?; Hailin Chen et al
  • Vision-Language Instruction Tuning: A Review and Analysis; Chen Li et al
  • The Mystery and Fascination of LLMs: A Comprehensive Survey on the Interpretation and Analysis of Emergent Abilities; Yuxiang Zhou et al
  • Efficient Large Language Models: A Survey; Zhongwei Wan et al
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision); Zhengyuan Yang et al
  • Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents; Zhuosheng Zhang et al
  • Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis; Yafei Hu et al
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants; Chunyuan Li et al
  • A Survey on Large Language Model based Autonomous Agents; Lei Wang et al
  • Video Understanding with Large Language Models: A Survey; Yunlong Tang et al
  • A Survey of Preference-Based Reinforcement Learning Methods; Christian Wirth et al
  • AI Alignment: A Comprehensive Survey; Jiaming Ji et al
  • A Survey of Reinforcement Learning from Human Feedback; Timo Kaufmann et al
  • TrustLLM: Trustworthiness in Large Language Models; Lichao Sun et al
  • Agent AI: Surveying the Horizons of Multimodal Interaction; Zane Durante et al
  • Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey; Cedric Colas et al
  • Safety of Multimodal Large Language Models on Images and Text; Xin Liu et al
  • MM-LLMs: Recent Advances in MultiModal Large Language Models; Duzhen Zhang et al
  • Rethinking Interpretability in the Era of Large Language Models; Chandan Singh et al
  • Large Multimodal Agents: A Survey; Junlin Xie et al
  • A Survey on Data Selection for Language Models; Alon Albalak et al
  • What Are Tools Anyway? A Survey from the Language Model Perspective; Zora Zhiruo Wang et al
  • Best Practices and Lessons Learned on Synthetic Data for Language Models; Ruibo Liu et al
  • A Survey on the Memory Mechanism of Large Language Model based Agents; Zeyu Zhang et al
  • A Survey on Self-Evolution of Large Language Models; Zhengwei Tao et al
  • When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models; Xianzheng Ma et al
  • An Introduction to Vision-Language Modeling; Florian Bordes et al
  • Towards Scalable Automated Alignment of LLMs: A Survey; Boxi Cao et al
  • A Survey on Mixture of Experts; Weilin Cai et al
  • The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective; Zhen Qin et al
  • Retrieval-Augmented Generation for Large Language Models: A Survey; Yunfan Gao et al

Position Paper

  • Eight Things to Know about Large Language Models; Samuel R. Bowman et al
  • A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models; Oana Ignat et al
  • Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models; Yuxi Ma et al
  • Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models; Lingxi Xie et al
  • A Path Towards Autonomous Machine Intelligence; Yann LeCun et al
  • GPT-4 Can’t Reason; Konstantine Arkoudas et al
  • Cognitive Architectures for Language Agents; Theodore Sumers et al
  • Large Search Model: Redefining Search Stack in the Era of LLMs; Liang Wang et al
  • ProAgent: From Robotic Process Automation to Agentic Process Automation; Yining Ye et al
  • Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning; Zhiting Hu et al
  • A Roadmap to Pluralistic Alignment; Taylor Sorensen et al
  • Towards Unified Alignment Between Agents, Humans, and Environment; Zonghan Yang et al
  • Video as the New Language for Real-World Decision Making; Sherry Yang et al
  • A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI; Seliem El-Sayed et al

Structure

  • Finding Structural Knowledge in Multimodal-BERT; Victor Milewski et al
  • Going Beyond Nouns With Vision & Language Models Using Synthetic Data; Paola Cascante-Bonilla et al
  • Measuring Progress in Fine-grained Vision-and-Language Understanding; Emanuele Bugliarello et al
  • PV2TEA: Patching Visual Modality to Textual-Established Information Extraction; Hejie Cui et al

Event Extraction

  • Cross-media Structured Common Space for Multimedia Event Extraction; Manling Li et al; Focus on image-text event extraction. A new benchmark and baseline are proposed.
  • Visual Semantic Role Labeling for Video Understanding; Arka Sadhu et al; A new benchmark is proposed.
  • GAIA: A Fine-grained Multimedia Knowledge Extraction System; Manling Li et al; Demo paper. Extract knowledge (relation, event) from multimedia data.
  • MMEKG: Multi-modal Event Knowledge Graph towards Universal Representation across Modalities; Yubo Ma et al
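
The multimedia event extraction task these papers study can be pictured with a minimal data structure: an event has a type, a textual trigger, and arguments that may be grounded in the caption, the image, or both. This is an illustrative sketch only; the event type, role names, and fields are hypothetical and do not follow the schema of any specific benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Argument:
    role: str                                        # semantic role, e.g. "Agent"
    text_span: Optional[str] = None                  # grounding in the caption, if any
    image_box: Optional[Tuple[int, int, int, int]] = None  # grounding in the image, if any

@dataclass
class MultimediaEvent:
    event_type: str
    trigger: str                                     # word(s) in the caption evoking the event
    arguments: List[Argument] = field(default_factory=list)

    def grounded_in_both(self) -> bool:
        """True if some argument is grounded in the text and some in the image."""
        return (any(a.text_span for a in self.arguments)
                and any(a.image_box for a in self.arguments))

# A toy event: the trigger and Agent come from text, the Artifact from the image.
ev = MultimediaEvent(
    event_type="Movement.Transport",
    trigger="carried",
    arguments=[
        Argument("Agent", text_span="soldiers"),
        Argument("Artifact", image_box=(34, 50, 120, 200)),
    ],
)
print(ev.grounded_in_both())  # True
```

The point of the cross-media setting is exactly this last check: an argument missing from one modality may still be recoverable from the other.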

Situation Recognition

  • Situation Recognition: Visual Semantic Role Labeling for Image Understanding; Mark Yatskar et al; Focus on image understanding. Given images, do the semantic role labeling task. No text available. A new benchmark and baseline are proposed.
  • Commonly Uncommon: Semantic Sparsity in Situation Recognition; Mark Yatskar et al; Address the long-tail problem.
  • Grounded Situation Recognition; Sarah Pratt et al
  • Rethinking the Two-Stage Framework for Grounded Situation Recognition; Meng Wei et al
  • Collaborative Transformers for Grounded Situation Recognition; Junhyeong Cho et al
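
In the situation recognition formulation these papers build on, each verb comes with a fixed frame of semantic roles, and a prediction assigns a noun to every role. A minimal sketch, with a made-up one-verb frame lexicon standing in for the real imSitu ontology:

```python
# Toy frame lexicon: each verb maps to its required semantic roles.
FRAMES = {
    "riding": ["agent", "vehicle", "place"],
}

def is_complete(verb: str, realized_roles: dict) -> bool:
    """A situation is complete when every role of the verb's frame
    has been assigned a noun."""
    return set(FRAMES[verb]) <= set(realized_roles)

situation = {
    "verb": "riding",
    "roles": {"agent": "man", "vehicle": "horse", "place": "field"},
}
print(is_complete(situation["verb"], situation["roles"]))  # True
print(is_complete("riding", {"agent": "man"}))             # False
```

Grounded situation recognition (Pratt et al.) extends each role filler with a bounding box, analogous to the `image_box` idea above but applied per role.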

Scene Graph

  • Action Genome: Actions as Composition of Spatio-temporal Scene Graphs; Jingwei Ji et al; Spatio-temporal scene graphs (video).
  • Unbiased Scene Graph Generation from Biased Training; Kaihua Tang et al
  • Visual Distant Supervision for Scene Graph Generation; Yuan Yao et al
  • Learning to Generate Scene Graph from Natural Language Supervision; Yiwu Zhong et al
  • Weakly Supervised Visual Semantic Parsing; Alireza Zareian, Svebor Karaman, Shih-Fu Chang
  • Scene Graph Prediction with Limited Labels; Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, Li Fei-Fei
  • Neural Motifs: Scene Graph Parsing with Global Context; Rowan Zellers et al
  • Fine-Grained Scene Graph Generation with Data Transfer; Ao Zhang et al
  • Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning; Tao He et al
  • Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection; Kaifeng Gao et al; Video.
  • LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation; Xiaoguang Chang et al
  • Transformer-based Image Generation from Scene Graphs; Renato Sortino et al
  • The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation; Lin Li et al
  • Knowledge-augmented Few-shot Visual Relation Detection; Tianyu Yu et al
  • Prototype-based Embedding Network for Scene Graph Generation; Chaofan Zheng et al
  • Unified Visual Relationship Detection with Vision and Language Models; Long Zhao et al
  • Structure-CLIP: Enhance Multi-modal Language Representations with Structure Knowledge; Yufeng Huang et al
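
Scene graph generation work is typically evaluated with Recall@K over (subject, predicate, object) triples: among the model's top-K scored predictions, what fraction of the ground-truth triples appear? A minimal sketch with toy triples (the example scores and labels are invented for illustration):

```python
def recall_at_k(pred_triples, gold_triples, k):
    """SGG-style Recall@K: fraction of gold (subj, pred, obj) triples
    found among the k highest-scoring predictions."""
    topk = {t for t, _ in sorted(pred_triples, key=lambda x: -x[1])[:k]}
    hits = sum(1 for g in gold_triples if g in topk)
    return hits / len(gold_triples)

gold = [("man", "riding", "horse"), ("horse", "on", "grass")]
preds = [
    (("man", "riding", "horse"), 0.9),
    (("man", "near", "horse"), 0.7),
    (("horse", "on", "grass"), 0.6),
    (("grass", "under", "horse"), 0.2),
]
print(recall_at_k(preds, gold, k=3))  # 1.0
print(recall_at_k(preds, gold, k=1))  # 0.5
```

The biased-training line of work (e.g. Tang et al.) observes that frequent predicates like "on" dominate this metric, motivating mean-recall variants averaged per predicate class.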

Attribute

  • COCO Attributes: Attributes for People, Animals, and Objects; Genevieve Patterson et al
  • Human Attribute Recognition by Deep Hierarchical Contexts; Yining Li et al; Attribute prediction in specific domains.
  • Emotion Recognition in Context; Ronak Kosti et al; Attribute prediction in specific domains.
  • The iMaterialist Fashion Attribute Dataset; Sheng Guo et al; Attribute prediction in specific domains.
  • Learning to Predict Visual Attributes in the Wild; Khoi Pham et al
  • Open-vocabulary Attribute Detection; María A. Bravo et al
  • OvarNet: Towards Open-vocabulary Object Attribute Recognition; Keyan Chen et al
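
The open-vocabulary attribute papers above share a common recipe: embed an image region and a set of free-form attribute names into a joint space, then rank attributes by similarity. A minimal sketch with hand-made 3-d vectors standing in for real CLIP-style embeddings (the attribute names and numbers are purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_attributes(region_emb, attribute_embs):
    """Score each candidate attribute name against a region embedding
    and return the names sorted by similarity, highest first."""
    scored = {name: cosine(region_emb, emb) for name, emb in attribute_embs.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy embeddings; in practice these come from a pretrained text/image encoder.
region = [0.9, 0.1, 0.0]
attrs = {"red": [1.0, 0.0, 0.0], "furry": [0.0, 1.0, 0.0], "metallic": [0.0, 0.0, 1.0]}
print(rank_attributes(region, attrs))  # ['red', 'furry', 'metallic']
```

Because the attribute set is just a list of strings scored at inference time, unseen attributes can be added without retraining, which is what makes the setting "open-vocabulary".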

Compositionality

  • CREPE: Can Vision-Language Foundation Models Reason Compositionally?; Zixian Ma et al
  • Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Tristan Thrush et al
  • When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?; Mert Yuksekgonul et al
  • GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering; Drew A. Hudson et al
  • COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images; Ben Bogin et al
  • Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension; Zhenfang Chen et al
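
Winoground, listed above, makes the compositionality test concrete: each example has two images and two captions built from the same words, and a model must match both pairs correctly. Its three metrics can be sketched in a few lines (the 2x2 score matrix below is invented for illustration):

```python
def winoground_scores(s):
    """Winoground metrics for one example with pairs (caption 0, image 0)
    and (caption 1, image 1), where s[c][i] is the model's score for
    caption c with image i.
    text score:  for each image, the correct caption outranks the wrong one;
    image score: for each caption, the correct image outranks the wrong one;
    group score: both conditions hold."""
    text = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    image = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    return text, image, text and image

# Toy scores where the model gets both pairings right.
s = [[0.8, 0.3],
     [0.2, 0.7]]
print(winoground_scores(s))  # (True, True, True)
```

A bag-of-words matcher scores both captions identically on both images, failing all three metrics, which is precisely the failure mode the Yuksekgonul et al. paper above analyzes.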