Awesome-Foundation-Models

A curated resource library of vision and language foundation models

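The Awesome-Foundation-Models project provides a curated list of resources on vision and language foundation models, covering recent research papers, surveys, and open-source code. It spans image, video, and multimodal work, helping researchers and developers track recent progress, follow research trends, and find practical implementations. The repository aims to serve as a broad reference for the AI field.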

A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.

Survey

2024

Towards Vision-Language Geo-Foundation Model: A Survey (from Nanyang Technological University)
An Introduction to Vision-Language Modeling (from Meta)
The Evolution of Multimodal Model Architectures (from Purdue University)
Efficient Multimodal Large Language Models: A Survey (from Tencent)
Foundation Models for Video Understanding: A Survey (from Aalborg University)
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (from GigaAI)
Prospective Role of Foundation Models in Advancing Autonomous Vehicles (from Tongji University)
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (from Northeastern University)
A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (from Lehigh University)
Large Multimodal Agents: A Survey (from CUHK)
The Uncanny Valley: A Comprehensive Analysis of Diffusion Models (from Mila)
Real-World Robot Applications of Foundation Models: A Review (from University of Tokyo)
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities (from Shanghai AI Lab)

Before 2024

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
Towards Generalist Foundation Model for Radiology (from SJTU)
Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
Towards Generalist Biomedical AI (from Google)
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from Oxford)
Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
A Survey on Multimodal Large Language Models (from USTC and Tencent)
Vision-Language Models for Vision Tasks: A Survey (from Nanyang Technological University)
Foundation Models for Generalist Medical Artificial Intelligence (from Stanford)
A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
Vision-language pre-training: Basics, recent advances, and future trends
On the Opportunities and Risks of Foundation Models (this survey first popularized the concept of the foundation model; from Stanford)

Papers by Date

2024

[08/14] Imagen 3 (from Google DeepMind)
[07/31] The Llama 3 Herd of Models (from Meta)
[07/29] SAM 2: Segment Anything in Images and Videos (from Meta)
[07/24] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects (from HUST and ByteDance)
[07/17] EVE: Unveiling Encoder-Free Vision-Language Models (from BAAI)
[07/12] Transformer Layers as Painters (from Sakana AI)
[06/24] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (from NYU)
[06/13] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (from EPFL and Apple)
[06/10] Merlin: A Vision Language Foundation Model for 3D Computed Tomography (from Stanford. Code will be available.)
[06/06] Vision-LSTM: xLSTM as Generic Vision Backbone (from LSTM authors)
[05/31] MeshXL: Neural Coordinate Field for Generative 3D Foundation Models (from Fudan)
[05/22] Attention as an RNN (from Mila & Borealis AI)
[05/22] GigaPath: A whole-slide foundation model for digital pathology from real-world data (published in Nature)
[05/21] BiomedParse: a biomedical foundation model for biomedical image parsing (from Microsoft)
[05/20] Octo: An Open-Source Generalist Robot Policy (from UC Berkeley)
[05/17] Observational Scaling Laws and the Predictability of Language Model Performance (from Stanford)
[05/14] Understanding the performance gap between online and offline alignment algorithms (from Google)
[05/09] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (from Shanghai AI Lab)
[05/08] You Only Cache Once: Decoder-Decoder Architectures for Language Models
[05/07] xLSTM: Extended Long Short-Term Memory (from Sepp Hochreiter, the author of LSTM.)
[05/06] Advancing Multimodal Medical Capabilities of Gemini (from Google)
[05/03] Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
[04/30] KAN: Kolmogorov-Arnold Networks (a promising alternative to MLPs; from MIT)
[04/26] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (InternVL 1.5. from Shanghai AI Lab)
[04/14] TransformerFAM: Feedback attention is working memory (from Google. Efficient attention.)
[04/10] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (from Google)
[04/02] Octopus v2: On-device language model for super agent (from Stanford)
[04/02] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (from Google)
[03/22] InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (from Shanghai AI Lab)
[03/18] Arc2Face: A Foundation Model of Human Faces (from Imperial College London)
[03/14] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (30B parameters. from Apple)
[03/09] uniGradICON: A Foundation Model for Medical Image Registration (from UNC-Chapel Hill)
[03/05] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3. from Stability AI)
[03/01] Learning and Leveraging World Models in Visual Representation Learning (from Meta)
[03/01] VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (from Meituan)
[02/28] CLLMs: Consistency Large Language Models (from SJTU)
[02/27] Transparent Image Layer Diffusion using Latent Transparency (from Stanford)
[02/22] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (from Meta)
[02/21] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (from Meta)
[02/20] Neural Network Diffusion (Generating network parameters via diffusion models. from NUS)
[02/20] VideoPrism: A Foundational Visual Encoder for Video Understanding (from Google)
[02/19] FiT: Flexible Vision Transformer for Diffusion Model (from Shanghai AI Lab)
[02/06] MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (from Meituan)
[01/30] YOLO-World: Real-Time Open-Vocabulary Object
