Awesome-Foundation-Models
A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.
Survey
2024
- Towards Vision-Language Geo-Foundation Model: A Survey (from Nanyang Technological University)
- An Introduction to Vision-Language Modeling (from Meta)
- The Evolution of Multimodal Model Architectures (from Purdue University)
- Efficient Multimodal Large Language Models: A Survey (from Tencent)
- Foundation Models for Video Understanding: A Survey (from Aalborg University)
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (from GigaAI)
- Prospective Role of Foundation Models in Advancing Autonomous Vehicles (from Tongji University)
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (from Northeastern University)
- A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (from Lehigh University)
- Large Multimodal Agents: A Survey (from CUHK)
- The Uncanny Valley: A Comprehensive Analysis of Diffusion Models (from Mila)
- Real-World Robot Applications of Foundation Models: A Review (from University of Tokyo)
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities (from Shanghai AI Lab)
Before 2024
- Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
- Towards Generalist Foundation Model for Radiology (from SJTU)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
- Towards Generalist Biomedical AI (from Google)
- A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from Oxford)
- Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
- A Survey on Multimodal Large Language Models (from USTC and Tencent)
- Vision-Language Models for Vision Tasks: A Survey (from Nanyang Technological University)
- Foundation Models for Generalist Medical Artificial Intelligence (from Stanford)
- A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- Vision-language pre-training: Basics, recent advances, and future trends
- On the Opportunities and Risks of Foundation Models (This survey first popularized the concept of foundation models; from Stanford)
Papers by Date
2024
- [08/14] Imagen 3 (from Google DeepMind)
- [07/31] The Llama 3 Herd of Models (from Meta)
- [07/29] SAM 2: Segment Anything in Images and Videos (from Meta)
- [07/24] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects (from HUST and ByteDance)
- [07/17] EVE: Unveiling Encoder-Free Vision-Language Models (from BAAI)
- [07/12] Transformer Layers as Painters (from Sakana AI)
- [06/24] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (from NYU)
- [06/13] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (from EPFL and Apple)
- [06/10] Merlin: A Vision Language Foundation Model for 3D Computed Tomography (from Stanford. Code will be available.)
- [06/06] Vision-LSTM: xLSTM as Generic Vision Backbone (from LSTM authors)
- [05/31] MeshXL: Neural Coordinate Field for Generative 3D Foundation Models (from Fudan)
- [05/22] Attention as an RNN (from Mila & Borealis AI)
- [05/22] GigaPath: A whole-slide foundation model for digital pathology from real-world data (from Nature)
- [05/21] BiomedParse: a biomedical foundation model for biomedical image parsing (from Microsoft)
- [05/20] Octo: An Open-Source Generalist Robot Policy (from UC Berkeley)
- [05/17] Observational Scaling Laws and the Predictability of Language Model Performance (from Stanford)
- [05/14] Understanding the performance gap between online and offline alignment algorithms (from Google)
- [05/09] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (from Shanghai AI Lab)
- [05/08] You Only Cache Once: Decoder-Decoder Architectures for Language Models
- [05/06] Advancing Multimodal Medical Capabilities of Gemini (from Google)
- [05/07] xLSTM: Extended Long Short-Term Memory (from Sepp Hochreiter, the author of LSTM.)
- [05/03] Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
- [04/30] KAN: Kolmogorov-Arnold Networks (Promising alternatives to MLPs. from MIT)
- [04/26] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (InternVL 1.5. from Shanghai AI Lab)
- [04/14] TransformerFAM: Feedback attention is working memory (from Google. Efficient attention.)
- [04/10] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (from Google)
- [04/02] Octopus v2: On-device language model for super agent (from Stanford)
- [04/02] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (from Google)
- [03/22] InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (from Shanghai AI Lab)
- [03/18] Arc2Face: A Foundation Model of Human Faces (from Imperial College London)
- [03/14] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (30B parameters. from Apple)
- [03/09] uniGradICON: A Foundation Model for Medical Image Registration (from UNC-Chapel Hill)
- [03/05] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3. from Stability AI)
- [03/01] Learning and Leveraging World Models in Visual Representation Learning (from Meta)
- [03/01] VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (from Meituan)
- [02/28] CLLMs: Consistency Large Language Models (from SJTU)
- [02/27] Transparent Image Layer Diffusion using Latent Transparency (from Stanford)
- [02/22] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (from Meta)
- [02/21] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (from Meta)
- [02/20] Neural Network Diffusion (Generating network parameters via diffusion models. from NUS)
- [02/20] VideoPrism: A Foundational Visual Encoder for Video Understanding (from Google)
- [02/19] FiT: Flexible Vision Transformer for Diffusion Model (from Shanghai AI Lab)
- [02/06] MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (from Meituan)
- [01/30] [YOLO-World: Real-Time Open-Vocabulary Object