Awesome Foundation Model Leaderboard is a curated list of foundation model leaderboards (for an explanation of what a leaderboard is, please refer to this post), along with various development tools and evaluation organizations, compiled according to our survey:
On the Workflows and Smells of Leaderboard Operations (LBOps):
An Exploratory Study of Foundation Model Leaderboards
Zhimin (Jimmy) Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
If you find this repository useful, please consider giving us a star :star: and citation:
@article{zhao2024workflows,
title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
journal={arXiv preprint arXiv:2407.04065},
year={2024}
}
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
If you want to contribute to this list (please do), feel free to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, please open an issue.
Also, a leaderboard is included only if:
- It is actively maintained.
- It is related to foundation models.
Table of Contents
Tools
Name | Description |
---|---|
gradio_leaderboard | gradio_leaderboard helps users build fully functional and performant leaderboard demos with gradio (see the sketch after this table). |
Demo leaderboard | Demo leaderboard helps users easily deploy their leaderboards with a standardized template. |
Leaderboard Explorer | Leaderboard Explorer helps users navigate the diverse range of leaderboards available on Hugging Face Spaces. |
open_llm_leaderboard | open_llm_leaderboard helps users access Open LLM Leaderboard data easily. |
open-llm-leaderboard-renamer | open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily. |
Open LLM Leaderboard Results PR Opener | Open LLM Leaderboard Results PR Opener helps users showcase Open LLM Leaderboard results in their model cards. |
Open LLM Leaderboard Scraper | Open LLM Leaderboard Scraper helps users scrape and export data from Open LLM Leaderboard. |
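As a concrete example of how these tools fit together, a leaderboard Space built with gradio_leaderboard usually amounts to rendering a results table inside a Gradio app. The following is a minimal sketch: the model names and scores are made-up placeholders, and only the basic `Leaderboard(value=...)` constructor is assumed; consult the component's documentation for search, filtering, and column-selection options.

```python
# Minimal sketch of a leaderboard demo built with the gradio_leaderboard component.
# The models and scores below are placeholder values for illustration only.
import gradio as gr
import pandas as pd
from gradio_leaderboard import Leaderboard

# Example results table: one row per model, one column per metric.
results = pd.DataFrame(
    {
        "Model": ["model-a", "model-b", "model-c"],
        "Average": [71.2, 68.5, 64.9],
        "MMLU": [70.1, 66.0, 63.2],
        "GSM8K": [72.3, 71.0, 66.6],
    }
)

with gr.Blocks() as demo:
    gr.Markdown("# My Foundation Model Leaderboard")
    # Render the interactive leaderboard table from the DataFrame.
    Leaderboard(value=results)

if __name__ == "__main__":
    demo.launch()
```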
Organizations
Name | Description |
---|---|
Allen Institute for AI | Allen Institute for AI is a non-profit research institute with the mission of conducting high-impact AI research and engineering in service of the common good. |
Papers With Code | Papers With Code is a community-driven platform for learning about state-of-the-art research papers on machine learning. |
Evaluations
Model-oriented
Comprehensive
Name | Description |
---|---|
CompassRank | CompassRank is a platform that offers a comprehensive, objective, and neutral evaluation reference on foundation models for industry and research. |
FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
GenAI-Arena | GenAI-Arena hosts the visual generation arena, where various vision models compete based on their performance in image generation, image editing, and video generation. |
Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car. |
SuperCLUE | SuperCLUE is a series of benchmarks for evaluating Chinese foundation models. |
Text
Name | Description |
---|---|
ACLUE | ACLUE is an evaluation benchmark for ancient Chinese language comprehension. |
AIR-Bench | AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models. |
AlignBench | AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. |
AlpacaEval | AlpacaEval is an automatic evaluator designed for instruction-following LLMs. |
ANGO | ANGO is a generation-oriented Chinese language model evaluation benchmark. |
Arabic Tokenizers Leaderboard | Arabic Tokenizers Leaderboard compares the efficiency of LLMs in parsing Arabic in its different dialects and forms. |
Arena-Hard-Auto | Arena-Hard-Auto is a benchmark for instruction-tuned LLMs. |
Auto-Arena | Auto-Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance. |
BeHonest | BeHonest is a benchmark to evaluate honesty - awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency) - in LLMs. |
BenBench | BenBench is a benchmark to evaluate the extent to which LLMs have been trained verbatim on a benchmark's training set (relative to its test set) to enhance their capabilities. |
BiGGen-Bench | BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks. |
Biomedical Knowledge Probing Leaderboard | Biomedical Knowledge Probing Leaderboard aims to track, rank, and evaluate biomedical factual knowledge probing results in LLMs. |
BotChat | BotChat assesses the multi-round chatting capabilities of LLMs through a proxy task, evaluating whether two ChatBot instances can engage in smooth and fluent conversation with each other. |
C-Eval | C-Eval is a Chinese evaluation suite for LLMs. |
C-Eval Hard | C-Eval Hard is a more challenging version of C-Eval, which involves complex LaTeX equations and requires non-trivial reasoning abilities to solve. |
Capability leaderboard | Capability leaderboard is a platform to evaluate the long-context understanding capabilities of LLMs. |
Chain-of-Thought Hub | Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs. |
ChineseFactEval | ChineseFactEval is a factuality benchmark for Chinese LLMs. |
CLEM | CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents. |
CLiB | CLiB is a benchmark to evaluate Chinese LLMs. |
CMB | CMB is a multi-level medical benchmark in Chinese. |
CMMLU | CMMLU is a benchmark to evaluate LLMs' knowledge and reasoning capabilities across various subjects within the Chinese cultural context. |
CMMMU | CMMMU is a benchmark to test the capabilities of multimodal models in understanding and reasoning across multiple disciplines in the Chinese context. |
CompMix | CompMix is a benchmark for heterogeneous question answering. |
Compression Leaderboard | Compression Leaderboard is a platform to evaluate the compression performance of LLMs. |
CoTaEval | CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs. |
ConvRe | ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations. |
CriticBench | CriticBench is a benchmark to evaluate LLMs' ability to make critique responses. |
CRM LLM Leaderboard | CRM LLM Leaderboard is a platform to evaluate the efficacy of LLMs for business applications. |
DecodingTrust | DecodingTrust is an assessment platform to evaluate the trustworthiness of LLMs. |
Domain LLM Leaderboard | Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs. |
DyVal | DyVal is a dynamic evaluation protocol for LLMs. |
Enterprise Scenarios Leaderboard | Enterprise Scenarios Leaderboard aims to assess the performance of LLMs on real-world enterprise use cases. |
EQ-Bench | EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs. |
Factuality Leaderboard | Factuality Leaderboard compares the factual capabilities of LLMs. |
FuseReviews | FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization. |
FELM | FELM is a meta-benchmark to evaluate factuality evaluators for LLMs. |
GAIA | GAIA aims to test fundamental abilities that an AI assistant should possess. |
GPT-Fathom | GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. |
Guerra LLM AI Leaderboard | Guerra LLM AI Leaderboard compares and ranks the performance of LLMs across quality, price, performance, context window, and others. |
Hallucinations Leaderboard | Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs. |
HalluQA | HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs. |
HellaSwag | HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs. |
HHEM Leaderboard | HHEM Leaderboard evaluates how often a language model introduces hallucinations when summarizing a document. |
IFEval | IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions. |
Indic LLM Leaderboard | Indic LLM Leaderboard is a benchmark to track progress and rank the performance of Indic LLMs. |
InstructEval | InstructEval is an evaluation suite to assess instruction selection methods in the context of LLMs. |
Japanese Chatbot Arena | Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese. |
JustEval | JustEval is a powerful tool designed for fine-grained evaluation of LLMs. |
Ko Chatbot Arena | Ko Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Korean. |
KoLA | KoLA is a benchmark to evaluate the world knowledge of LLMs. |
L-Eval | L-Eval is a benchmark for long-context language models (LCLMs), evaluating their performance in handling extensive context. |
Language Model Council | Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement. |
LawBench | LawBench is a benchmark to evaluate the legal capabilities of LLMs. |