AlpacaEval : An Automatic Evaluator for Instruction-following Language Models
AlpacaEval 2.0 with length-controlled win-rates (paper) has a Spearman correlation of 0.98 with ChatBot Arena while costing less than $10 in OpenAI credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5 min), cheap (< $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks:
Updates:
:tada: Length-controlled Win Rates are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details here.
:tada: AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
Table of Contents
Overview
Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental limitations like the preference for longer outputs. AlpacaEval provides the following:
- Leaderboard: a leaderboard of common models on the AlpacaEval evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).
- Automatic evaluator: an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default.
- Toolkit for building automatic evaluators: a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance etc).
- Human evaluation data: 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).
- AlpacaEval dataset: a simplification of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. Details here.
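For example, here is a minimal sketch of loading the evaluation set, assuming it is hosted as `tatsu-lab/alpaca_eval` on the HuggingFace Hub (the print loop is only illustrative):

```python
import datasets

# load the AlpacaEval evaluation set from the HuggingFace Hub
eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

for example in eval_set:
    # each example has a single (merged) "instruction" field
    print(example["instruction"])
```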
When to use and not use AlpacaEval?
When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.
When not to use AlpacaEval? Like any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., to decide on model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases, such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in limitations.
Quick Start
To install the stable release, run

```bash
pip install alpaca-eval
```

To install the nightly version, run

```bash
pip install git+https://github.com/tatsu-lab/alpaca_eval
```

Then you can use it as follows:

```bash
export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'
```
This will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the `model_outputs` file. Important parameters are the following:
- `model_outputs`: A path to a json file for the outputs of the model to add to the leaderboard. Each dictionary should contain the keys `instruction` and `output`.
- `annotators_config`: This is the annotator to use. We recommend using `weighted_alpaca_eval_gpt4_turbo` (default for AlpacaEval 2.0), which has a high agreement rate with our human annotation data, large context size, and is pretty cheap. For a comparison of all annotators see here.
- `reference_outputs`: The outputs of the reference model. Same format as `model_outputs`. By default, this is `gpt4_turbo` for AlpacaEval 2.0.
- `output_path`: Path for saving annotations and leaderboard.
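For reference, a minimal sketch of what such a `model_outputs` file could contain (the instruction text and file name below are only illustrative; the documented requirement is one dictionary per instruction with the `instruction` and `output` keys):

```python
import json

# one dictionary per instruction of the evaluation set, each with your model's output
model_outputs = [
    {
        "instruction": "What are the names of some famous actors that started their careers on Broadway?",
        "output": "Some famous actors who started their careers on Broadway include ...",
    },
    # ... remaining instructions
]

with open("outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)
```

You would then pass the path of this file to `--model_outputs`.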
If you don't have the model outputs, you can use `evaluate_from_model` and pass a local path or a name of a HuggingFace model, or a model from a standard API (OpenAI, Anthropic, Cohere, google, ...). Other commands:
```
>>> alpaca_eval -- --help

SYNOPSIS
    alpaca_eval COMMAND

COMMANDS
    COMMAND is one of the following:

     evaluate
       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

     evaluate_from_model
       Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

     make_leaderboard
       Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

     analyze_evaluators
       Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).
```
For more information about each function use `alpaca_eval <command> -- --help`.
Leaderboards and how to interpret them
Models
Our leaderboards are computed on the AlpacaEval dataset. We precomputed the leaderboard for important models using different baseline models and auto-annotators. Our two main leaderboards ("AlpacaEval 2.0" and "AlpacaEval") can be found on this page.

"AlpacaEval 2.0" uses `weighted_alpaca_eval_gpt4_turbo` for the annotator and `gpt4_turbo` for the baseline. "AlpacaEval" uses `alpaca_eval_gpt4` for the annotator and `text_davinci_003` for the baseline. For all precomputed leaderboards see here.

Later we also show how to add your model to the leaderboard and how to make a new leaderboard for your evaluator/dataset. See here for the configs of all models that are available out of the box.
AlpacaEval minimal leaderboard:
| Model | Win Rate | Std Error |
|---|---|---|
| gpt4 | 95.3 | 0.7 |
| claude | 88.4 | 1.1 |
| chatgpt | 86.1 | 1.2 |
| guanaco-65b | 71.8 | 1.6 |
| vicuna-13b | 70.4 | 1.6 |
| text_davinci_003 | 50.0 | 0.0 |
| alpaca-farm-ppo-human | 41.2 | 1.7 |
| alpaca-7b | 26.5 | 1.5 |
| text_davinci_001 | 15.2 | 1.2 |
How exactly are those metrics computed?
Win Rate: the win rate measures the fraction of times the model's output is preferred over the reference's output (`text_davinci_003` for AlpacaEval and `gpt4_turbo` for AlpacaEval 2.0). More specifically, to compute the win rate we collect outputs of the desired model on every instruction from the AlpacaEval dataset. We then pair each output with the output of our reference model (e.g. `text_davinci_003`) on the same instruction, and ask our automatic evaluator which output it prefers. See AlpacaEval's and AlpacaEval 2.0's prompts and configs; in particular, we randomize the order of outputs to avoid position bias. We then average the preferences over all instructions in the dataset to get the win rate of the model over the baseline. If both outputs are exactly the same, we use a half preference for both models.

Standard error: this is the standard error (normalized by N-1) of the win rate, i.e., the preferences averaged over the different instructions.
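As a minimal sketch (using a hypothetical `preferences` array, not the library's internal code), the win rate and its standard error follow from the per-instruction preferences like so:

```python
import numpy as np

# hypothetical per-instruction preferences of the evaluator for the model over the baseline:
# 1.0 = model's output preferred, 0.0 = baseline's output preferred,
# 0.5 = both outputs identical (half preference for each model)
preferences = np.array([1.0, 0.0, 1.0, 0.5, 1.0, 0.0, 1.0, 1.0])

win_rate = 100 * preferences.mean()                                    # in percent
std_error = 100 * preferences.std(ddof=1) / np.sqrt(len(preferences))  # ddof=1 -> normalized by N-1

print(f"win rate: {win_rate:.1f}%  std error: {std_error:.1f}")
```

For AlpacaEval 2.0 the preferences are continuous in [0, 1] (see the logprob-based annotator below), but the averaging is the same.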
Details about our auto-annotator: alpaca_eval_gpt4

Our `alpaca_eval_gpt4` annotator (see configs) averages over preferences, where preferences are obtained as follows:
- it takes in an instruction and a pair of outputs (from the desired model and the reference model)
- if a preference for this triple was already computed, it returns it (i.e. it uses caching)
- it randomizes the order of the outputs to avoid position bias
- it formats the instruction and outputs into a zero-shot prompt, which asks to rank the outputs in order of preference
- it completes the prompt using GPT-4 with `temperature=0`
- it parses the preference from the completion and returns it
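Conceptually, a single preference is obtained roughly as in the following sketch (illustrative only, not the library's actual implementation; `complete_with_gpt4` is a hypothetical completion function and the prompt is heavily simplified):

```python
import random

def annotate(instruction, output_model, output_reference, cache, complete_with_gpt4):
    """Illustrative sketch of one annotation: caching, order randomization, prompting, parsing."""
    key = (instruction, output_model, output_reference)
    if key in cache:  # reuse a previously computed preference for this triple
        return cache[key]

    # randomize the order of the two outputs to avoid position bias
    outputs = [("model", output_model), ("reference", output_reference)]
    random.shuffle(outputs)

    # simplified zero-shot ranking prompt
    prompt = (
        f"Instruction: {instruction}\n"
        f"Output (1): {outputs[0][1]}\n"
        f"Output (2): {outputs[1][1]}\n"
        "Which output is better? Answer with 1 or 2."
    )
    completion = complete_with_gpt4(prompt, temperature=0)

    # parse the ranking and map it back to a preference for the model (1.0) or the reference (0.0)
    preferred_label = outputs[int(completion.strip()) - 1][0]
    preference = 1.0 if preferred_label == "model" else 0.0

    cache[key] = preference
    return preference
```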
The annotator is a mix between (and was highly influenced by) AlpacaFarm and Aviary evaluators. In particular, we use the same code as for AlpacaFarm (caching/randomization/hyperparameters) but use a ranking prompt similar to that of Aviary. We make changes to Aviary's prompt to decrease the bias for longer outputs. Details in Related work.
For AlpacaEval 2.0 we use `weighted_alpaca_eval_gpt4_turbo`, which uses logprobs to compute a continuous preference and uses GPT-4 Turbo as the model (see configs).
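Illustratively (the exact tokens and parsing are defined in the annotator's configs; the sketch below only shows the idea of a logprob-weighted, continuous preference):

```python
import math

def continuous_preference(logprob_model, logprob_reference):
    """Turn the logprobs of the two possible rankings into a preference in [0, 1]."""
    p_model = math.exp(logprob_model)          # probability mass on "model's output is better"
    p_reference = math.exp(logprob_reference)  # probability mass on "reference's output is better"
    return p_model / (p_model + p_reference)   # renormalize -> continuous instead of hard 0/1

print(continuous_preference(-0.2, -1.7))  # ~0.82
```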
Evaluators
We evaluate different automatic annotators on the AlpacaEval set by comparing to 2.5K human annotations we collected (~650 instructions, each with 4 human annotations). Below we show metrics for our suggested evaluators (`weighted_alpaca_eval_gpt4_turbo`, `alpaca_eval_gpt4`), for prior automatic evaluators (`alpaca_farm_greedy_gpt4`, `aviary_gpt4`, `lmsys_gpt4`), for humans (`humans`), and for different base models with essentially the same prompt (`gpt4`, `claude`, `text_davinci_003`, `chatgpt_fn`, `guanaco_33b`, `chatgpt`). See here for the configs of all evaluators that are available out of the box and their associated metrics.
| | Human agreement | Price [$/1000 examples] | Time [seconds/1000 examples] | Spearman corr. | Pearson corr. | Bias | Variance | Proba. prefer longer |
|---|---|---|---|---|---|---|---|---|
| alpaca_eval_gpt4 | 69.2 | 13.6 | 1455 | 0.97 | 0.93 | | | |