AlpacaEval : An Automatic Evaluator for Instruction-following Language Models
AlpacaEval 2.0 with length-controlled win-rates (paper) has a Spearman correlation of 0.98 with ChatBot Arena while costing less than $10 in OpenAI credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5 min), cheap (< $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks:
Updates:
:tada: Length-controlled Win Rates are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details here.
:tada: AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
Table of Contents
Overview
Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental limitations like the preference for longer outputs. AlpacaEval provides the following:
- Leaderboard: a leaderboard of common models on the AlpacaEval evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).
- Automatic evaluator: an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default.
- Toolkit for building automatic evaluators: a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance etc).
- Human evaluation data: 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).
- AlpacaEval dataset: a simplification of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. Details here.
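For example, here is a minimal sketch of loading the evaluation set, assuming it is hosted as `tatsu-lab/alpaca_eval` on the HuggingFace Hub (the print loop is only illustrative):

```python
import datasets

# load the AlpacaEval evaluation set from the HuggingFace Hub
eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

for example in eval_set:
    # each example has a single (merged) "instruction" field
    print(example["instruction"])
```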
When to use and not use AlpacaEval?
When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.
When not to use AlpacaEval? Like any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., to decide on model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases, such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in limitations.
Quick Start
To install the stable release, run

```bash
pip install alpaca-eval
```

To install the nightly version, run

```bash
pip install git+https://github.com/tatsu-lab/alpaca_eval
```

Then you can use it as follows:

```bash
export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'
```
This will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the `model_outputs` file. Important parameters are the following:
- `model_outputs`: A path to a json file for the outputs of the model to add to the leaderboard. Each dictionary should contain the keys `instruction` and `output`.
- `annotators_config`: This is the annotator to use. We recommend using `weighted_alpaca_eval_gpt4_turbo` (default for AlpacaEval 2.0), which has a high agreement rate with our human annotation data, large context size, and is pretty cheap. For a comparison of all annotators see here.
- `reference_outputs`: The outputs of the reference model. Same format as `model_outputs`. By default, this is `gpt4_turbo` for AlpacaEval 2.0.
- `output_path`: Path for saving annotations and leaderboard.
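For reference, a minimal sketch of what such a `model_outputs` file could contain (the instruction text and file name below are only illustrative; the documented requirement is one dictionary per instruction with the `instruction` and `output` keys):

```python
import json

# one dictionary per instruction of the evaluation set, each with your model's output
model_outputs = [
    {
        "instruction": "What are the names of some famous actors that started their careers on Broadway?",
        "output": "Some famous actors who started their careers on Broadway include ...",
    },
    # ... remaining instructions
]

with open("outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)
```

You would then pass the path of this file to `--model_outputs`.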
If you don't have the model outputs, you can use `evaluate_from_model` and pass a local path or a name of a HuggingFace model, or a model from a standard API (OpenAI, Anthropic, Cohere, google, ...). Other commands:
```
>>> alpaca_eval -- --help

SYNOPSIS
    alpaca_eval COMMAND

COMMANDS
    COMMAND is one of the following:

     evaluate
       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

     evaluate_from_model
       Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

     make_leaderboard
       Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

     analyze_evaluators
       Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).
```
For more information about each function use `alpaca_eval <command> -- --help`.
Leaderboards and how to interpret them
Models
Our leaderboards are computed on the AlpacaEval dataset. We precomputed the leaderboard for important models using different baseline models and auto-annotators. Our two main leaderboards ("AlpacaEval 2.0" and "AlpacaEval") can be found on this page.

"AlpacaEval 2.0" uses `weighted_alpaca_eval_gpt4_turbo` for the annotator and `gpt4_turbo` for the baseline. "AlpacaEval" uses `alpaca_eval_gpt4` for the annotator and `text_davinci_003` for the baseline. For all precomputed leaderboards see here.

Later we also show how to add your model to the leaderboard and how to make a new leaderboard for your evaluator/dataset. See here for the configs of all models that are available out of the box.
AlpacaEval minimal leaderboard:
| Model | Win Rate | Std Error |
|---|---|---|
| gpt4 | 95.3 | 0.7 |
| claude | 88.4 | 1.1 |
| chatgpt | 86.1 | 1.2 |
| guanaco-65b | 71.8 | 1.6 |
| vicuna-13b | 70.4 | 1.6 |
| text_davinci_003 | 50.0 | 0.0 |
| alpaca-farm-ppo-human | 41.2 | 1.7 |
| alpaca-7b | 26.5 | 1.5 |
| text_davinci_001 | 15.2 | 1.2 |
How exactly are those metrics computed?
Win Rate: the win rate measures the fraction of times the model's output is preferred over the reference's output (`text_davinci_003` for AlpacaEval and `gpt4_turbo` for AlpacaEval 2.0). More specifically, to compute the win rate we collect outputs of the desired model on every instruction from the AlpacaEval dataset. We then pair each output with the output of our reference model (e.g. `text_davinci_003`) on the same instruction, and ask our automatic evaluator which output it prefers. See AlpacaEval's and AlpacaEval 2.0's prompts and configs; in particular, we randomize the order of outputs to avoid position bias. We then average the preferences over all instructions in the dataset to get the win rate of the model over the baseline. If both outputs are exactly the same, we use a half preference for both models.

Standard error: this is the standard error (normalized by N-1) of the win rate, i.e., the preferences averaged over the different instructions.
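As a minimal sketch (using a hypothetical `preferences` array, not the library's internal code), the win rate and its standard error follow from the per-instruction preferences like so:

```python
import numpy as np

# hypothetical per-instruction preferences of the evaluator for the model over the baseline:
# 1.0 = model's output preferred, 0.0 = baseline's output preferred,
# 0.5 = both outputs identical (half preference for each model)
preferences = np.array([1.0, 0.0, 1.0, 0.5, 1.0, 0.0, 1.0, 1.0])

win_rate = 100 * preferences.mean()                                    # in percent
std_error = 100 * preferences.std(ddof=1) / np.sqrt(len(preferences))  # ddof=1 -> normalized by N-1

print(f"win rate: {win_rate:.1f}%  std error: {std_error:.1f}")
```

For AlpacaEval 2.0 the preferences are continuous in [0, 1] (see the logprob-based annotator below), but the averaging is the same.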
Details about our auto-annotator: alpaca_eval_gpt4

Our `alpaca_eval_gpt4` annotator (see configs) averages over preferences, where preferences are obtained as follows:
- it takes in an instruction and a pair of outputs (from the desired model and the reference model)
- if a preference for this triple was already computed, it returns it (i.e. it uses caching)
- it randomizes the order of the outputs to avoid position bias
- it formats the instruction and outputs into a zero-shot prompt, which asks to rank the outputs in order of preference
- it completes the prompt using GPT-4 with `temperature=0`
- it parses the preference from the completion and returns it
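Conceptually, a single preference is obtained roughly as in the following sketch (illustrative only, not the library's actual implementation; `complete_with_gpt4` is a hypothetical completion function and the prompt is heavily simplified):

```python
import random

def annotate(instruction, output_model, output_reference, cache, complete_with_gpt4):
    """Illustrative sketch of one annotation: caching, order randomization, prompting, parsing."""
    key = (instruction, output_model, output_reference)
    if key in cache:  # reuse a previously computed preference for this triple
        return cache[key]

    # randomize the order of the two outputs to avoid position bias
    outputs = [("model", output_model), ("reference", output_reference)]
    random.shuffle(outputs)

    # simplified zero-shot ranking prompt
    prompt = (
        f"Instruction: {instruction}\n"
        f"Output (1): {outputs[0][1]}\n"
        f"Output (2): {outputs[1][1]}\n"
        "Which output is better? Answer with 1 or 2."
    )
    completion = complete_with_gpt4(prompt, temperature=0)

    # parse the ranking and map it back to a preference for the model (1.0) or the reference (0.0)
    preferred_label = outputs[int(completion.strip()) - 1][0]
    preference = 1.0 if preferred_label == "model" else 0.0

    cache[key] = preference
    return preference
```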
The annotator is a mix between (and was highly influenced by) AlpacaFarm and Aviary evaluators. In particular, we use the same code as for AlpacaFarm (caching/randomization/hyperparameters) but use a ranking prompt similar to that of Aviary. We make changes to Aviary's prompt to decrease the bias for longer outputs. Details in Related work.
For AlpacaEval 2.0 we use `weighted_alpaca_eval_gpt4_turbo`, which uses logprobs to compute a continuous preference and uses GPT-4 Turbo as the model (see configs).
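Illustratively (the exact tokens and parsing are defined in the annotator's configs; the sketch below only shows the idea of a logprob-weighted, continuous preference):

```python
import math

def continuous_preference(logprob_model, logprob_reference):
    """Turn the logprobs of the two possible rankings into a preference in [0, 1]."""
    p_model = math.exp(logprob_model)          # probability mass on "model's output is better"
    p_reference = math.exp(logprob_reference)  # probability mass on "reference's output is better"
    return p_model / (p_model + p_reference)   # renormalize -> continuous instead of hard 0/1

print(continuous_preference(-0.2, -1.7))  # ~0.82
```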
Evaluators
We evaluate different automatic annotators on the AlpacaEval set by comparing to 2.5K human annotations we collected (~650 instructions, each with 4 human annotations). Below we show metrics for our suggested evaluators (`weighted_alpaca_eval_gpt4_turbo`, `alpaca_eval_gpt4`), for prior automatic evaluators (`alpaca_farm_greedy_gpt4`, `aviary_gpt4`, `lmsys_gpt4`), for humans (`humans`), and for different base models with essentially the same prompt (`gpt4`, `claude`, `text_davinci_003`, `chatgpt_fn`, `guanaco_33b`, `chatgpt`). See here for the configs of all evaluators that are available out of the box and their associated metrics.
| | Human agreement | Price [$/1000 examples] | Time [seconds/1000 examples] | Spearman corr. | Pearson corr. | Bias | Variance | Proba. prefer longer |
|---|---|---|---|---|---|---|---|---|
| alpaca_eval_gpt4 | 69.2 | 13.6 | 1455 | 0.97 | 0.93 | | | |