alpaca_eval

An accurate and low-cost automatic evaluator for instruction-following language models

AlpacaEval is an LLM-based automatic evaluation tool for instruction-following models (such as ChatGPT) that is fast, cheap, and highly correlated with human judgments (0.98). Its main components are a model leaderboard, automatic evaluators, a toolkit for building evaluators, and 20K human preference annotations. AlpacaEval 2.0 adds length-controlled win rates, which improve the correspondence with ChatBot Arena, making it well suited for quick evaluation during model development.

AlpacaEval: An Automatic Evaluator for Instruction-following Language Models

Code License Data License Python 3.10+ discord

AlpacaEval 2.0 with length-controlled win-rates (paper) has a Spearman correlation of 0.98 with ChatBot Arena while costing less than $10 in OpenAI credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5 min), cheap (< $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks:

LC AlpacaEval is the most highly correlated benchmark with Chat Arena.


Updates:

:tada: Length-controlled Win Rates are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details here.

:tada: AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details here. For the old version, set your environment variable IS_ALPACA_EVAL_2=False.


Table of Contents
  1. Overview
  2. Quick Start
  3. Leaderboards and how to interpret them
  4. Use-cases
  5. Contributing
  6. Limitations
  7. Analysis
  8. Citation
  9. Additional information

Overview

Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental limitations like the preference for longer outputs. AlpacaEval provides the following:

  • Leaderboard: a leaderboard of common models on the AlpacaEval evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).
  • Automatic evaluator: an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default.
  • Toolkit for building automatic evaluators: a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance etc).
  • Human evaluation data: 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).
  • AlpacaEval dataset: a simplification of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. Details here.

When to use and not use AlpacaEval?

When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.

When not to use AlpacaEval? Like any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., deciding on a model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases, such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in limitations.

Quick Start

To install the stable release, run

pip install alpaca-eval

To install the nightly version, run

pip install git+https://github.com/tatsu-lab/alpaca_eval

Then you can use it as follows:

export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md 
alpaca_eval --model_outputs 'example/outputs.json' 

This will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the model_outputs file. Important parameters are the following:

  • model_outputs: A path to a JSON file containing the outputs of the model to add to the leaderboard. Each dictionary should contain the keys instruction and output (see the sketch after this list).
  • annotators_config: The annotator to use. We recommend weighted_alpaca_eval_gpt4_turbo (default for AlpacaEval 2.0), which has a high agreement rate with our human annotation data, a large context size, and is pretty cheap. For a comparison of all annotators see here.
  • reference_outputs: The outputs of the reference model, in the same format as model_outputs. By default, this is gpt4_turbo for AlpacaEval 2.0.
  • output_path: Path for saving annotations and the leaderboard.
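For illustration, here is a minimal sketch of how a model_outputs file with the expected keys could be written; the file name and the example entry are made up:

```python
# Illustrative sketch: write a `model_outputs` JSON file in the expected format,
# i.e. a list of dictionaries with the keys "instruction" and "output".
# The file name and the example entry below are made up.
import json

model_outputs = [
    {
        "instruction": "Give three tips for staying healthy.",
        "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough.",
    },
    # ... one entry per instruction in the evaluation set
]

with open("outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)
```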

If you don't have the model outputs, you can use evaluate_from_model and pass a local path or a name of a HuggingFace model, or a model from a standard API (OpenAI, Anthropic, Cohere, Google, ...); see the sketch after the command list below. Other commands:

>>> alpaca_eval -- --help
SYNOPSIS
    alpaca_eval COMMAND

COMMANDS
    COMMAND is one of the following:

     evaluate
       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.

     evaluate_from_model
       Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.

     make_leaderboard
       Precompute and save an entire leaderboard for a given dataset / evaluator / set of models generations.

     analyze_evaluators
       Analyze an evaluator and populate the evaluators leaderboard (agreement with humans, speed, price, ...).

For more information about each function use alpaca_eval <command> -- --help.
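As an example of evaluate_from_model, the following sketch shells out to the CLI from Python. The --model_configs flag and the config name "my_model_config" are assumptions here; check `alpaca_eval evaluate_from_model -- --help` for the exact signature.

```python
# Hypothetical sketch: run `evaluate_from_model` from Python by calling the CLI.
# The `--model_configs` flag and the config name "my_model_config" are assumptions;
# check the command's --help for the exact flags and available configs.
import subprocess

subprocess.run(
    [
        "alpaca_eval", "evaluate_from_model",
        "--model_configs", "my_model_config",                      # hypothetical model config
        "--annotators_config", "weighted_alpaca_eval_gpt4_turbo",  # recommended annotator
    ],
    check=True,
)
```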

Leaderboards and how to interpret them

Models

Our leaderboards are computed on the AlpacaEval dataset. We precomputed the leaderboard for important models using different baseline models and auto-annotators. Our two main leaderboards ("AlpacaEval 2.0" and "AlpacaEval") can be found on this page. "AlpacaEval 2.0" uses weighted_alpaca_eval_gpt4_turbo for the annotator and gpt4_turbo for the baseline. "AlpacaEval" uses alpaca_eval_gpt4 for the annotator and text_davinci_003 for the baseline. For all precomputed leaderboards see here. Later we also show how to add your model to the leaderboard and how to make a new leaderboard for your evaluator/dataset. See here for the configs of all models that are available out of the box.

AlpacaEval minimal leaderboard:

|                       | Win Rate (%) | Std Error |
|-----------------------|-------------:|----------:|
| gpt4                  |         95.3 |       0.7 |
| claude                |         88.4 |       1.1 |
| chatgpt               |         86.1 |       1.2 |
| guanaco-65b           |         71.8 |       1.6 |
| vicuna-13b            |         70.4 |       1.6 |
| text_davinci_003      |         50.0 |       0.0 |
| alpaca-farm-ppo-human |         41.2 |       1.7 |
| alpaca-7b             |         26.5 |       1.5 |
| text_davinci_001      |         15.2 |       1.2 |

How exactly are those metrics computed?

Win Rate: the win rate measures the fraction of times the model's output is preferred over the reference's output (text-davinci-003 for AlpacaEval and gpt4_turbo for AlpacaEval 2.0). More specifically, to compute the win rate we collect the outputs of the desired model on every instruction from the AlpacaEval dataset. We then pair each output with the output of our reference model (e.g. text-davinci-003) on the same instruction and ask our automatic evaluator which output it prefers. See AlpacaEval's and AlpacaEval 2.0's prompts and configs; in particular, we randomize the order of outputs to avoid position bias. We then average the preferences over all instructions in the dataset to get the win rate of the model over the baseline. If both outputs are exactly the same we use a half preference for both models.

Standard error: this is the standard error (with N-1 normalization) of the win rate, i.e., of the preferences averaged over the different instructions: the sample standard deviation of the per-instruction preferences divided by the square root of the number of instructions.
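As a concrete illustration (not the library's code), both metrics can be computed from per-instruction preferences, where a preference is 1 if the model's output is preferred, 0 if the reference's output is preferred, and 0.5 for a tie:

```python
# Illustrative sketch of the win rate and its standard error
# from a list of per-instruction preferences (1, 0, or 0.5).
import math

def win_rate_and_std_error(preferences):
    n = len(preferences)
    mean = sum(preferences) / n                                # average preference
    var = sum((p - mean) ** 2 for p in preferences) / (n - 1)  # N-1 normalization
    std_error = math.sqrt(var / n)                             # standard error of the mean
    return 100 * mean, 100 * std_error                         # reported in percent

print(win_rate_and_std_error([1, 0, 0.5, 1, 1, 0]))  # ~ (58.3, 20.1)
```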

Details about our auto-annotator: alpaca_eval_gpt4

Our alpaca_eval_gpt4 (see configs) annotator averages over preferences, where preferences are obtained as follows (sketched in code after this list):

  1. it takes in an instruction and a pair of outputs (from the desired model and the reference model)
  2. if a preference for this triple was already computed, it returns it (i.e. it uses caching)
  3. it randomizes the order of the outputs to avoid position bias
  4. it formats the instruction and outputs into a zero-shot prompt that asks the annotator to order the outputs by preference
  5. it completes the prompt using GPT-4 with temperature=0
  6. it parses the preference from the completions and returns it
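A minimal sketch of that loop is below; complete_with_gpt4, parse_preference, and prompt_template are hypothetical stand-ins, since the real prompt and parsing live in the linked configs:

```python
# Sketch of the annotation steps above; `complete_with_gpt4`, `parse_preference`,
# and `prompt_template` are hypothetical stand-ins, not the library's actual code.
import random

CACHE = {}  # (instruction, model output, reference output) -> preference

def annotate_pair(instruction, output_model, output_ref,
                  complete_with_gpt4, parse_preference, prompt_template):
    key = (instruction, output_model, output_ref)
    if key in CACHE:                                   # (2) reuse a cached preference
        return CACHE[key]
    outputs = [("model", output_model), ("reference", output_ref)]
    random.shuffle(outputs)                            # (3) randomize order to avoid position bias
    prompt = prompt_template.format(                   # (4) zero-shot ranking prompt
        instruction=instruction, output_1=outputs[0][1], output_2=outputs[1][1]
    )
    completion = complete_with_gpt4(prompt, temperature=0)  # (5) GPT-4 with temperature=0
    preferred_slot = parse_preference(completion)            # (6) index (0 or 1) of the preferred output
    preference = 1.0 if outputs[preferred_slot][0] == "model" else 0.0
    CACHE[key] = preference
    return preference
```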

The annotator is a mix between (and was highly influenced by) AlpacaFarm and Aviary evaluators. In particular, we use the same code as for AlpacaFarm (caching/randomization/hyperparameters) but use a ranking prompt similar to that of Aviary. We make changes to Aviary's prompt to decrease the bias for longer outputs. Details in Related work.

For AlpacaEval 2.0 we use weighted_alpaca_eval_gpt4_turbo, which uses the logprobs of the annotator to compute a continuous preference and uses GPT-4 Turbo as the model (see configs).
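A sketch of how such a logprob-weighted preference can be obtained, assuming the annotator exposes log-probabilities for the two ranking labels (the label names are illustrative; the actual tokens are defined in the annotator's config):

```python
# Illustrative sketch: turn the log-probabilities of the two ranking labels into
# a continuous preference instead of a hard 0/1 choice. The label names are
# assumptions; the actual tokens are defined in the annotator's config.
import math

def continuous_preference(logprob_label_1: float, logprob_label_2: float) -> float:
    p1, p2 = math.exp(logprob_label_1), math.exp(logprob_label_2)
    return p1 / (p1 + p2)  # probability mass on "output 1 is preferred"

print(continuous_preference(-0.1, -2.4))  # ~ 0.91
```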

Evaluators

We evaluate different automatic annotators on the AlpacaEval set by comparing to 2.5K human annotations we collected (~650 instructions, each with 4 human annotations). Below we show metrics for our suggested evaluators (weighted_alpaca_eval_gpt4_turbo, alpaca_eval_gpt4), for prior automatic evaluators (alpaca_farm_greedy_gpt4, aviary_gpt4, lmsys_gpt4), for humans (humans), and for different base models with essentially the same prompt (gpt4, claude, text_davinci_003, chatgpt_fn, guanaco_33b, chatgpt). See here for the configs of all evaluators that are available out of the box and their associated metrics.

|                  | Human agreement | Price [$/1000 examples] | Time [seconds/1000 examples] | Spearman corr. | Pearson corr. | Bias | Variance | Proba. prefer longer |
|------------------|----------------:|------------------------:|-----------------------------:|---------------:|--------------:|-----:|---------:|---------------------:|
| alpaca_eval_gpt4 |            69.2 |                    13.6 |                         1455 |           0.97 |          0.93 |      |          |                      |