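Exploring inverse scaling in large language model performance

The Inverse Scaling Prize is a contest to find tasks on which large language models show inverse scaling, meaning that performance on a specific task gets worse as model size increases. Such tasks help reveal potential problems with language model pretraining and scaling, and they matter for the safe and responsible use of these models. The contest will evaluate submitted tasks and include the strongest ones in a benchmark, giving language model research new insight.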

Two graphs, one with regular scaling marked 'Many tasks like this', and one with inverse scaling marked 'Any tasks like this?'

Inverse Scaling Prize

TL;DR: Win up to $100,000 for finding an important task where larger language models do worse.

Submissions due August 27, 2022 (Round 1) and October 27, 2022 (Round 2).

The contest has ended! Results: Round 1, Round 2.

Recent changes

11 October, 2023

21 March, 2023

  • Updated prize pool info

1 March, 2023

17 December, 2022

  • Updated prize eligibility for FAR employees

12 December, 2022

  • Added prize terms update to the ‘Prize information’ section
  • Updated ‘About us’

9 October, 2022

  • Added Huggingface Hub evaluation setup description to the tips section

4 October, 2022

  • BUG FIX: Reported total probabilities should now be more accurate for all classification tasks

26 September, 2022

  • Demonstrating positive scaling on the ‘incorrect’ answer is now allowed
  • Added stronger recommendation to aim for roughly 1000 examples
  • Added requirement to name the task
  • Added request to include data for control experiments
  • Added field to specify how the data was generated
  • Added field for links to dataset sources
  • Added field for code that generated dataset
  • Added requirement that submissions in multiple parts should upload all .csv files together in one .zip
  • Added request to make file names anonymous
  • Added option to use a variable number of classes in classification datasets
  • Added print out to colabs of the total probability given to class labels
  • Added reminder that the submitted plot should be from our official colab
  • Added request that people edit their form submission rather than resubmit to update
  • Added reminder to specify correct behavior on the task
  • Added field to specify whether the task is zero-shot or few-shot
  • Updated terms and conditions

Motivation

As language models get larger, they seem to only get better. Larger language models score better on benchmarks and unlock new capabilities like arithmetic [1], few-shot learning [1], and multi-step reasoning [2]. However, language models are not without flaws, exhibiting many biases [3] and producing plausible misinformation [4]. The purpose of this contest is to find evidence for a stronger failure mode: tasks where language models get worse as they become better at language modeling (next word prediction).

The standard paradigm in natural language processing today is to pretrain large language models to autocomplete text corpora. The resulting models are then either frozen and used directly for other tasks (zero-shot or using few-shot learning), or additionally trained on other tasks (fine-tuning). We focus on the case of zero-shot/few-shot evaluation on downstream tasks without task-specific gradient optimization: it's typically easier to use in practice and to study.

Scaling laws [5][6] show that language models get predictably better (in terms of test loss and downstream performance [7]) as the number of parameters, amount of compute used, and dataset size increase. The improvement follows a power law in each of parameters, compute, and dataset size. We hypothesize that there are tasks with trends in the opposite direction: task performance gets monotonically, predictably worse as the overall test loss of the language model improves. We call this phenomenon inverse scaling, in contrast with the standard scaling laws. There are some tasks that appear to show inverse scaling under some conditions [4][8][10], but such tasks appear to be rare.

This contest aims to find inverse scaling tasks, especially those of importance to the safe and responsible use of language models. We hope that task submissions will teach us more about what types of tasks exhibit inverse scaling; inverse scaling tasks will also highlight potential issues with the current paradigm of language model pretraining and scaling. Inverse scaling tasks are important because they represent a mismatch between the behavior we want language models to exhibit and the behavior we get in practice from the training objectives and data we use. As language models continue to get bigger and are used in more real-world applications, it is important that they are not getting worse or harming users in yet-undetected ways.

After two rounds of the contest, we will write a survey of the submitted tasks and other examples found in the literature. Authors of winning tasks will be awarded prize money and invited to be co-authors on the resulting paper. Below, we detail our call for submissions. Feel free to join our Slack to message us with questions, find collaborators, and participate in contest-related discussions with other participants (code, ideas, findings, and related work sharing).

Prize information

2023/03/21 Update: The prize pool has been funded by Open Philanthropy.

We will award up to $250,000 in total prize money for task submissions, distributed as follows:

  1. Up to 1 Grand Prize of $100,000.
  2. Up to 5 Second Prizes of $20,000 each.
  3. Up to 10 Third Prizes of $5,000 each.

All prize decisions will be made by the organizers and anonymous reviewers, using the Prize Rubric below. Prize winners may nominate a non-profit to receive the prize money on their behalf. Some prizes may remain unawarded if there are not enough tasks that meet the eligibility for a prize tier, as detailed in the Prize Rubric.

Benchmark and Co-authorship: Authors of prize-winning submissions will be invited as co-authors on the paper written after the contest concludes. We will also offer co-authorship to authors of submissions that met our acceptability criteria but did not receive prizes, in the event that we receive more acceptable submissions than we can award with prizes. We will include all accepted submissions in our final benchmark, which we plan to release to the research community after the contest.

Timeline: The contest begins on June 27, 2022. We will host a first round of evaluations on submissions received on or before August 27, 2022 (Anywhere on Earth) and a second, final round of evaluations on submissions received on or before October 27, 2022 (Anywhere on Earth). After the first round, we will award third prizes (up to 5) and second prizes (up to 2) to eligible tasks. To help improve first-round submissions, we will also return reviewer feedback and scaling law plots/results from our private evaluation models. Submissions will be paused for two weeks at the end of the first round to allow any necessary improvements to be made. At the end of the second round, we will award prizes to eligible tasks at all prize tiers, with the possibility of upgrading first-round submissions to higher prize tiers based on both rounds of submissions.

Prize Rubric

Here, we detail our submission evaluation rubric. The rubric will guide an anonymous panel of reviewers in judging submissions for prizes. A submission must meet all criteria in the "Grand Prize" column to win the grand prize. Likewise, a submission must meet all criteria in the "Accepted Task" column to be accepted into our benchmark and for co-authorship on our paper. For second prizes, submissions must meet all "Accepted Task" criteria and some "Grand Prize" criteria. For third prizes, submissions must meet all "Accepted Task" criteria. We may receive more eligible submissions than we have prizes for a given tier. In this case, we will first break ties based on how many "Grand Prize" criteria are met and then by having reviewers make subjective rankings within tiers (e.g., more granular measures of how much various criteria are met or the relative difficulty or importance of each criterion met). We will consider inverse scaling trends on publicly available models like GPT-3, as well as on held-out, private models for which we will run the evaluation.

| Criterion | Description | Prize Tier: No Prize | Prize Tier: Accepted Task | Prize Tier: Grand Prize |
| --- | --- | --- | --- | --- |
| Inverse Scaling Strength | How straight and steep is the inverse scaling trend on public models? | Shows flat, very bumpy, or standard scaling. | Shows approximately monotonic inverse scaling. | Shows a clear, strictly monotonic inverse scaling trend. |
| Inverse Scaling Generality | Do different models all show inverse scaling? | No inverse scaling on private models. | Shows inverse scaling on some public and some private models. | Shows inverse scaling across all public and private models tested. |
| Task Importance | Is the task important to the safe and responsible use of LMs, or for shedding light on where LMs fail? How strong are the arguments? | Weak. No users or third parties would be harmed, and the task does not shed light on where LMs fail. | Fairly convincing. Some LM users or third parties would be harmed by the discovered behavior, or the task sheds light on where LMs fail (e.g., sensitivity to prompts). | Very convincing. Significant implications for how LM research or deployment will need to be developed to be reliably safe and effective. |
| Novelty and Surprisingness | Is inverse scaling on the task novel (not shown in prior work) and surprising? | Not novel or surprising. | Novel and somewhat surprising. | Novel and surprising, teaching us something new about LMs. |
| Task Coverage | Are the examples fully representative of the described task? | Examples only cover a special subcategory or phrasing of the described task. There's no evidence of inverse scaling on other subcategories or phrasings. | Examples cover different subcategories and phrasings for the described task. | Examples cover almost all important task subcategories and phrasings, suggesting robust inverse scaling on the described task. |
| Reproducibility | Does inverse scaling appear to occur if we reproduce the task based on its description? | No, we see flat, very bumpy, or standard scaling. The particular examples submitted may have been over-optimized for inverse scaling, to the extent that the examples are unrepresentative of the described task. | Yes, but to a lesser extent. | Yes, to a similar or stronger extent. |
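The tier rules described above the rubric can be read as a small decision procedure. The following is only an illustrative sketch, not the reviewers' actual process: the criterion keys, the level names, and the function `eligible_tier` are hypothetical, "some Grand Prize criteria" is assumed to mean at least one, and a criterion rated at the Grand Prize level is assumed to also satisfy the Accepted Task level. It reports eligibility only; real awards also depend on the number of prizes available and on reviewer tie-breaking.

```python
# Hypothetical short keys for the six rubric criteria (not names from the contest).
CRITERIA = (
    "strength", "generality", "importance",
    "novelty", "coverage", "reproducibility",
)

# Each criterion is rated at one of the three rubric columns.
LEVELS = {"no_prize": 0, "accepted": 1, "grand": 2}


def eligible_tier(ratings):
    """Return the highest tier a task is eligible for, given per-criterion ratings.

    `ratings` maps every key in CRITERIA to "no_prize", "accepted", or "grand".
    Assumption: "some Grand Prize criteria" means at least one.
    """
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    scores = [LEVELS[ratings[c]] for c in CRITERIA]
    if min(scores) < LEVELS["accepted"]:
        return "no_prize"  # at least one criterion falls in the No Prize column
    grand_count = sum(score == LEVELS["grand"] for score in scores)
    if grand_count == len(CRITERIA):
        return "grand_prize"
    if grand_count > 0:
        return "second_prize"
    return "third_prize"


def grand_criteria_met(ratings):
    """Count of Grand Prize criteria met, the first tie-breaker named in the rubric text."""
    return sum(ratings[c] == "grand" for c in CRITERIA)
```

For example, a task rated "accepted" on every criterion except a "grand" rating for novelty would be eligible for a second prize under these assumptions.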

Answering the optional questions below in our submission form (in the free-form response) will make your task stand out more:

  • Does inverse scaling persist even if the model is conditioned with few-shot examples to behave correctly? If providing enough few-shot examples eliminates inverse scaling, how many examples are required for that?
  • Does inverse scaling persist even after fine-tuning on the task? Are there good reasons to think it would persist after fine-tuning?
  • Does inverse scaling persist for InstructGPT models trained with Reinforcement Learning from Human Feedback (RLHF)? To test this, you can use the same code as that for GPT-3 evaluation. We may also evaluate submissions on private RLHF models of various sizes from Anthropic [Bai et al. 2022].

We reserve the right to update the prize tier standards or criteria, e.g., between rounds if we observe submissions gaming them in some way.

Evaluation Eligibility: To be eligible for official review, a task submission must:

  1. Include a plot of loss vs. model size across ada, babbage, curie, and davinci GPT-3 models, using the provided code for GPT-3 evaluation. The plot must not show a standard scaling law. A very bumpy trend is okay for submission; we expect to observe cleaner scaling laws with our held-out evaluation models, which show clear scaling trends.
  2. Meet the formatting requirements described in the Submission Guidelines.
    • This requirement should already be satisfied if you are able to successfully run the evaluation code.
  3. Include a coherent description of the task.
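Before submitting, it can help to sanity-check whether a loss-versus-size curve looks like inverse scaling, standard scaling, or neither. The sketch below is an unofficial illustration and does not replace the plot from the provided GPT-3 evaluation code. The parameter counts in `ASSUMED_PARAMS` are placeholders (the contest text does not state model sizes), and the names `loglinear_slope`, `classify_trend`, and the tolerance `flat_tol` are hypothetical. Since the plot shows loss, a rising curve means the larger models do worse.

```python
import math

# Placeholder parameter counts for the four GPT-3 sizes. These are assumptions
# for illustration only; substitute whatever sizes you consider appropriate.
ASSUMED_PARAMS = {"ada": 3.5e8, "babbage": 1.3e9, "curie": 6.7e9, "davinci": 1.75e11}


def loglinear_slope(sizes, losses):
    """Least-squares slope of loss against log10(model size)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    return sxy / sxx


def classify_trend(sizes, losses, flat_tol=1e-3):
    """Return (label, strictly_monotonic) for a loss-vs-size curve.

    label is "inverse" (loss rises with size), "standard" (loss falls with
    size), or "flat" (absolute slope at most flat_tol, an arbitrary threshold).
    """
    pairs = sorted(zip(sizes, losses))  # order points by model size
    ordered_sizes = [size for size, _ in pairs]
    ordered_losses = [loss for _, loss in pairs]
    slope = loglinear_slope(ordered_sizes, ordered_losses)
    if abs(slope) <= flat_tol:
        return "flat", False
    steps = list(zip(ordered_losses, ordered_losses[1:]))
    if slope > 0:
        return "inverse", all(later > earlier for earlier, later in steps)
    return "standard", all(later < earlier for earlier, later in steps)


if __name__ == "__main__":
    sizes = [ASSUMED_PARAMS[m] for m in ("ada", "babbage", "curie", "davinci")]
    # Made-up losses, purely to show the call.
    print(classify_trend(sizes, [0.5, 0.6, 0.8, 1.1]))
```

A curve labeled "inverse" with `strictly_monotonic` set to `False` corresponds to the bumpy but still submittable case described in item 1.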

Models

This contest uses pretrained autoregressive language models such as GPT-3. We offer Google colab notebooks for evaluating inverse scaling with the GPT-3, OPT, and GPT-2 model series when developing a task. However, to avoid overfitting to publicly available models, we use private models to run the evaluations for awarding prizes. Currently, we are using the series of pretrained language models (without additional finetuning) from Anthropic [Bai et al. 2022]. We are in discussions with other organizations to use their models, which may be added later on to strengthen the evaluation.

Reviewers

Prize decisions will be made by an anonymous panel of reviewers. Reviewers will be selected by the contest organizers, and the panel may include some organizers. Reviewers will have ML and NLP experience relevant to inverse scaling and will not be allowed to make submissions to the contest.

Submission guidelines

  1. Each task submission should be a language modeling test set (in the style of BIG-Bench) of inputs with corresponding answers, which will be evaluated according to one of four evaluation metrics (detailed later).
  2. This prize is to incentivize original work, so submissions should find a new phenomenon for which inverse scaling has not been previously documented.
    1. If a task has already shown inverse scaling in prior work (even if the original authors did not identify it as such) then it is ineligible for the contest.
    2. If an existing task has not been subjected to any kind of scaling analysis, then it is likely eligible for the contest.
    3. If you would like to check whether an existing task is eligible, message us on our Slack or email us at inverse.scaling@gmail.com with [PRIOR WORK] in the subject line and a link to where the task has previously been published.
  3. Data must be formatted as a .zip containing .csv files.
    • The zip should be called <task_name>.zip, where task_name is the name you provide for your submission in the form (e.g. lambada.zip).
    • The file should be called <task_name>.csv (e.g. lambada.csv).
      • If you have multiple parts to the same task, add -PART<i> to each (e.g. lambada-PART1.csv, lambada-PART2.csv).
      • If you have control experiments, add these as <task_name>-CONTROL<i>.csv (e.g. lambada-CONTROL1.csv).
    • The .csv files will be read using the pandas package, using the default arguments.
    • Specific formats are given below in the Evaluation metrics section.
  4. Examples will be given as a prompt to an autoregressive language model.
    • I.e., either zero-shot or few-shot prompts (prompts containing a few examples). Few-shot examples must demonstrate the correct behavior on the task.
  5. Tasks must contain at least 300 examples.
    • We strongly recommend aiming for on the order of 1000 examples so that inverse scaling trends are clearer.
    • Submissions with unclear scaling trends and close to the minimum number of examples are unlikely to win a prize.
  6. In the submission form, you will be asked to add:
    1. Evaluation metric used
      • The metric should be one of:
        1. Classification loss in a multiple-choice format (classification).
        2. Loss on a sequence at the end of the prompt (sequence_prob).
        3. Difference in logodds between two possible responses (logodds).
        4. Absolute difference in logodds between two possible responses (absolute_logodds).
    2. Authors
    3. Task name
    4. Description of intended task
      • What is the task aiming to test?
      • Remember to explain what good behavior
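The file-naming and packaging rules in item 3 and the minimum size in item 5 can be sketched in a few lines. This is a rough illustration under stated assumptions, not the contest's tooling: the helpers `member_name`, `rows_to_csv`, and `build_submission_zip` are hypothetical, the single `prompt` column is a placeholder (the real column formats belong to the Evaluation metrics section), and the 300-example minimum is assumed to apply to the task files in total, excluding control files.

```python
import csv
import io
import zipfile

MIN_EXAMPLES = 300  # minimum number of examples stated in the guidelines


def member_name(task_name, part=None, control=None):
    """Build a .csv file name following the <task_name>, -PART<i>, -CONTROL<i> pattern."""
    if part is not None and control is not None:
        raise ValueError("a file is either a PART or a CONTROL, not both")
    if part is not None:
        return f"{task_name}-PART{part}.csv"
    if control is not None:
        return f"{task_name}-CONTROL{control}.csv"
    return f"{task_name}.csv"


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header row."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def build_submission_zip(task_name, parts, controls=(), fieldnames=("prompt",)):
    """Return (zip_file_name, zip_bytes) for a task.

    parts and controls are lists of row lists. One part becomes <task_name>.csv;
    several parts become <task_name>-PART1.csv, <task_name>-PART2.csv, and so on.
    Assumption: the minimum example count is checked over all parts combined.
    """
    total = sum(len(rows) for rows in parts)
    if total < MIN_EXAMPLES:
        raise ValueError(f"{total} examples; at least {MIN_EXAMPLES} are required")
    members = {}
    if len(parts) == 1:
        members[member_name(task_name)] = parts[0]
    else:
        for i, rows in enumerate(parts, start=1):
            members[member_name(task_name, part=i)] = rows
    for i, rows in enumerate(controls, start=1):
        members[member_name(task_name, control=i)] = rows
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, rows in members.items():
            zf.writestr(name, rows_to_csv(rows, fieldnames))
    return f"{task_name}.zip", buf.getvalue()
```

Calling `build_submission_zip("lambada", [rows])` with a list of at least 300 row dicts would yield the name `lambada.zip` and the bytes of an archive containing `lambada.csv`; writing those bytes to disk is left to the caller.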