Instruction Tuning Datasets
All available datasets for instruction tuning of large language models
Gold-standard datasets
- P3: https://github.com/bigscience-workshop/promptsource, https://huggingface.co/datasets/bigscience/P3
- A collection of prompted English datasets covering a diverse set of NLP tasks
- 2,000 prompt types across 270 datasets
- xP3: https://huggingface.co/datasets/bigscience/xP3mt
- A mixture of 13 training tasks in 46 languages, with prompts in 20 languages (machine-translated from English)
- Natural Instructions v2: https://github.com/allenai/natural-instructions
- A benchmark of 1,616 diverse NLP tasks and their expert-written instructions, covering 76 distinct task types and 55 different languages.
- The Flan Collection: https://github.com/google-research/FLAN/tree/main/flan/v2
- A superset that includes some of the datasets listed here
- 1,836 tasks and 15M examples
- Open Assistant: https://huggingface.co/datasets/OpenAssistant/oasst1
- A human-annotated assistant-style conversation corpus of 161,443 messages across 66,497 conversation trees in 35 different languages, annotated with 461,292 quality ratings; see the loading sketch at the end of this list
- LIMA: 1,000 high-quality instructions
- databricks-dolly-15k: https://github.com/databrickslabs/dolly/tree/master/data
- PRESTO: https://github.com/google-research-datasets/presto
- 550K contextual multilingual conversations between humans and virtual assistants
- BB3x: https://parl.ai/projects/bb3x/
- InstructCTG: https://github.com/MichaelZhouwang/InstructCTG
- CrossFit: https://github.com/INK-USC/CrossFit
- tasksource: https://arxiv.org/abs/2301.05948
- ExMix: https://arxiv.org/abs/2111.10952
- InstructEval: https://github.com/declare-lab/instruct-eval
- M3IT: https://huggingface.co/datasets/MMInstruction/M3IT
- https://arxiv.org/abs/2306.04387
- 2.4M multimodal instances and 400 instructions, covering 40 tasks and 80 languages
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning: https://arxiv.org/abs/2306.05425
- MultiInstruct: https://github.com/VT-NLP/MultiInstruct
- COLLIE: https://github.com/princeton-nlp/Collie
- Mind2Web: Towards a Generalist Agent for the Web: https://osu-nlp-group.github.io/Mind2Web/
- Android in the Wild: A Large-Scale Dataset for Android Device Control: https://github.com/google-research/google-research/tree/master/android_in_the_wild
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets: https://github.com/kaistAI/FLASK
- Safe-RLHF: https://arxiv.org/abs/2310.12773
- HelpSteer: https://huggingface.co/datasets/nvidia/HelpSteer
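Many of the gold-standard sets above are hosted on the Hugging Face Hub and can be pulled directly with the `datasets` library. Below is a minimal sketch using Open Assistant's oasst1, assuming the `datasets` package is installed and the schema still exposes the `text`, `role`, and `lang` columns described on its dataset card:

```python
from datasets import load_dataset

# Each oasst1 row is a single message; conversation trees are
# reconstructed by following the parent_id links between rows.
oasst1 = load_dataset("OpenAssistant/oasst1", split="train")

print(oasst1)                    # row count and column names
msg = oasst1[0]
print(msg["role"], msg["lang"])  # e.g. "prompter", "en"
print(msg["text"][:200])         # start of the message text
```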
Sub-gold-standard / generated with language models
- Self-Instruct: https://github.com/yizhongw/self-instruct
- Unnatural Instructions: https://github.com/orhonovich/unnatural-instructions
- Alpaca: https://huggingface.co/datasets/tatsu-lab/alpaca (see the prompt-formatting sketch after this list)
- Alpaca-Clean: https://github.com/gururise/AlpacaDataCleaned
- Code Alpaca: https://github.com/sahil280114/codealpaca
- AlpacaGPT3.5Customized: https://huggingface.co/datasets/whitefox44/AlpacaGPT3.5Customized
- GPT4All: https://github.com/nomic-ai/gpt4all
- GPT4All-pruned: https://huggingface.co/datasets/Nebulous/gpt4all_pruned
- ShareGPT: https://huggingface.co/datasets/RyokoAI/ShareGPT52K
- GPTeacher: https://github.com/teknium1/GPTeacher
- CAMEL🐪: https://www.camel-ai.org/
- Human ChatGPT Comparison Corpus (HC3): https://github.com/Hello-SimpleAI/chatgpt-comparison-detection
- InstructionWild: https://github.com/XueFuzhao/InstructionWild
- Instruction Tuning with GPT-4: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
- Guanaco: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
- The LongForm dataset: https://github.com/akoksal/LongForm/tree/main/dataset
- LLM-generated instructions for diverse corpus samples (27,739 pairs of instructions and long-form texts)
- UltraChat: https://huggingface.co/datasets/stingning/ultrachat
- LLaVA Visual Instruct 150K: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
- GPT-generated multimodal instruction-following data
- GPT4Tools: https://github.com/StevenGrove/GPT4Tools
- Instruction data for invoking the APIs of multiple multimodal models
- LaMini-Instruction: https://huggingface.co/datasets/MBZUAI/LaMini-instruction
- 2.58M pairs of instructions and responses
- Evol-Instruct 70k: https://github.com/nlpxucan/WizardLM
- Dynosaur: https://dynosaur-it.github.io/
- Alpaca-Farm: https://github.com/tatsu-lab/alpaca_farm
- ign_clean_instruct_dataset_500k: https://huggingface.co/datasets/ignmilton/ign_clean_instruct_dataset_500k
- airoboros: https://github.com/jondurbin/airoboros
- UltraFeedback: https://huggingface.co/datasets/openbmb/UltraFeedback
- WildChat: a corpus of 570K real user-ChatGPT interactions: https://wildchat.allen.ai/
- The Feedback Collection: https://arxiv.org/abs/2310.08491
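Most LM-generated sets in this group share a simple instruction/input/output schema. As referenced in the Alpaca entry above, here is a minimal sketch of rendering Alpaca records into training prompts; the template mirrors the one published with the Stanford Alpaca release, and the `prompt`/`completion` field names are arbitrary choices for this example:

```python
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def to_prompt(example):
    """Render one record into an instruction-following prompt/completion pair."""
    if example["input"]:
        prompt = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
        )
    return {"prompt": prompt, "completion": example["output"]}

train = alpaca.map(to_prompt)
print(train[0]["prompt"] + train[0]["completion"])
```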
Preference datasets (can be used to train reward models)
- HH-RLHF: https://huggingface.co/datasets/Anthropic/hh-rlhf
- Contains human ratings of model outputs for helpfulness and harmlessness. The dataset holds roughly 160K human-rated examples, each consisting of a pair of chatbot responses, one of which is preferred by the human; see the loading sketch after this list
- OpenAI WebGPT: https://huggingface.co/datasets/openai/webgpt_comparisons
- Contains around 20K comparisons; each example includes a question, a pair of model answers, and metadata. The answers are rated by humans according to preference.
- OpenAI Summarization: https://huggingface.co/datasets/openai/summarize_from_feedback
- Contains roughly 93K examples, each with human feedback on model-generated summaries. Human evaluators chose the better of two candidate summaries.
- Stanford Human Preferences Dataset (SHP): https://huggingface.co/datasets/stanfordnlp/SHP
- 385K collective human preferences over responses to questions/instructions in 18 different subject areas
- Stack Exchange Preferences: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
- SLF5K: https://huggingface.co/datasets/JeremyAlain/SLF5K
- qa-from-hf: https://github.com/lil-lab/qa-from-hf
- Nectar: https://huggingface.co/datasets/berkeley-nest/Nectar
- JudgeLM-100K: https://huggingface.co/datasets/BAAI/JudgeLM-100K
- UltraFeedback: https://huggingface.co/datasets/openbmb/UltraFeedback
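These preference sets generally store a preferred and a dispreferred response per record, which is what a pairwise reward-model objective consumes. Below is a minimal sketch using HH-RLHF (referenced in the list above); the `chosen`/`rejected` column names follow its dataset card, and the loss in the comment is the standard pairwise ranking objective, not something specific to this dataset:

```python
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")

pair = hh[0]
print("CHOSEN:\n", pair["chosen"][:300])
print("REJECTED:\n", pair["rejected"][:300])

# A reward model r(.) is typically trained so that
#   loss = -log(sigmoid(r(chosen) - r(rejected)))
# is minimized over pairs like these.
pairs = [(row["chosen"], row["rejected"]) for row in hh.select(range(100))]
```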
Miscellaneous
- OIG: https://huggingface.co/datasets/laion/OIG
- A superset of some of the datasets listed here
- oa_leet10k: https://huggingface.co/datasets/ehartford/oa_leet10k
- LeetCode problems solved in multiple programming languages
- ProsocialDialog: https://huggingface.co/datasets/allenai/prosocial-dialog
- ConvoKit: https://convokit.cornell.edu/documentation/datasets.html
- CoT-Collection: https://github.com/kaist-lklab/CoT-Collection
- DialogStudio: https://github.com/salesforce/DialogStudio
- Chatbot Arena Conversations: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
- lmsys 1M: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
- Conversation Chronicles: https://conversation-chronicles.github.io/