LLMDataHub: Awesome Datasets for LLM Training
🔥 Alignment Datasets • 💡 Domain-specific Datasets • :atom: Pretraining Datasets • 🖼️ Multimodal Datasets
Introduction 📄
Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological changes. Recently, with the emergence of open-source large-model frameworks like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies. Training LLMs has become a major interest of small organizations and individuals in the open-source community, with notable works including Alpaca, Vicuna, and Luotuo. In addition to model frameworks, large-scale, high-quality training corpora are essential for training LLMs, yet the relevant open-source corpora in the community remain scattered. The goal of this repository is therefore to continuously collect high-quality training corpora for LLMs in the open-source community.
Training a chatbot LLM that can follow human instructions effectively requires access to high-quality datasets that cover a range of conversation domains and styles. In this repository, we provide a curated collection of datasets specifically designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.
Contact 📬
If you want to contribute, you can contact:
Junhao Zhao 📧
Advised by Prof. Wanyun Cui
General Open Access Datasets for Alignment 🟢:
Type Tags 🏷️:
- SFT: Supervised Fine-tuning
- Dialog: Each entry contains continuous conversations
- Pairs: Each entry is an input-output pair
- Context: Each entry has a context text and related QA pairs
- PT: Pretraining
- CoT: Chain-of-Thought Fine-tuning
- RLHF: Data for training a reward model in Reinforcement Learning from Human Feedback (example entry shapes for each tag are sketched below)
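To make these tags concrete, here is a minimal sketch of what a single entry of each type might look like. All field names are illustrative assumptions, not a shared schema; each dataset defines its own fields, so always check the individual dataset card.

```python
# Illustrative entry shapes for the type tags above.
# All field names are hypothetical; real datasets use varying schemas.

pairs_entry = {  # Pairs: each entry is one input-output pair
    "instruction": "Summarize the following paragraph.",
    "input": "Large language models are ...",
    "output": "LLMs are ...",
}

dialog_entry = {  # Dialog: each entry is a continuous multi-turn conversation
    "conversations": [
        {"role": "user", "content": "Hi, what is RLHF?"},
        {"role": "assistant", "content": "RLHF stands for ..."},
        {"role": "user", "content": "How is the reward model trained?"},
    ]
}

context_entry = {  # Context: a context text plus related QA pairs
    "context": "The Apollo 11 mission landed on the Moon in 1969 ...",
    "qa_pairs": [
        {"question": "When did Apollo 11 land?", "answer": "1969"},
    ],
}

rlhf_entry = {  # RLHF: paired responses ranked by human preference,
    "prompt": "Explain photosynthesis.",  # used to train a reward model
    "chosen": "Photosynthesis is the process by which plants ...",
    "rejected": "Plants eat sunlight.",
}
```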
Datasets released in November 2023
Dataset name | Used by | Type | Language | Size | Description |
---|---|---|---|---|---|
HelpSteer | / | RLHF | English | 37k instances | An RLHF dataset annotated by humans with helpfulness, correctness, coherence, complexity, and verbosity measures. |
no_robots | / | SFT | English | 10k instances | High-quality human-created SFT data, single-turn. |
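Most datasets in these tables are distributed through the Hugging Face Hub and can be pulled with the `datasets` library. Below is a minimal loading sketch; the hub ID and split name are assumptions for illustration, so check each dataset's link for its actual location and schema.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical hub ID and split; replace with the values from the
# dataset's own card (linked in the tables above).
ds = load_dataset("HuggingFaceH4/no_robots", split="train")

print(ds)     # features and number of rows
print(ds[0])  # inspect a single SFT entry
```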
Datasets released in September 2023
Dataset name | Used by | Type | Language | Size | Description |
---|---|---|---|---|---|
Anthropic_HH_Golden | ULMA | SFT / RLHF | English | train 42.5k + test 2.3k | An improved version of the harmless portion of Anthropic's Helpful and Harmless (HH) datasets, in which GPT-4 rewrites the original "chosen" answers. Compared with the original Harmless dataset, this dataset empirically improves the performance of RLHF, DPO, and ULMA methods significantly on harmlessness metrics. |
Datasets released in August 2023
Dataset name | Used by | Type | Language | Size | Description |
---|---|---|---|---|---|
function_calling_extended | / | Pairs | English, code | / | A high-quality, human-created dataset for enhancing LMs' API-calling ability. |
AmericanStories | / | PT | English | / | A vast corpus scanned from the US Library of Congress. |
dolma | OLMo | PT | / | 3T tokens | A large, diverse open-source corpus for LM pretraining. |
Platypus | Platypus2 | Pairs | English | 25K | A very high-quality dataset for improving LMs' STEM reasoning ability. |
Puffin | Redmond-Puffin Series | Dialog | English | ~3k entries | A dataset of conversations between real humans and GPT-4, featuring long contexts (over 1k tokens per conversation) and multi-turn dialogs. |
tiny series | / | Pairs | English | / | A series of short, concise code or text samples aimed at improving LMs' reasoning ability. |
LongBench | / | Evaluation Only | English, Chinese | 17 tasks | A benchmark for evaluating LLMs' long-context understanding capability. |
Datasets released in July 2023
Dataset name | Used by | Type | Language | Size | Description |
---|---|---|---|---|---|
orca-chat | / | Dialog | English | 198,463 entries | An Orca-style dialog dataset aimed at improving LMs' long-context conversational ability. |
DialogStudio | / | Dialog | Multilingual | / | A diverse collection of datasets for building conversational chatbots. |
chatbot_arena_conversations | / | RLHF Dialog | Multilingual | 33k conversations | Cleaned conversations with pairwise human preferences collected on Chatbot Arena. |
WebGLM-qa | WebGLM | Pairs | English | 43.6k entries | The dataset used by WebGLM, a QA system based on an LLM and the Internet. Each entry comprises a question, a response, and a reference; the response is grounded in the reference. |
phi-1 | phi-1 | Dialog | English | / | A dataset generated using the method in Textbooks Are All You Need, focusing on math and CS problems. |
Linly-pretraining-dataset | Linly series | PT | Chinese | 3.4GB | The Chinese pretraining dataset used by the Linly series models, comprising ClueCorpusSmall, CSL, news-crawl, etc. |
FineGrainedRLHF | / | RLHF | English | ~5K examples | A repo that aims to develop a new framework for collecting human feedback. The collected data serves to improve LLMs' factual correctness, topic relevance, and other abilities. |
dolphin | / | Pairs | English | 4.5M entries | An attempt to replicate Microsoft's Orca. Based on FLANv2. |
openchat_sharegpt4_dataset | OpenChat | Dialog | English | 6k dialogs | A high-quality dataset generated by using GPT-4 to complete refined ShareGPT prompts. |
Datasets released in June 2023
Dataset name | Used by | Type | Language | Size | Description |
---|---|---|---|---|---|
OpenOrca | / | Pairs | English | 4.5M completions | A collection of augmented FLAN data, generated using the method in the Orca paper. |
COIG-PC COIG-Lite | / | Pairs | Chinese | / | Enhanced version of COIG. |
WizardLM_Orca | orca_mini series | Pairs | English | 55K entries | Enhanced WizardLM data, generated using Orca's method. |
arxiv instruct datasets (math / CS / Physics) | / | Pairs | English | 50K / 50K / 30K entries | Datasets of question-answer pairs derived from arXiv abstracts. Questions are generated using the t5-base model; answers are generated using GPT-3.5-turbo. |
im-feeling-curious | / | Pairs | English | 2,595 entries | Random questions and corresponding facts generated by Google's "I'm feeling curious" feature. |
ign_clean_instruct_dataset_500k | / | Pairs | / | 509K entries | A large-scale SFT dataset synthetically created from a subset of UltraChat prompts. ⚠ lacks a detailed datacard |
WizardLM evolve_instruct V2 | WizardLM | Dialog | English | 196k entries | The latest version of the WizardLM Evol-Instruct dataset. |
Dynosaur | / | Pairs | English | 800K entries | A dataset generated by applying the method in this paper. Its highlight is generating high-quality data at low cost. |
SlimPajama | / | PT | Primarily English | / | A cleaned and deduplicated version of RedPajama. |
LIMA dataset | LIMA | Pairs | English | 1k entries | The high-quality SFT dataset used by LIMA: Less Is More for Alignment. |
TigerBot Series | TigerBot | PT Pairs | Chinese, English | / | Datasets used to train TigerBot, including pretraining data, SFT data, and some domain-specific datasets such as financial research reports. |
TSI-v0 | / | Pairs | English | 30k examples per task | Multi-task instruction-tuning data recast from 475 tasksource datasets, similar to the Flan collection and Natural Instructions. |
NMBVC | / | PT | Chinese | / | A large-scale, continuously updated Chinese pretraining dataset. |
StackOverflow post | / | PT | / | 35GB | Raw StackOverflow data in markdown format, for pretraining. |
Datasets released before June 2023
Dataset name | Used by | Type | Language | Size | Description |
---|---|---|---|---|---|
LaMini-Instruction | / | Pairs | English | 2.8M entries | A dataset distilled from the FLAN collection, P3, and Self-Instruct. |
ultraChat | / | Dialog | English | 1.57M dialogs | A large-scale dialog dataset created using two ChatGPT instances, one acting as the user and the other generating responses. |
ShareGPT_Vicuna_unfiltered | Vicuna | Pairs | Multilingual | 53K entries | Cleaned ShareGPT dataset. |
pku-saferlhf-dataset | Beaver | RLHF | English | 10K + 1M | The first dataset of its kind, containing 10K instances annotated with safety preferences. |
RefGPT-Dataset (unofficial link) | RefGPT | Pairs, Dialog | Chinese | ~50K entries | A Chinese dialog dataset aimed at improving factual correctness in LLMs (mitigating LLM hallucination). |
Luotuo-QA-A CoQA-Chinese | Luotuo project | Context | Chinese | 127K QA pairs | A dataset built upon translated CoQA, augmented using the OpenAI API. |
Wizard-LM-Chinese instruct-evol | Luotuo project | Pairs | Chinese | ~70K entries | The Chinese version of WizardLM 70K. Answers are obtained by feeding translated questions to OpenAI's GPT API. |
alpaca_chinese dataset | / | Pairs | Chinese | / | GPT-4-translated Alpaca data, including some supplementary data (like Chinese poetry, application,