A Survey on Data Selection for Language Models


This repo is a convenient listing of papers relevant to data selection for language models across all stages of training. It is meant to be a resource for the community, so please contribute if you see anything missing!

For more detail on these works, and more, see our survey paper: A Survey on Data Selection for Language Models. By this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

A conceptual demonstration of the data pipeline for language model training

Data Selection for Pretraining
Data Selection for Instruction-Tuning and Multitask Training
Data Selection for Preference Fine-tuning: Alignment
Data Selection for In-Context Learning
Data Selection for Task-specific Fine-tuning

Data Selection for Pretraining

Language Filtering

Back to Table of Contents

FastText.zip: Compressing text classification models: 2016
Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov
Learning Word Vectors for 157 Languages: 2018
Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas
Cross-lingual Language Model Pretraining: 2019
Conneau, Alexis and Lample, Guillaume
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020
Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.
Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus: 2020
Caswell, Isaac and Breiner, Theresa and van Esch, Daan and Bapna, Ankur
Unsupervised Cross-lingual Representation Learning at Scale: 2020
Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman... 4 hidden ... Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020
Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard
A reproduction of Apple's bi-directional LSTM models for language identification in short strings: 2021
Toftrup, Mads and Asger Sørensen, Søren and Ciosici, Manuel R. and Assent, Ira
Evaluating Large Language Models Trained on Code: 2021
Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021
Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin
Competition-level code generation with AlphaCode: 2022
Li, Yujia and Choi, David and Chung, Junyoung... 20 hidden ... de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol
PaLM: Scaling Language Modeling with Pathways: 2022
Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022
Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine
Writing System and Speaker Metadata for 2,800+ Language Varieties: 2022
van Esch, Daan and Lucassen, Tamar and Ruder, Sebastian and Caswell, Isaac and Rivera, Clara
FinGPT: Large Generative Models for a Small Language: 2023
Luukkonen, Risto and Komulainen, Ville and Luoma, Jouni... 5 hidden ... Muennighoff, Niklas and Piktus, Aleksandra and others
MC^2: A Multilingual Corpus of Minority Languages in China: 2023
Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong
Madlad-400: A multilingual and document-level large audited dataset: 2023
Kudugunta, Sneha and Caswell, Isaac and Zhang, Biao... 5 hidden ... Stella, Romi and Bapna, Ankur and others
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023
Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024
Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo

Heuristic Approaches

Back to Table of Contents

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020
Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.
Language Models are Few-Shot Learners: 2020
Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario
The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020
Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy
Evaluating Large Language Models Trained on Code: 2021
Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021
Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin
Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022
Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022
Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine
HTLM: Hyper-Text Pre-Training and Prompting of Language Models: 2022
Armen Aghajanyan and Dmytro Okhonko and Mike Lewis... 1 hidden ... Hu Xu and Gargi Ghosh and Luke Zettlemoyer
LLaMA: Open and Efficient Foundation Language Models: 2023
Hugo Touvron and Thibaut Lavril and Gautier Izacard... 8 hidden ... Armand Joulin and Edouard Grave and Guillaume Lample
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023
Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay
The foundation model transparency index: 2023
Bommasani, Rishi and Klyman, Kevin and Longpre, Shayne... 2 hidden ... Xiong, Betty and Zhang, Daniel and Liang, Percy
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024
Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo

Data Quality

Back to Table of Contents

KenLM: Faster and Smaller Language Model Queries: 2011
Heafield, Kenneth
FastText.zip: Compressing text classification models: 2016
Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov
Learning Word Vectors for 157 Languages: 2018
Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas
Language Models are Unsupervised Multitask Learners: 2019
Alec Radford and Jeff Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever
Language Models are Few-Shot Learners: 2020
Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario
The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020
Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020
Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard
Detoxifying language models risks marginalizing minority voices: 2021
Xu, Albert and Pathak, Eshaan and Wallace, Eric and Gururangan, Suchin and Sap, Maarten and Klein, Dan
PaLM: Scaling Language Modeling with Pathways: 2022
Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel
Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022
Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection: 2022
Gururangan, Suchin and Card, Dallas and Dreier, Sarah... 2 hidden ... Wang, Zeyu and Zettlemoyer, Luke and Smith, Noah A.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts: 2022
Du, Nan and Huang, Yanping and Dai, Andrew M... 21 hidden ... Wu, Yonghui and Chen, Zhifeng and Cui,
