A Survey on Data Selection for Language Models
This repo is a convenient listing of papers relevant to data selection for language models across all stages of training. It is meant to be a resource for the community, so please contribute if you see anything missing!
For more detail on these and other works, see our survey paper: A Survey on Data Selection for Language Models, by this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang.
Table of Contents
- Data Selection for Pretraining
- Data Selection for Instruction-Tuning and Multitask Training
- Data Selection for Preference Fine-tuning Alignment
- Data Selection for In-Context Learning
- Data Selection for Task-specific Fine-tuning
Data Selection for Pretraining
Language Filtering
- FastText.zip: Compressing text classification models: 2016
  Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov
- Learning Word Vectors for 157 Languages: 2018
  Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas
- Cross-lingual Language Model Pretraining: 2019
  Conneau, Alexis and Lample, Guillaume
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020
  Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.
- Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus: 2020
  Caswell, Isaac and Breiner, Theresa and van Esch, Daan and Bapna, Ankur
- Unsupervised Cross-lingual Representation Learning at Scale: 2020
  Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman... 4 hidden ... Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020
  Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard
- A reproduction of Apple's bi-directional LSTM models for language identification in short strings: 2021
  Toftrup, Mads and Asger Sørensen, Søren and Ciosici, Manuel R. and Assent, Ira
- Evaluating Large Language Models Trained on Code: 2021
  Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021
  Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin
- Competition-level code generation with AlphaCode: 2022
  Li, Yujia and Choi, David and Chung, Junyoung... 20 hidden ... de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol
- PaLM: Scaling Language Modeling with Pathways: 2022
  Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022
  Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine
- Writing System and Speaker Metadata for 2,800+ Language Varieties: 2022
  van Esch, Daan and Lucassen, Tamar and Ruder, Sebastian and Caswell, Isaac and Rivera, Clara
- FinGPT: Large Generative Models for a Small Language: 2023
  Luukkonen, Risto and Komulainen, Ville and Luoma, Jouni... 5 hidden ... Muennighoff, Niklas and Piktus, Aleksandra and others
- MC^2: A Multilingual Corpus of Minority Languages in China: 2023
  Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong
- Madlad-400: A multilingual and document-level large audited dataset: 2023
  Kudugunta, Sneha and Caswell, Isaac and Zhang, Biao... 5 hidden ... Stella, Romi and Bapna, Ankur and others
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023
  Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024
  Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo
Heuristic Approaches
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020
  Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.
- Language Models are Few-Shot Learners: 2020
  Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020
  Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy
- Evaluating Large Language Models Trained on Code: 2021
  Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021
  Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022
  Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022
  Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine
- HTLM: Hyper-Text Pre-Training and Prompting of Language Models: 2022
  Armen Aghajanyan and Dmytro Okhonko and Mike Lewis... 1 hidden ... Hu Xu and Gargi Ghosh and Luke Zettlemoyer
- LLaMA: Open and Efficient Foundation Language Models: 2023
  Hugo Touvron and Thibaut Lavril and Gautier Izacard... 8 hidden ... Armand Joulin and Edouard Grave and Guillaume Lample
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023
  Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay
- The foundation model transparency index: 2023
  Bommasani, Rishi and Klyman, Kevin and Longpre, Shayne... 2 hidden ... Xiong, Betty and Zhang, Daniel and Liang, Percy
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024
  Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo
Data Quality
- KenLM: Faster and Smaller Language Model Queries: 2011
  Heafield, Kenneth
- FastText.zip: Compressing text classification models: 2016
  Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov
- Learning Word Vectors for 157 Languages: 2018
  Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas
- Language Models are Unsupervised Multitask Learners: 2019
  Alec Radford and Jeff Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever
- Language Models are Few-Shot Learners: 2020
  Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020
  Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020
  Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard
- Detoxifying language models risks marginalizing minority voices: 2021
  Xu, Albert and Pathak, Eshaan and Wallace, Eric and Gururangan, Suchin and Sap, Maarten and Klein, Dan
- PaLM: Scaling Language Modeling with Pathways: 2022
  Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022
  Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection: 2022
  Gururangan, Suchin and Card, Dallas and Dreier, Sarah... 2 hidden ... Wang, Zeyu and Zettlemoyer, Luke and Smith, Noah A.
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts: 2022
  Du, Nan and Huang, Yanping and Dai, Andrew M... 21 hidden ... Wu, Yonghui and Chen, Zhifeng and Cui,