ML Papers Explained

Explanations of key concepts in ML

Language Models

Paper	Date	Description
Transformer	June 2017	An encoder-decoder model that introduced the multi-head attention mechanism for the language translation task.
ELMo	February 2018	Deep contextualized word representations that capture both complex characteristics of word usage and how that usage varies across linguistic contexts.
Marian MT	April 2018	A Neural Machine Translation framework written entirely in C++ with minimal dependencies, designed for high training and translation speed.
GPT	June 2018	A decoder-only Transformer that is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations.
BERT	October 2018	Introduced pre-training for encoder Transformers. Uses a unified architecture across different tasks.
Transformer XL	January 2019	Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism.
XLM	January 2019	Proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
GPT 2	February 2019	Demonstrates that language models begin to learn various language processing tasks without any explicit supervision.
Sparse Transformer	April 2019	Introduced sparse factorizations of the attention matrix to reduce time and memory consumption to O(n√n) in the sequence length n.
UniLM	May 2019	Utilizes a shared Transformer network and specific self-attention masks to excel in both language understanding and generation tasks.
XLNet	June 2019	An extension of Transformer-XL, pre-trained using a new method that combines ideas from autoregressive (AR) and autoencoding (AE) objectives.
RoBERTa	July 2019	Builds upon BERT by carefully optimizing hyperparameters and training data size to improve performance on various language tasks.
Sentence BERT	August 2019	A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine similarity.
CTRL	September 2019	A 1.63B-parameter language model that can generate text conditioned on control codes that govern style, content, and task-specific behavior, allowing for more explicit control over text generation.
Tiny BERT	September 2019	Uses attention transfer and task-specific distillation to distill BERT.
ALBERT	September 2019	Presents certain parameter reduction techniques to lower memory consumption and increase the training speed of BERT.
Distil BERT	October 2019	Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective.
T5	October 2019	A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format.
BART	October 2019	An Encoder-Decoder pretrained to reconstruct the original text from corrupted versions of it.
XLM-RoBERTa	November 2019	A multilingual masked language model pre-trained on text in 100 languages, showing that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.
Pegasus	December 2019	A self-supervised pre-training objective for abstractive text summarization that removes or masks important sentences from an input document and generates them together as one output sequence.
Reformer	January 2020	Improves the efficiency of Transformers by replacing dot-product attention with locality-sensitive hashing (O(L log L) complexity), using reversible residual layers to store activations only once, and splitting feed-forward layer activations into chunks, allowing it to perform on par with Transformer models while being much more memory-efficient and faster on long sequences.
mBART	January 2020	A multilingual sequence-to-sequence denoising auto-encoder that pre-trains a complete autoregressive model on large-scale monolingual corpora across many languages using the BART objective, achieving significant performance gains in machine translation tasks.
UniLMv2	February 2020	Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks, significantly advancing the capabilities of language models in diverse NLP tasks.
ELECTRA	March 2020	Proposes a sample-efficient pre-training task called replaced token detection, which corrupts input by replacing some tokens with plausible alternatives and trains a discriminative model to predict whether each token was replaced or not.
FastBERT	April 2020	A speed-tunable encoder with adaptive inference time having branches at each transformer output to enable early outputs.
MobileBERT	April 2020	A compressed and faster version of BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer.
Longformer	April 2020	Introduces a linearly scalable attention mechanism, allowing it to handle texts of extended length.
GPT 3	May 2020	Demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance.
DeBERTa	June 2020	Enhances BERT and RoBERTa through disentangled attention mechanisms, an enhanced mask decoder, and virtual adversarial training.
DeBERTa v2	June 2020	An enhanced version of DeBERTa featuring a new vocabulary, nGiE integration, optimized attention mechanisms, additional model sizes, and improved tokenization.
T5 v1.1	July 2020	An enhanced version of the original T5 model, featuring improvements such as GEGLU activation, no dropout in pre-training, exclusive pre-training on C4, and no parameter sharing between embedding and classifier layers.
mT5	October 2020	A multilingual variant of T5 based on T5 v1.1, pre-trained on a new Common Crawl-based dataset covering 101 languages (mC4).
Codex	July 2021	A GPT language model finetuned on publicly available code from GitHub.
FLAN	September 2021	An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions.
T0	October 2021	An encoder-decoder model fine-tuned on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets.
DeBERTa V3	November 2021	Enhances the DeBERTa architecture by introducing replaced token detection (RTD) instead of masked language modeling (MLM), along with a novel gradient-disentangled embedding sharing method, exhibiting superior performance across various natural language understanding tasks.
WebGPT	December 2021	A fine-tuned GPT-3 model utilizing text-based web browsing, trained via imitation learning and human feedback, enhancing its ability to answer long-form questions with factual accuracy.
Gopher	December 2021	Provides a comprehensive analysis of the performance of various Transformer models across different scales up to 280B parameters on 152 tasks.
LaMDA	January 2022	Transformer based models specialized for dialog, which are pre-trained on public dialog data and web text.
Instruct GPT	March 2022	Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent.
CodeGen	March 2022	An LLM trained for program synthesis using input-output examples and natural language descriptions.
Chinchilla	March 2022	Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (Scaling Laws).
PaLM	April 2022	A 540B-parameter, densely activated Transformer trained using Pathways, an ML system that enables highly efficient training across multiple TPU Pods.
GPT-NeoX-20B	April 2022	An autoregressive LLM trained on the Pile, and the largest dense model that had publicly available weights at the time of submission.
OPT	May 2022	A suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, with OPT-175B being comparable to GPT-3.
Flan T5, Flan PaLM	October 2022	Explores instruction fine tuning with a particular focus on scaling the number of tasks, scaling the model size, and fine tuning on chain-of-thought data.
BLOOM	November 2022	A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology.
BLOOMZ, mT0	November 2022	Applies multitask prompted fine-tuning to pretrained multilingual models on English tasks with English prompts to attain task generalization to non-English languages that appear only in the pretraining corpus.
Galactica	November 2022	An LLM trained on scientific data, thus specializing in scientific knowledge.
ChatGPT	November 2022	An interactive model designed to engage in conversations, built on top of GPT 3.5.
Self Instruct	December 2022	A framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations.
LLaMA	February 2023	A collection of foundation LLMs by Meta ranging from 7B to 65B parameters, trained using publicly available datasets exclusively.
Toolformer	February 2023	An LLM trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
Alpaca	March 2023	A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.
GPT 4	March 2023	A multimodal transformer model pre-trained to predict the next token in a document, which can accept image and text inputs and produce text outputs.
Vicuna	March 2023	A 13B LLaMA chatbot fine tuned on user-shared conversations collected from ShareGPT, capable of generating more detailed and well-structured answers compared to Alpaca.
BloombergGPT	March 2023	A 50B-parameter language model trained on general-purpose and domain-specific data to support a wide range of tasks within the financial industry.
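
The Sentence BERT entry in the table above mentions comparing sentence embeddings using cosine similarity. A minimal sketch of that comparison step is shown here in plain Python. The vectors `emb_a`, `emb_b`, and `emb_c` are made-up numbers chosen for illustration and are not the output of any model, and the `cosine_similarity` helper is a hypothetical name, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional "sentence embeddings" (illustrative values only).
emb_a = [0.2, 0.7, 0.1, 0.0]
emb_b = [0.4, 1.4, 0.2, 0.0]  # same direction as emb_a, twice the length
emb_c = [0.0, 0.0, 0.0, 1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # 0: no shared direction
```

Because cosine similarity depends only on direction, scaling an embedding does not change the score, which is one reason it is a common choice for comparing fixed-size sentence vectors.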
