Awesome Transformer & Transfer Learning in NLP

This repository contains a hand-curated list of great machine (deep) learning resources for Natural Language Processing (NLP) with a focus on Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), attention mechanism, Transformer architectures/networks, ChatGPT, and transfer learning in NLP.

Transformer (BERT encoder)

^{Transformer (Source)}

Expand Table of Contents

Papers
Articles
Educational
- Tutorials
AI Safety
Videos
- BERTology
- Attention and Transformer Networks
Official BERT Implementations
Transformer Implementations By Communities
- PyTorch and TensorFlow
- PyTorch
- Keras
- TensorFlow
- Chainer
- Other
Transfer Learning in NLP
Books
Other Resources
Tools
Tasks

Papers

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.

Uses smart caching to improve the learning of long-term dependency in Transformer. Key results: state-of-art on 5 language modeling benchmarks, including ppl of 21.8 on One Billion Word (LM1B) and 0.99 on enwiki8. The authors claim that the method is more flexible, faster during evaluation (1874 times speedup), generalizes well on small datasets, and is effective at modeling short and long sequences.

Conditional BERT Contextual Augmentation by Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han and Songlin Hu.
SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering by Chenguang Zhu, Michael Zeng and Xuedong Huang.
Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
The Evolved Transformer by David R. So, Chen Liang and Quoc V. Le.

They used architecture search to improve Transformer architecture. Key is to use evolution and seed initial population with Transformer itself. The architecture is better and more efficient, especially for small size models.

XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.

A new pretraining method for NLP that significantly improves upon BERT on 20 tasks (e.g., SQuAD, GLUE, RACE).
"Transformer-XL is a shifted model (each hyper-column ends with next token) while XLNet is a direct model (each hyper-column ends with contextual representation of same token)." — Thomas Wolf.
Comments from HN:
A clever dual masking-and-caching algorithm.
- This is NOT "just throwing more compute" at the problem.
- The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from all possible permutations of the factorization order of all other tokens in the same input sequence.
- In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.
  - For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry".
  - In another training step, the model might see "furry" first, then "The", then "cat".
  - Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.
- The masking-and-caching algorithm that accomplishes this does not seem trivial to me.
- The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.

CTRL: Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar, Richard Socher et al. [Code].
PLMpapers - BERT (Transformer, transfer learning) has catalyzed research in pretrained language models (PLMs) and has sparked many extensions. This repo contains a list of papers on PLMs.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Google Brain.

The group perform a systematic study of transfer learning for NLP using a unified Text-to-Text Transfer Transformer (T5) model and push the limits to achieve SoTA on SuperGLUE (approaching human baseline), SQuAD, and CNN/DM benchmark. [Code].

Reformer: The Efficient Transformer by Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya.

"They present techniques to reduce the time and memory complexity of Transformer, allowing batches of very long sequences (64K) to fit on one GPU. Should pave way for Transformer to be really impactful beyond NLP domain." — @hardmaru

Supervised Multimodal Bitransformers for Classifying Images and Text (MMBT) by Facebook AI.
A Primer in BERTology: What we know about how BERT works by Anna Rogers et al.

"Have you been drowning in BERT papers?". The group survey over 40 papers on BERT's linguistic knowledge, architecture tweaks, compression, multilinguality, and so on.

Key idea: the architecture use a subset of parameters on every training step and on each example. Upside: model train much faster. Downside: super large model that won't fit in a lot of environments.

An Attention Free Transformer by Apple.
A Survey of Transformers by Tianyang Lin et al.
Evaluating Large Language Models Trained on Code by OpenAI.

Codex, a GPT language model that powers GitHub Copilot.
They investigate their model limitations (and strengths).
They discuss the potential broader impacts of deploying powerful code generation techs, covering safety, security, and economics.

Training language models to follow instructions with human feedback by OpenAI. They call the resulting models InstructGPT. ChatGPT is a sibling model to InstructGPT.
LaMDA: Language Models for Dialog Applications by Google.
Training Compute-Optimal Large Language Models by Hoffmann et al. at DeepMind. TLDR: introduces a new 70B LM called "Chinchilla" that outperforms much bigger LMs (GPT-3, Gopher). DeepMind has found the secret to cheaply scale large language models — to be compute-optimal, model size and training data must be scaled equally. It shows that most LLMs are severely starved of data and under-trained. Given the new scaling law, even if you pump a quadrillion parameters into a model (GPT-4 urban myth), the gains will not compensate for 4x more training tokens.
Improving language models by retrieving from trillions of tokens by Borgeaud et al. at DeepMind - The group explore an alternate path for efficient training with Internet-scale retrieval. The method is known as RETRO, for "Retrieval Enhanced TRansfOrmers". With RETRO the model is not limited to the data seen during training – it has access to the entire training dataset through the retrieval mechanism. This results in significant performance gains compared to a standard Transformer with the same number of parameters. RETRO obtains comparable performance to GPT-3 on the Pile dataset, despite using 25 times fewer parameters. They show that language modeling improves continuously as they increase the size of the retrieval database. [blog post]
Scaling Instruction-Finetuned Language Models by Google - They find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks. Flan-PaLM 540B achieves SoTA performance on several benchmarks. They also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B.
Emergent Abilities of Large Language Models by Google Research, Stanford University, DeepMind, and UNC Chapel Hill.
Nonparametric Masked (NPM) Language Modeling by Meta AI et al. [code] - Nonparametric models with 500x fewer parameters outperform GPT-3 on zero-shot tasks.

It, crucially, does not have a softmax over a fixed output vocabulary, but instead has a fully nonparametric distribution over phrases. This is in contrast to a recent (2022) body of work that incorporates nonparametric components in a parametric model.

Results show that NPM is significantly more parameter-efficient, outperforming up to 500x larger parametric models and up to 37x larger retrieve-and-generate models.
Transformer models: an introduction and catalog by Xavier Amatriain, 2023 - The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovation in Transformer models.
Foundation Models for Decision Making: Problems, Methods, and Opportunities by Google Research et al., 2023 - A report of recent approaches (i.e., conditional generative modeling, RL, prompting) that ground pre-trained models (i.e., LMs) in practical decision making agents. Models can serve world dynamics or steer decisions.
GPT-4 Technical Report by OpenAI, 2023.
The Llama 3 Herd of Models by Llama Team, AI @ Meta, Jul 2024 - The paper, a oft-overlooked component of the project, proved to be just as vital, if not more so, than the model itself, and its significance came as a complete surprise. A masterpiece in its own right, the paper presented a treasure trove of detailed information on the model's pre-training and post-training processes, offering insights that were both profound and practical. [Discussion]