Improving Language Understanding by Generative Pre-Training
1. Overview
- GPT (Generative Pre-Training) is a semi-supervised approach for language understanding tasks, using a combination of unsupervised pre-training and supervised fine-tuning.
- Goal: learn a universal representation that transfers with little adaptation to a wide range of tasks.
- Assumption: We have a large corpus of unlabeled text and several annotated training sets.
2. Two-stage training procedure
- Unsupervised pre-training: Use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model (this paper uses the Transformer (Vaswani et al., 2017) as its model architecture).
- Supervised fine-tuning: Adapt these parameters to a target task using the corresponding supervised objective.
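
As a rough sketch of how the two stages fit together, the loop below (an illustration, not the paper's released code) runs stage 1 on unlabeled batches and stage 2 on labeled batches. The helpers `lm_loss` and `finetune_loss` refer to the sketches in Sections 3 and 4 below; the learning rates are the maxima reported in the paper, and the warmup/decay schedules are omitted.

```python
# Sketch of the two-stage schedule. Assumes the TinyGPT/GPTClassifier, lm_loss, and
# finetune_loss sketches from Sections 3 and 4 below; the loop structure is simplified.
import torch

def two_stage_train(model, clf, unlabeled_batches, labeled_batches):
    # Stage 1: unsupervised pre-training with the language modeling objective L1.
    opt = torch.optim.Adam(model.parameters(), lr=2.5e-4)  # paper's max LR; schedule omitted
    for u in unlabeled_batches:
        opt.zero_grad()
        lm_loss(model, u).backward()
        opt.step()

    # Stage 2: supervised fine-tuning of the same parameters (plus the new output head).
    opt = torch.optim.Adam(clf.parameters(), lr=6.25e-5)   # paper's fine-tuning LR
    for x, y in labeled_batches:
        opt.zero_grad()
        finetune_loss(clf, x, y).backward()
        opt.step()
```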
3. Unsupervised pre-training
Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$.
- A multi-layer Transformer applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feedforward layers, to produce an output distribution over target tokens:

$$h_0 = U W_e + W_p$$

$$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$

$$P(u) = \text{softmax}(h_n W_e^\top)$$

where $U = (u_{-k}, \dots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
- The objective is to maximize the following likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

where $k$ is the size of the context window and $\Theta$ denotes the model's parameters.
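
For concreteness, here is a minimal PyTorch sketch of this model and the $L_1$ loss. It is my own illustration rather than the paper's code: the sizes are toy values, and a stock `nn.TransformerEncoder` with a causal mask stands in for the paper's 12-layer, 768-dimensional decoder with GELU activations and BPE tokens.

```python
# Minimal sketch of the pre-training model and the L1 objective (illustrative sizes/names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_layers=4, n_heads=4, k=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # token embedding matrix W_e
        self.pos_emb = nn.Parameter(torch.zeros(k, d_model))  # position embedding matrix W_p
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def hidden(self, u):
        """Final transformer block states h_n for token ids u of shape (batch, seq)."""
        t = u.size(1)
        h = self.tok_emb(u) + self.pos_emb[:t]                 # h_0 = U W_e + W_p
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.blocks(h, mask=causal)                     # h_l = transformer_block(h_{l-1})

    def forward(self, u):
        return self.hidden(u) @ self.tok_emb.weight.T          # logits = h_n W_e^T (tied weights)

def lm_loss(model, u):
    """L1: cross-entropy of predicting each token from its preceding context."""
    logits = model(u[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), u[:, 1:].reshape(-1))

# Toy usage: two sequences of 64 random token ids.
model = TinyGPT()
tokens = torch.randint(0, 10000, (2, 64))
lm_loss(model, tokens).backward()
```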
4. Supervised fine-tuning
Given a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens $x^1, \dots, x^m$ along with a label $y$.
- Pass the inputs through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$$

- The objective is to maximize the following likelihood:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$

- Including language modeling as an auxiliary objective during fine-tuning not only improves the generalization of the supervised model but also accelerates convergence. With weight $\lambda$, the combined objective is:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$
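
Continuing the sketch above, the snippet below adds the linear output layer $W_y$ and combines the supervised loss with the auxiliary LM loss. The wrapper class and names are my own; taking the final position's state as $h_l^m$ and setting $\lambda = 0.5$ follow the paper.

```python
# Sketch of fine-tuning with the auxiliary LM objective (reuses TinyGPT and lm_loss above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPTClassifier(nn.Module):
    def __init__(self, pretrained, d_model=256, n_classes=2):
        super().__init__()
        self.gpt = pretrained                                 # pre-trained transformer body
        self.w_y = nn.Linear(d_model, n_classes, bias=False)  # added linear output layer W_y

    def forward(self, x):
        h = self.gpt.hidden(x)        # activations of the final transformer block
        return self.w_y(h[:, -1])     # predict y from the last position's state h_l^m

def finetune_loss(clf, x, y, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C): supervised loss plus auxiliary LM loss."""
    task = F.cross_entropy(clf(x), y)   # L2: supervised objective
    aux = lm_loss(clf.gpt, x)           # L1 computed on the same labeled sequences
    return task + lam * aux

# Toy usage: two labeled sequences of 64 tokens.
clf = GPTClassifier(TinyGPT())
x = torch.randint(0, 10000, (2, 64))
y = torch.tensor([0, 1])
finetune_loss(clf, x, y).backward()
```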
5. Task-specific input transformations
All of the following transformations include adding randomly initialized start and end tokens ($\langle s \rangle$, $\langle e \rangle$).
- Textual entailment: concatenate the premise and hypothesis token sequences with a delimiter token (`$`) in between (a small sketch of this case follows the list).
- Similarity: since there is no inherent ordering of the two sentences, both possible orderings (with a delimiter in between) are processed independently, and the two sequence representations are added element-wise before the linear output layer.
- Question answering and commonsense reasoning: concatenate the document context, the question, and each possible answer (with a delimiter in between); each resulting sequence is processed independently, and the outputs are normalized via a softmax layer to produce a distribution over the possible answers.
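
As a small illustration of the entailment case, the snippet below serializes a premise/hypothesis pair into a single token sequence. The special-token id values and the helper name are arbitrary assumptions; in practice the embedding matrix is extended to cover the added tokens.

```python
# Sketch of the entailment input transformation: one sequence framed by start/end tokens,
# with a delimiter between premise and hypothesis (id values are illustrative only).
START, DELIM, END = 10000, 10001, 10002  # randomly initialized special tokens added to the vocabulary

def entailment_input(premise_ids, hypothesis_ids):
    """Return [<s>; premise; $; hypothesis; <e>] as a single list of token ids."""
    return [START] + premise_ids + [DELIM] + hypothesis_ids + [END]

print(entailment_input([5, 17, 42], [7, 99]))
# -> [10000, 5, 17, 42, 10001, 7, 99, 10002]
```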
References
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).