Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

OpenAI, 2019

Major contribution

  • Demonstrates that language models begin to learn tasks without any explicit supervision, in a zero-shot setting.
  • The capacity of the language model is essential to the success of zero-shot task transfer.
  • GPT-2, a 1.5B-parameter transformer, achieves state-of-the-art results on 7 out of 8 tested language modelling datasets.
  • Conditioned on a document plus a question, the answers generated by the language model reach 55 F1 on CoQA, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.

Training Dataset preparation

  • WebText: a new web scrape built instead of relying on Common Crawl, containing only pages that are outbound links from Reddit posts which received at least 3 karma. Deduplication and some heuristic cleaning leave about 8 million documents totalling 40 GB of text.
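A minimal sketch of the kind of filtering and deduplication described above; the karma threshold is from the paper, but the helper names and the exact-hash dedup are illustrative assumptions, not the paper's actual pipeline:

```python
import hashlib

def keep_link(karma: int) -> bool:
    """Quality heuristic from the paper: keep outbound Reddit links
    that received at least 3 karma."""
    return karma >= 3

def deduplicate(documents):
    """Illustrative exact-match dedup: drop documents whose text hashes
    to something already seen (the paper's cleaning is more involved)."""
    seen, unique = set(), []
    for doc in documents:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique
```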

Byte Pair Encoding:

  • GPT-2 uses BPE on raw bytes as a middle ground between character-level and word-level language modelling.
  • BPE is prevented from merging across character categories (letters, digits, punctuation) for any byte sequence, with an exception for spaces: a leading space may merge with the word that follows it.
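A simplified sketch of the pre-tokenisation step that enforces this: text is first split with a Unicode-aware regex so BPE merges can never cross letter/digit/punctuation boundaries, while an optional leading space stays attached to the chunk. The pattern below is a trimmed-down version of the one in the released GPT-2 code (contraction handling omitted), using the third-party regex library:

```python
import regex  # third-party `regex` module (supports \p{L}, \p{N} classes)

# Split text into chunks of letters, digits, or punctuation, each optionally
# preceded by a single space; BPE then runs within chunks only, so merges
# never cross these category boundaries.
PAT = regex.compile(r" ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")

def pre_tokenize(text: str) -> list[str]:
    return PAT.findall(text)

print(pre_tokenize("GPT-2 uses 50,257 tokens!"))
# ['GPT', '-', '2', ' uses', ' 50', ',', '257', ' tokens', '!']
```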

Architecture:

  • Transformer decoder architecture, similar to GPT.
  • Layer normalization is moved to the input of each sub-block, and an additional layer normalization is added after the final self-attention block.
  • A modified initialization is used which accounts for accumulation on the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of 1/sqrt(N), where N is the number of residual layers.
  • Vocabulary is expanded to 50,257 tokens.
  • Context size of 1024 instead of 512 in GPT.
  • Large batch size of 512 is used.
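A minimal PyTorch sketch of the two architectural changes above (pre-LayerNorm sub-blocks and 1/sqrt(N) residual scaling); the module layout and names are illustrative assumptions, not the released GPT-2 code:

```python
import math
import torch
import torch.nn as nn

class Block(nn.Module):
    """GPT-2 style pre-LayerNorm decoder block: LayerNorm is applied to the
    input of each sub-block, and the sub-block output is added to the residual
    stream. (The extra LayerNorm after the final block sits at the model level.)"""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True entries are positions a token must NOT attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)                                   # LN on the sub-block input
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                                         # residual connection 1
        x = x + self.mlp(self.ln2(x))                     # residual connection 2
        return x

def scale_residual_weights(blocks: list[Block]) -> None:
    """Scale the projections that feed the residual stream by 1/sqrt(N),
    where N is the number of residual layers (two per block)."""
    n = 2 * len(blocks)
    with torch.no_grad():
        for b in blocks:
            b.attn.out_proj.weight.mul_(1.0 / math.sqrt(n))
            b.mlp[2].weight.mul_(1.0 / math.sqrt(n))  # second Linear (output projection)

blocks = [Block(d_model=768, n_head=12) for _ in range(12)]
scale_residual_weights(blocks)
```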

Results:

  • Language models: SOTA on 7 out of 8 datasets.

  • Children's Book Test: measures performance of LMs on different categories of words, e.g. named entities, nouns, verbs. The task is a cloze test: predict the omitted word from 10 possible choices. SOTA results.

  • LAMBADA: the task is to predict the final word of sentences that require at least 50 tokens of context for a human to predict successfully. GPT-2 improves the overall SOTA by 4% (with the additional constraint that the prediction must be a valid final word of a sentence, approximated via a stop-word filter).

  • Winograd Schema Challenge: tests commonsense reasoning by resolving ambiguities in text. The LM resolves the ambiguity by choosing the resolution with the higher probability. GPT-2 achieves SOTA performance here.

  • Reading Comprehension (CoQA): greedy decoding from GPT-2, conditioned on a document, the history of the associated conversation, and a final token A:, achieves 55 F1, matching or exceeding the performance of 3 out of 4 baseline systems.

  • Summarisation: the "TL;DR:" token is appended after the article to induce summaries. With top-k random sampling (k = 2), 100 tokens are generated and the first three sentences are used as the summary. On ROUGE scores it only begins to approach the performance of classic neural baselines. (The prompt formats for this and the next two tasks are sketched after this list.)

  • Translation: the model is conditioned on a context of English–French sentence pairs in the format "english sentence = french sentence", followed by a final prompt ending in "=". The BLEU score is much worse than SOTA, but the model has begun to learn the task.

  • Question Answering: the context of the language model is seeded with example question–answer pairs, which helps the model infer the short-answer style of the dataset. GPT-2 is still much worse than SOTA open-domain question answering systems, but performance improves as the model size grows.

  • Unsupervised task learning is an additional promising area of research to explore.
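A sketch of the zero-shot prompt formats described above, using the Hugging Face transformers library and the publicly released small gpt2 checkpoint (not the 1.5B model from the paper); the example texts are illustrative placeholders:

```python
from transformers import pipeline

# Publicly released small GPT-2 checkpoint; the paper's results use the 1.5B model.
generator = pipeline("text-generation", model="gpt2")

article = "Scientists announced a new exoplanet discovery on Monday. ..."  # placeholder text

# Summarisation: append "TL;DR:" and keep the first few generated sentences.
summary_prompt = article + "\nTL;DR:"

# Translation: condition on "english sentence = french sentence" pairs,
# then leave the final French side blank.
translation_prompt = (
    "good morning = bonjour\n"
    "thank you very much = merci beaucoup\n"
    "where is the train station ="
)

# Question answering: seed the context with example Q/A pairs so the model
# infers the short-answer style.
qa_prompt = (
    "Q: Who wrote the play Hamlet?\nA: William Shakespeare\n"
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: Who developed the theory of general relativity?\nA:"
)

for prompt in (summary_prompt, translation_prompt, qa_prompt):
    out = generator(prompt, max_new_tokens=40, do_sample=True, top_k=2)
    print(out[0]["generated_text"])
```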