Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

OpenAI, 2019

Major contribution

  • Demonstrates that language models begin to learn tasks without any explicit supervision, in a zero-shot setting.
  • The capacity of the language model is essential to the success of zero-shot task transfer.
  • GPT-2, a 1.5B-parameter transformer, achieves state-of-the-art results on 7 out of 8 tested language modelling datasets.
  • Conditioned on a document plus a question, the answers generated by the language model reach 55 F1 on CoQA, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.

Training Dataset preparation

  • WebText: a new web scrape built instead of relying on Common Crawl, containing only pages that are outbound links from Reddit posts which received at least 3 karma. Deduplication and some heuristic cleaning leave about 8 million documents totalling 40 GB of text.
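A minimal sketch of the kind of filtering and deduplication described above; the karma threshold is from the paper, but the helper names and the exact-hash dedup are illustrative assumptions, not the paper's actual pipeline:

```python
import hashlib

def keep_link(karma: int) -> bool:
    """Quality heuristic from the paper: keep outbound Reddit links
    that received at least 3 karma."""
    return karma >= 3

def deduplicate(documents):
    """Illustrative exact-match dedup: drop documents whose text hashes
    to something already seen (the paper's cleaning is more involved)."""
    seen, unique = set(), []
    for doc in documents:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique
```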

Byte Pair Encoding:

  • GPT-2 uses BPE on raw bytes as a middle ground between character-level and word-level language modelling.
  • BPE is prevented from merging across character categories (letters, digits, punctuation) for any byte sequence, with an exception for spaces: a leading space may merge with the word that follows it.
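A simplified sketch of the pre-tokenisation step that enforces this: text is first split with a Unicode-aware regex so BPE merges can never cross letter/digit/punctuation boundaries, while an optional leading space stays attached to the chunk. The pattern below is a trimmed-down version of the one in the released GPT-2 code (contraction handling omitted), using the third-party regex library:

```python
import regex  # third-party `regex` module (supports \p{L}, \p{N} classes)

# Split text into chunks of letters, digits, or punctuation, each optionally
# preceded by a single space; BPE then runs within chunks only, so merges
# never cross these category boundaries.
PAT = regex.compile(r" ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")

def pre_tokenize(text: str) -> list[str]:
    return PAT.findall(text)

print(pre_tokenize("GPT-2 uses 50,257 tokens!"))
# ['GPT', '-', '2', ' uses', ' 50', ',', '257', ' tokens', '!']
```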

Architecture:

  • Transformer decoder architecture, similar to GPT.
  • Layer normalization is moved to the input of each sub-block, and an additional layer normalization is added after the final self-attention block.
  • A modified initialization is used which accounts for accumulation on the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of 1/sqrt(N), where N is the number of residual layers.
  • Vocabulary is expanded to 50,257 tokens.
  • Context size of 1024 instead of 512 in GPT.
  • Large batch size of 512 is used.
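A minimal PyTorch sketch of the two architectural changes above (pre-LayerNorm sub-blocks and 1/sqrt(N) residual scaling); the module layout and names are illustrative assumptions, not the released GPT-2 code:

```python
import math
import torch
import torch.nn as nn

class Block(nn.Module):
    """GPT-2 style pre-LayerNorm decoder block: LayerNorm is applied to the
    input of each sub-block, and the sub-block output is added to the residual
    stream. (The extra LayerNorm after the final block sits at the model level.)"""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True entries are positions a token must NOT attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)                                   # LN on the sub-block input
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                                         # residual connection 1
        x = x + self.mlp(self.ln2(x))                     # residual connection 2
        return x

def scale_residual_weights(blocks: list[Block]) -> None:
    """Scale the projections that feed the residual stream by 1/sqrt(N),
    where N is the number of residual layers (two per block)."""
    n = 2 * len(blocks)
    with torch.no_grad():
        for b in blocks:
            b.attn.out_proj.weight.mul_(1.0 / math.sqrt(n))
            b.mlp[2].weight.mul_(1.0 / math.sqrt(n))  # second Linear (output projection)

blocks = [Block(d_model=768, n_head=12) for _ in range(12)]
scale_residual_weights(blocks)
```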

Results:

  • Language models: SOTA on 7 out of 8 datasets.

  • Children's Book Test: measures performance of LMs on different categories of words, e.g. named entities, nouns, verbs. The task is a cloze test: predict the omitted word from 10 possible choices. SOTA results.

  • LAMBADA: the task is to predict the final word of sentences that require at least 50 tokens of context for a human to predict successfully. GPT-2 improves the overall SOTA by 4% (with the additional constraint that the prediction must be a valid final word of a sentence, approximated via a stop-word filter).

  • Winograd Schema Challenge: tests commonsense reasoning by resolving ambiguities in text. The LM resolves the ambiguity by choosing the resolution with the higher probability. GPT-2 achieves SOTA performance here.

  • Reading Comprehension (CoQA): greedy decoding from GPT-2, conditioned on a document, the history of the associated conversation, and a final token A:, achieves 55 F1, matching or exceeding the performance of 3 out of 4 baseline systems.

  • Summarisation: the "TL;DR:" token is appended after the article to induce summaries. With top-k random sampling (k = 2), 100 tokens are generated and the first three sentences are used as the summary. On ROUGE scores it only begins to approach the performance of classic neural baselines. (The prompt formats for this and the next two tasks are sketched after this list.)

  • Translation: the model is conditioned on a context of English–French sentence pairs in the format "english sentence = french sentence", followed by a final prompt ending in "=". The BLEU score is much worse than SOTA, but the model has begun to learn the task.

  • Question Answering: the context of the language model is seeded with example question–answer pairs, which helps the model infer the short-answer style of the dataset. GPT-2 is still much worse than SOTA open-domain question answering systems, but performance improves as the model size grows.

  • Unsupervised task learning is an additional promising area of research to explore.
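A sketch of the zero-shot prompt formats described above, using the Hugging Face transformers library and the publicly released small gpt2 checkpoint (not the 1.5B model from the paper); the example texts are illustrative placeholders:

```python
from transformers import pipeline

# Publicly released small GPT-2 checkpoint; the paper's results use the 1.5B model.
generator = pipeline("text-generation", model="gpt2")

article = "Scientists announced a new exoplanet discovery on Monday. ..."  # placeholder text

# Summarisation: append "TL;DR:" and keep the first few generated sentences.
summary_prompt = article + "\nTL;DR:"

# Translation: condition on "english sentence = french sentence" pairs,
# then leave the final French side blank.
translation_prompt = (
    "good morning = bonjour\n"
    "thank you very much = merci beaucoup\n"
    "where is the train station ="
)

# Question answering: seed the context with example Q/A pairs so the model
# infers the short-answer style.
qa_prompt = (
    "Q: Who wrote the play Hamlet?\nA: William Shakespeare\n"
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: Who developed the theory of general relativity?\nA:"
)

for prompt in (summary_prompt, translation_prompt, qa_prompt):
    out = generator(prompt, max_new_tokens=40, do_sample=True, top_k=2)
    print(out[0]["generated_text"])
```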