UNILMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Hangbo Bao et al.

2020 [arXiv]

What's Unique

It introduces a unified language model pre-trained on both autoencoding (BERT-like) and partially autoregressive (XLNet-like, over spans) language modeling tasks, using a novel training procedure referred to as pseudo-masked language modeling.

How It Works

  • Conventional masks learn inter-relations between corrupted tokens and context via autoencoding.
  • Pseudo masks learn intra-relations between masked spans via partially autoregressive modeling.

Source: Author

  • The autoencoding objective remains conventional.
  • The partially autoregressive objective lets a pseudo-masked span attend to the tokens predicted in earlier factorization steps and to the corresponding masked tokens.

The following table gives an overview of how the autoencoding, autoregressive, and partially autoregressive objectives factorize the prediction of masked tokens; an illustrative factorization is sketched after the table.

Source: Author
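As an illustration (the masked positions and factorization order here are chosen for the example, not prescribed), consider a sequence x_1 … x_6 with masked positions M = {2, 4, 5}, split into the spans {4, 5} and {2}. The three objectives then factorize the prediction of x_M roughly as:

\begin{aligned}
\text{AE:}\quad & p(x_2 \mid x_{\backslash M})\, p(x_4 \mid x_{\backslash M})\, p(x_5 \mid x_{\backslash M}) \\
\text{AR (order } \langle\{4\},\{5\},\{2\}\rangle\text{):}\quad & p(x_4 \mid x_{\backslash\{2,4,5\}})\, p(x_5 \mid x_{\backslash\{2,5\}})\, p(x_2 \mid x_{\backslash\{2\}}) \\
\text{PAR (order } \langle\{4,5\},\{2\}\rangle\text{):}\quad & p(x_4 \mid x_{\backslash\{2,4,5\}})\, p(x_5 \mid x_{\backslash\{2,4,5\}})\, p(x_2 \mid x_{\backslash\{2\}})
\end{aligned}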

The following figure shows pseudo-masked language modeling for unified pre-training. The input sequence is appended with pseudo-mask tokens, as well as the original tokens, for the masked positions. With an appropriate attention mask, the model is trained on both objectives in the same forward pass; a minimal construction sketch follows the figure.

Source: Author
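A minimal sketch (not the authors' implementation) of how such an input could be assembled. The helper name, the 0-indexed span format, and the "[MASK]"/"[P]" token strings are illustrative assumptions; the appended tokens reuse the position ids of the original masked positions, as described in the paper.

```python
# Minimal sketch: build the pseudo-masked input for one example.
# tokens is a list of strings; masked_spans is a list of 0-indexed position
# lists given in the chosen factorization order, e.g. [[3, 4], [1]].

def build_pseudo_masked_input(tokens, masked_spans):
    masked_positions = {p for span in masked_spans for p in span}

    # 1) Replace each masked position in the main sequence with a [MASK] token
    #    (the conventional mask used by the autoencoding objective).
    input_tokens = ["[MASK]" if i in masked_positions else t
                    for i, t in enumerate(tokens)]
    position_ids = list(range(len(tokens)))

    # 2) For every masked span, append pseudo-mask [P] tokens (used to predict
    #    the span in the partially autoregressive objective) and the original
    #    tokens (so later factorization steps can attend to the ground truth).
    #    Both copies reuse the position ids of the original positions.
    for span in masked_spans:
        for p in span:                      # pseudo masks for this span
            input_tokens.append("[P]")
            position_ids.append(p)
        for p in span:                      # original tokens of this span
            input_tokens.append(tokens[p])
            position_ids.append(p)

    return input_tokens, position_ids


tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
# Spans {x4, x5} then {x2}, matching the illustration above (0-indexed here).
print(build_pseudo_masked_input(tokens, [[3, 4], [1]]))
```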

Model

  • Autoencoding loss

\mathcal{L}_{\mathrm{AE}}=-\sum_{x \in \mathcal{D}} \log \prod_{m \in M} p\left(x_{m} \mid x_{\backslash M}\right)

  • Partially autoregressive modeling

    • In each factorization step, the model can predict a single token or a span of multiple tokens.

    • Let M = ⟨M_1, M_2, …, M_{|M|}⟩ be the factorization order, where M_i is the set of token positions (a single token or a span) masked and predicted in factorization step i.

    • \begin{aligned}
p\left(x_{M} \mid x_{\backslash M}\right) &=\prod_{i=1}^{|M|} p\left(x_{M_{i}} \mid x_{\backslash M_{\geq i}}\right) \\
&=\prod_{i=1}^{|M|} \prod_{m \in M_{i}} p\left(x_{m} \mid x_{\backslash M_{\geq i}}\right)
\end{aligned}

    • \mathcal{L}_{\mathrm{PAR}}=-\sum_{x \in \mathcal{D}} \mathbb{E}_{M} \log p\left(x_{M} \mid x_{\backslash M}\right)
  • The following figure illustrates the implementation details at the attention-mask level.

Source: Author
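A simplified sketch (not the released implementation) of a boolean attention mask realising both objectives in one pass, assuming the input layout from the construction sketch above: context and [MASK] positions attend only to the main sequence (autoencoding), while the appended blocks of factorization step i additionally attend to the original-token blocks of earlier steps and to themselves (partially autoregressive). Fine-grained details, such as exactly how tokens within one span attend to each other, are simplified here.

```python
import numpy as np

def build_attention_mask(seq_len, masked_spans):
    """allow[i, j] is True if query position i may attend to key position j."""
    # Index ranges of the appended ([P] block, original-token block) per step,
    # laid out after the main sequence in factorization order.
    blocks = []
    cursor = seq_len
    for span in masked_spans:
        p_block = (cursor, cursor + len(span)); cursor += len(span)
        o_block = (cursor, cursor + len(span)); cursor += len(span)
        blocks.append((p_block, o_block))
    total = cursor

    allow = np.zeros((total, total), dtype=bool)

    # Autoencoding part: context and [MASK] tokens attend to the main sequence
    # only, never to the appended pseudo-mask or original-token blocks.
    allow[:seq_len, :seq_len] = True

    # Partially autoregressive part: the blocks of step i attend to the main
    # sequence, to the original-token blocks of earlier steps (< i), and to
    # themselves, but not to the original tokens of their own span.
    for i, (p_block, o_block) in enumerate(blocks):
        for (start, end) in (p_block, o_block):
            allow[start:end, :seq_len] = True
            for _, (o_start, o_end) in blocks[:i]:
                allow[start:end, o_start:o_end] = True
            allow[start:end, start:end] = True
    return allow


# Example matching the construction sketch: 6 tokens, spans {x4, x5} then {x2}.
print(build_attention_mask(6, [[3, 4], [1]]).astype(int))
```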