SpanBERT: Improving Pre-training by Representing and Predicting Spans
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy
Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020
Major Contributions:
Masking Random Contiguous Spans:
- Sample a span length from a geometric distribution, l ~ Geo(p = 0.2), clipped at l_max = 10 (mean span length ~3.8)
- Randomly select the starting point of the span to be masked
- Repeat until 15% of the tokens are masked
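The masking loop above can be sketched in plain Python. This is a simplified illustration (function and parameter names are mine): it always substitutes [MASK], ignoring the paper's 80/10/10 replacement rule and whole-word span boundaries.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Sample l ~ Geo(p), clipped at max_len (paper: p=0.2, l_max=10)."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def mask_spans(tokens, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask random contiguous spans until ~15% of the tokens are masked."""
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # total tokens to mask
    masked = set()
    while len(masked) < budget:
        # Clip the sampled length so we never exceed the 15% budget.
        length = min(sample_span_length(), budget - len(masked))
        start = random.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # re-sample instead of overlapping an existing span
        masked.update(span)
    return [mask_token if i in masked else t for i, t in enumerate(tokens)], masked
```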
Span Boundary Objective (SBO):
- Predict the entire masked span using only the observed tokens at its boundary.
- Encourages the model to store span-level information at the boundary, which is easily accessible during fine-tuning.
- SpanBERT drops the NSP objective and adds SBO instead.
- Given a masked span (x_s, ..., x_e), each token x_i in the span is represented using the output encodings of the external boundary tokens x_{s-1} and x_{e+1}, plus a position embedding p_{i-s+1} marking its relative position within the span:
    y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1})
- The function f is implemented as a 2-layer feed-forward network with GELU activations and layer normalization:
    h_0 = [x_{s-1}; x_{e+1}; p_{i-s+1}]
    h_1 = LayerNorm(GELU(W_1 h_0))
    y_i = LayerNorm(GELU(W_2 h_1))
- Cross-entropy loss is computed for each predicted token in the span, exactly as in MLM.
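The 2-layer SBO head can be sketched in NumPy following the equations above. This is a sketch with untrained random weights; the weight shapes and the name `sbo_head` are my assumptions, not the paper's code.

```python
import numpy as np

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sbo_head(x_left, x_right, pos_emb, W1, W2):
    """SBO head: predict y_i from the two boundary encodings x_{s-1}, x_{e+1}
    and the relative position embedding p_{i-s+1} (2-layer FFN, GELU + LayerNorm)."""
    h0 = np.concatenate([x_left, x_right, pos_emb])  # [3d]
    h1 = layer_norm(gelu(W1 @ h0))                   # [d], W1 is (d, 3d)
    y = layer_norm(gelu(W2 @ h1))                    # [d], W2 is (d, d)
    return y
```

In the full model, y_i is then fed to the output (vocabulary) layer to score the masked token, just like an MLM prediction.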
Single-Sequence BERT:
- Pre-train on single contiguous segments (up to 512 tokens) instead of two half-length segments
- The NSP objective is not used
Objective Function:
- SpanBERT sums the MLM and SBO losses for each token x_i in a masked span:
    L(x_i) = L_MLM(x_i) + L_SBO(x_i)
           = -log P(x_i | x_i) - log P(x_i | y_i)
  where x_i is the transformer's output encoding at position i and y_i is the span-boundary representation.
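A minimal NumPy sketch of the summed loss for one span token. The logits here are illustrative, and `spanbert_loss` is a hypothetical helper name: it adds the cross-entropy from the transformer output (MLM) to the cross-entropy from the span-boundary representation (SBO).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a vocabulary."""
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def spanbert_loss(mlm_logits, sbo_logits, target_id):
    """L(x_i) = L_MLM(x_i) + L_SBO(x_i): summed cross-entropy of the true
    token id under the MLM head and under the SBO head."""
    l_mlm = -log_softmax(mlm_logits)[target_id]
    l_sbo = -log_softmax(sbo_logits)[target_id]
    return l_mlm + l_sbo
```

With uniform logits over a vocabulary of size V, each term is log(V), so the summed loss is 2 log(V).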
Example:
- From Figure 1 of the paper: in the sentence "Super Bowl 50 was an American football game to determine the champion ...", the span "an American football game" (x_5 ... x_8) is masked.
- MLM predicts each masked token (e.g. "football" at x_7) from that position's own encoding.
- SBO predicts "football" using only the boundary encodings x_4 ("was") and x_9 ("to"), together with the position embedding p_3 (third position within the span).