SpanBERT: Improving Pre-training by Representing and Predicting Spans
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy
Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020
Major Contributions:
Masking Random Contiguous Spans:
- Sample a span length from a geometric distribution, l ~ Geo(p = 0.2), clipped at l_max = 10 (mean span length ~3.8)
- Randomly select the starting point of the span to be masked
- Repeat until 15% of the tokens are masked
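The masking loop above can be sketched in plain Python. This is a simplified illustration (function and parameter names are mine): it always substitutes [MASK], ignoring the paper's 80/10/10 replacement rule and whole-word span boundaries.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Sample l ~ Geo(p), clipped at max_len (paper: p=0.2, l_max=10)."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def mask_spans(tokens, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask random contiguous spans until ~15% of the tokens are masked."""
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # total tokens to mask
    masked = set()
    while len(masked) < budget:
        # Clip the sampled length so we never exceed the 15% budget.
        length = min(sample_span_length(), budget - len(masked))
        start = random.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # re-sample instead of overlapping an existing span
        masked.update(span)
    return [mask_token if i in masked else t for i, t in enumerate(tokens)], masked
```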
Span Boundary Objective (SBO):
- Predict the entire masked span using only the observed tokens at its boundary.
- Encourages the model to store span-level information at the boundary, which is easily accessible during fine-tuning.
- SpanBERT drops the NSP objective and adds SBO instead.
- Given a masked span (x_s, ..., x_e), each token x_i in the span is represented using the output encodings of the external boundary tokens x_{s-1} and x_{e+1}, plus a position embedding p_{i-s+1} marking its relative position within the span:
    y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1})
- The function f is implemented as a 2-layer feed-forward network with GELU activations and layer normalization:
    h_0 = [x_{s-1}; x_{e+1}; p_{i-s+1}]
    h_1 = LayerNorm(GELU(W_1 h_0))
    y_i = LayerNorm(GELU(W_2 h_1))
- Cross-entropy loss is computed for each predicted token in the span, exactly as in MLM.
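The 2-layer SBO head can be sketched in NumPy following the equations above. This is a sketch with untrained random weights; the weight shapes and the name `sbo_head` are my assumptions, not the paper's code.

```python
import numpy as np

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sbo_head(x_left, x_right, pos_emb, W1, W2):
    """SBO head: predict y_i from the two boundary encodings x_{s-1}, x_{e+1}
    and the relative position embedding p_{i-s+1} (2-layer FFN, GELU + LayerNorm)."""
    h0 = np.concatenate([x_left, x_right, pos_emb])  # [3d]
    h1 = layer_norm(gelu(W1 @ h0))                   # [d], W1 is (d, 3d)
    y = layer_norm(gelu(W2 @ h1))                    # [d], W2 is (d, d)
    return y
```

In the full model, y_i is then fed to the output (vocabulary) layer to score the masked token, just like an MLM prediction.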
Single-Sequence BERT:
- Pre-train on single contiguous segments (up to 512 tokens) instead of two half-length segments
- The NSP objective is not used
Objective Function:
- SpanBERT sums the MLM and SBO losses for each token x_i in a masked span:
    L(x_i) = L_MLM(x_i) + L_SBO(x_i)
           = -log P(x_i | x_i) - log P(x_i | y_i)
  where x_i is the transformer's output encoding at position i and y_i is the span-boundary representation.
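A minimal NumPy sketch of the summed loss for one span token. The logits here are illustrative, and `spanbert_loss` is a hypothetical helper name: it adds the cross-entropy from the transformer output (MLM) to the cross-entropy from the span-boundary representation (SBO).

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a vocabulary."""
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def spanbert_loss(mlm_logits, sbo_logits, target_id):
    """L(x_i) = L_MLM(x_i) + L_SBO(x_i): summed cross-entropy of the true
    token id under the MLM head and under the SBO head."""
    l_mlm = -log_softmax(mlm_logits)[target_id]
    l_sbo = -log_softmax(sbo_logits)[target_id]
    return l_mlm + l_sbo
```

With uniform logits over a vocabulary of size V, each term is log(V), so the summed loss is 2 log(V).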
Example:
- From Figure 1 of the paper: in the sentence "Super Bowl 50 was an American football game to determine the champion ...", the span "an American football game" (x_5 ... x_8) is masked.
- MLM predicts each masked token (e.g. "football" at x_7) from that position's own encoding.
- SBO predicts "football" using only the boundary encodings x_4 ("was") and x_9 ("to"), together with the position embedding p_3 (third position within the span).