Confusion about past_key_values and attention_mask in GPT2Attention #16811

Closed
wiio12 opened this issue Apr 17, 2022 · 5 comments · Fixed by #16829

wiio12 (Contributor) commented Apr 17, 2022

Environment info

  • transformers version: 4.12.5

Models:

Information

When reading through the code in modeling_gpt2, I got confused about how attention_mask is used. The code concatenates the past key and value with the key and value of the current hidden_states. Here is the relevant part of GPT2Attention's forward method:

        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)

        if layer_past is not None:
            past_key, past_value = layer_past
            key = torch.cat((past_key, key), dim=-2)
            value = torch.cat((past_value, value), dim=-2)

        if use_cache is True:
            present = (key, value)
        else:
            present = None

        if self.reorder_and_upcast_attn:
            attn_output, attn_weights = self._upcast_and_reordered_attn(query, key, value, attention_mask, head_mask)
        else:
            attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
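
For concreteness, a quick shape check of that concatenation (the dimensions are made up, loosely matching GPT-2 small's 12 heads of size 64):

    import torch

    batch, n_head, head_dim = 2, 12, 64
    past_length, new_length = 5, 1  # cache of 5 tokens, decoding 1 new token

    past_key = torch.randn(batch, n_head, past_length, head_dim)
    key = torch.randn(batch, n_head, new_length, head_dim)

    key = torch.cat((past_key, key), dim=-2)
    print(key.shape)  # torch.Size([2, 12, 6, 64]) -> key_length = past_length + new_length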

However, later on, when an attention_mask is passed, the code adds it directly to the attention weights. Here is the relevant part of the _attn method:

def _attn(self, query, key, value, attention_mask=None, head_mask=None):
        attn_weights = torch.matmul(query, key.transpose(-1, -2))

        if self.scale_attn_weights:
            attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)

        # Layer-wise attention scaling
        if self.scale_attn_by_inverse_layer_idx:
            attn_weights = attn_weights / float(self.layer_idx + 1)

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
            attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.Softmax(dim=-1)(attn_weights)

attn_weights has shape [batch, n_head, query_length, key_length], while attention_mask has shape [batch, 1, 1, seq_length]. Does this imply that the input attention mask's seq_length must match the full context length key_length rather than query_length? In other words, when we use past_key_values, must the attention_mask cover the tokens from past_key_values as well as those from input_ids, instead of only the tokens from input_ids?
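
For example, with made-up numbers, the addition only broadcasts when the mask's last dimension equals key_length:

    import torch

    batch, n_head = 2, 12
    past_length, new_length = 5, 3
    query_length, key_length = new_length, past_length + new_length

    attn_weights = torch.randn(batch, n_head, query_length, key_length)

    # The real extended mask holds 0.0 for visible positions and a large negative
    # value for masked ones; zeros are enough for a shape check.
    full_mask = torch.zeros(batch, 1, 1, key_length)      # spans past + current tokens
    short_mask = torch.zeros(batch, 1, 1, query_length)   # spans only the new tokens

    attn_weights + full_mask    # broadcasts over n_head and query_length
    # attn_weights + short_mask # would raise RuntimeError: sizes 8 and 3 don't match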

patrickvonplaten (Contributor):

Great question @wiio12!

You're exactly right: attention_mask needs to contain the masking strategy that was used for past_key_values. In other words, attention_mask always has to have length len(past_key_values) + len(input_ids).
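
For example, when decoding step by step with the cache, you'd extend the mask along with it (rough, illustrative sketch):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is", return_tensors="pt")
    out = model(**inputs, use_cache=True)

    # Only the newly chosen token is fed as input_ids ...
    next_token = out.logits[:, -1:].argmax(dim=-1)

    # ... but attention_mask is extended so that it covers
    # len(past_key_values) + len(input_ids) positions
    attention_mask = torch.cat(
        [inputs["attention_mask"], torch.ones_like(next_token)], dim=-1
    )

    out = model(
        input_ids=next_token,
        attention_mask=attention_mask,
        past_key_values=out.past_key_values,
        use_cache=True,
    )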

wiio12 (Contributor, Author) commented Apr 19, 2022

Thank you for your response @patrickvonplaten, very clear! Now I understand how past_key_values and attention_mask work together.

I wonder whether this constraint is mentioned anywhere in the documentation; otherwise, users may hit a dimension-mismatch error without knowing why it happens.

patrickvonplaten (Contributor) commented Apr 19, 2022

Think it'd be a good idea to document this somewhere! Would you like to add a sentence to the documentation of the attention_mask parameter in GPT2?

wiio12 (Contributor, Author) commented Apr 19, 2022

Not sure I did it correctly, but I changed the docstring in modeling_gpt2 and modeling_tf_gpt2 and opened PR #16829.

Correct me if I did it wrong :)

patrickvonplaten (Contributor):

Looks great!
