BackPACK with simple attention and additional layers #326

Open · nhianK opened this issue May 9, 2024 · 1 comment

nhianK commented May 9, 2024

I want to use BackPACK for computing per-sample gradients and was trying to understand the challenges of using it with a custom model built from PyTorch nn layers. For example, something like this architecture: https://github.com/codingchild2424/MonaCoBERT/blob/master/src/models/monacobert.py

Some of the basic layers used for computing attention:
self.query = nn.Linear(hidden_size, self.all_head_size, bias=False) # 512 -> 256
self.key = nn.Linear(hidden_size, self.all_head_size, bias=False) # 512 -> 256
self.value = nn.Linear(hidden_size, self.all_head_size, bias=False)

The model also has a trainable nn.Parameter:
self.gammas = nn.Parameter(torch.zeros(self.num_attention_heads, 1, 1))
And some convolutional layers.

What could be some of the challenges I might face while using a model like that and potential solutions to them? Is LayerNorm supported yet?

fKunstner (Collaborator) commented

Not a clean solution, but aiming for the minimum amount of code to make it work.

For the nn.Linear and convolution layers, you can tell BackPACK to extend only those submodules, and that would let you extract the individual gradients for them. If individual gradients for those layers are all you need, that'd be relatively easy: go through all the leaf-level modules in your network and, if a module is a Linear or a Conv, call extend on it.
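
As a minimal sketch of that loop (my code, not from BackPACK's docs; the helper name extend_supported_leaves is made up), assuming BackPACK's extend, backpack and BatchGrad, and a CrossEntropyLoss at the end:

import torch
import torch.nn as nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

def extend_supported_leaves(model):
    supported = (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)
    for module in model.modules():
        # Only leaf-level modules that BackPACK knows about get extended.
        if len(list(module.children())) == 0 and isinstance(module, supported):
            extend(module)
    return model

model = extend_supported_leaves(
    nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
)
lossfunc = extend(nn.CrossEntropyLoss())  # the loss itself is a supported module

X, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
with backpack(BatchGrad()):
    lossfunc(model(X), y).backward()

for name, p in model.named_parameters():
    print(name, p.grad_batch.shape)  # leading dimension is the batch size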

The custom parameters like self.bias in SeparableConv1D and self.gammas in MonotonicConvolutionalMultiheadAttention are trickier. To avoid having to write custom gradient-extraction code, you could rewrite them as an nn.Linear. That'd be less efficient, but might be enough for experimentation.

For example instead of

def __init__(self):
    ...
    self.bias = nn.Parameter(...)
    ...

def forward(self, x):
    ...
    x += self.bias
    ...

you could do

def __init__(self):
    ...
    self.dummy_linear = nn.Linear(...)
    self.dummy_linear.weight.data = torch.eye(...)   # fix the weight to the identity
    self.dummy_linear.weight.requires_grad = False   # only the bias stays trainable
    ...

def forward(self, x):
    ...
    x = self.dummy_linear(x)  # identity weight, so this is x + bias
    ...
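
As a rough usage sketch of that trick (the module name BiasAsLinear and the sizes are made up), assuming the dummy weight is pinned to the identity so the layer computes x + bias, BackPACK's BatchGrad then exposes the per-sample gradients of the original bias through the bias of the extended Linear:

import torch
import torch.nn as nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

class BiasAsLinear(nn.Module):  # hypothetical stand-in for the custom self.bias
    def __init__(self, dim):
        super().__init__()
        self.dummy_linear = nn.Linear(dim, dim, bias=True)
        with torch.no_grad():
            self.dummy_linear.weight.copy_(torch.eye(dim))  # identity weight
        self.dummy_linear.weight.requires_grad = False      # only the bias trains

    def forward(self, x):
        return self.dummy_linear(x)  # equals x + bias

layer = BiasAsLinear(8)
extend(layer.dummy_linear)  # extend just the supported submodule

with backpack(BatchGrad()):
    layer(torch.randn(4, 8)).sum().backward()
print(layer.dummy_linear.bias.grad_batch.shape)  # torch.Size([4, 8])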

LayerNorm could also be implemented through an nn.Linear by keeping the weight matrix diagonal, e.g. setting linear.weight.data = torch.diag(torch.diag(linear.weight.data)) in place in the forward pass before doing the matrix multiplication.
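
A sketch of how I read that suggestion (DiagonalLayerNorm is a made-up name; treat it as an illustration, not tested library code):

import torch
import torch.nn as nn

class DiagonalLayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.linear = nn.Linear(dim, dim, bias=True)
        with torch.no_grad():
            self.linear.weight.copy_(torch.eye(dim))  # gamma starts at 1
            self.linear.bias.zero_()                  # beta starts at 0

    def forward(self, x):
        # Project the weight back onto its diagonal in place before the matmul,
        # so the layer stays an elementwise scale even as the weight is trained.
        self.linear.weight.data = torch.diag(torch.diag(self.linear.weight.data))
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        return self.linear((x - mean) / torch.sqrt(var + self.eps))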
