
param_groups_lrd for layer decay #177

Open · 1119736939 opened this issue Sep 15, 2023 · 1 comment

Comments

@1119736939

Line 25 of lr_decay.py computes:

layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))

The elements of layer_scales are increasing, so the learning rates follow "the deeper the layer, the larger the learning rate". I printed the learning rates after executing lr_sched.adjust_learning_rate and confirmed this behavior. But shouldn't deeper layers get a smaller learning rate? I'm confused. Could you please clarify? Thanks.

@alexlioralexli

The layers are indexed so that the first block (the one closest to the raw input) has index 0, and the last block (the one closest to predicting the logits) has index L - 1. Since layer_decay < 1, the scale layer_decay ** (num_layers - i) grows toward 1 as i increases, so the later layers do correctly get a larger learning rate; it is the earliest layers whose learning rate is decayed the most.
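
To make the indexing concrete, here is a minimal sketch of the scale computation quoted above. The values layer_decay = 0.75 and num_layers = 12 are illustrative assumptions, not taken from any particular config in this repo:

```python
# A minimal sketch of the scale computation from line 25 of lr_decay.py.
# layer_decay = 0.75 and num_layers = 12 are assumed, illustrative values.
layer_decay = 0.75
num_layers = 12

# Index 0 is the block closest to the raw input; index num_layers is the
# end closest to the output head. Since layer_decay < 1, the scale grows
# toward 1.0 as i increases.
layer_scales = [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]

for i, scale in enumerate(layer_scales):
    print(f"layer {i:2d}: lr scale = {scale:.4f}")

# layer  0 -> 0.75**12 ≈ 0.0317  (closest to input, smallest lr)
# layer 12 -> 0.75**0  =  1.0000 (closest to output, base lr)
```

So the increasing sequence is intentional: the scales approach 1.0 at the output end, and the base learning rate is multiplied down for blocks nearer the input.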
