param_groups_lrd for layer decay #177

1119736939 · 2023-09-15T08:03:04Z

In line 25 of lr_decay.py: layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))
The elements of "layer_scales" are increasing, so the learning rates follow "the deeper the layer, the greater the learning rate". I printed the learning rates after executing the "lr_sched.adjust_learning_rate" function, and they confirm this. But shouldn't it be the reverse: the deeper the layer, the smaller the learning rate? I'm confused. Could you please clarify? Thanks.

alexlioralexli · 2023-12-18T21:30:00Z

The layers are indexed so that the first block (the one that is closest to the raw input) has index 0, and the last block (the one closest to predicting the logits) has index L - 1. So the later layers do correctly get a larger learning rate.
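
To make the indexing concrete, here is a minimal sketch that evaluates the expression quoted from lr_decay.py on its own. The values of layer_decay and num_layers are assumptions chosen only for illustration, not the repository's defaults, and the scales are only increasing because layer_decay is assumed to be below 1.

```python
# Hypothetical values for illustration only (not taken from the repository).
layer_decay = 0.75
num_layers = 4

# Same expression as the one quoted in the question, written as a list comprehension.
layer_scales = [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]

# Index 0 is the layer closest to the raw input; the highest index is closest to the logits.
for layer_id, scale in enumerate(layer_scales):
    print(f"layer_id={layer_id} lr_scale={scale}")
```

With layer_decay below 1, the exponent (num_layers - i) shrinks as i grows, so the scale rises toward 1.0 at the last index. The layers near the input therefore get the smallest learning rates and the layers near the output get the largest.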
