BatchNorm training instability fix #675

Open · wants to merge 6 commits into main

Conversation

andrewdipper

This is in reference to issue #659.

I modified BatchNorm to have two approaches, "batch" and "ema". "batch" just uses the batch statistics during training. If approach is not specified, it defaults to "batch" with a warning. It's robust and seems to be the standard choice - it's far less likely to kill a model just by adding it.

"ema" is based of the smooth start method in the above issue. So keep a running mean and variance but instead of renormalizing Adam style the parts of the running averages that are zeroed are filled with the batch statistics. The problem is it's still not robust - the momentum parameter is simultaneously specifying a warmup period (when we're expecting the input distribution to change significantly) and how long we want the running average to be. So I added a linear warmup period.

Now, for any choice of momentum there seems to be a warmup_period choice that gives good results, and in my tests validation performance was at least as good as with batch mode. However, I don't see a good default for warmup_period.
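
For concreteness, here's a minimal usage sketch of what this looks like from the caller's side, assuming the constructor arguments in this PR and Equinox's usual stateful-layer pattern (shapes and values are made up):

```python
import jax
import jax.numpy as jnp
import equinox as eqx

# `approach` (and, for "ema", `warmup_period`) are the new arguments in this PR.
bn, state = eqx.nn.make_with_state(eqx.nn.BatchNorm)(
    input_size=3, axis_name="batch", approach="batch"
)
x = jnp.ones((8, 3))
# BatchNorm sees the whole batch via the named vmap axis; the state is threaded
# through unbatched.
out, state = jax.vmap(
    bn, in_axes=(0, None), out_axes=(0, None), axis_name="batch"
)(x, state)
```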

Some considerations:

  • Having approach="batch" alongside the common axis_name="batch" is a little awkward.
  • There's an example using BatchNorm - it will start raising a warning and should probably be changed.
  • The current BatchNorm behavior can't be exactly replicated; ema / momentum=0.99 / warmup_period=1 is close, but different at the start.
  • There's one more piece of state, hence the test_stateful.py change. This could be conditionally removed for approach="batch" if desired.

Let me know what you think, or if any changes or tests need to be added.


self.inference = inference
self.axis_name = axis_name
self.input_size = input_size
self.eps = eps
self.channelwise_affine = channelwise_affine
self.momentum = momentum
self.warmup_period = max(1, warmup_period)
Owner

Why the max? Perhaps it would be better to just error out on values that are too small?

Author

warmup_period=0 seemed natural for "off" - changed it to check and error out instead.


@jax.named_scope("eqx.nn.BatchNorm")
def __call__(
Owner

It's not completely obvious to me that the ema implementation, with default arguments, reproduces the previous behaviour. (For example, we have warmup_period=1000 by default?)

Can you add some comments explaining what each approach corresponds to?

Author

ema with warmup_period=1 approximately reproduces the previous behavior. As I noted, the start is different because of how the running statistics are initially populated. With warmup_period=1 there's no interpolation between the batch and running stats - the running stats are always used, as with the previous behavior. I can give an exact replication with an extra approach if necessary.
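
A quick way to see this from the warmup update that appears later in the diff: with `warmup_period=1`,

$$\texttt{warmup\_frac} = \frac{\min(\texttt{count}+1,\ \texttt{warmup\_period})}{\texttt{warmup\_period}} = 1,$$

so the `(1 - warmup_frac) * batch_mean` term vanishes and the normalisation always uses `zero_frac * batch_mean + running_mean`, i.e. the (zero-fraction-filled) running statistics.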

Added some explanation to the documentation.

Owner

I think an exact replication is probably important for the default behaviour, just because I'd like to be sure that we're bit-for-bit backward compatible.

Author

Makes sense - it was different enough that I added it as "ema_compatibility". I changed the warning to rather strongly recommend against using "ema_compatibility". I haven't found a use case where I wouldn't expect to see the instability (at least with a larger learning rate), but that could very much be due to a lack of imagination on my part. That part can definitely change if needed.

state = state.set(self.first_time_index, jnp.array(False))
momentum = self.momentum
zero_frac = state.get(self.zero_frac_index)
zero_frac *= momentum
Owner

Stylistic nit: I tend not to use the in-place operations in JAX code. This (a) fits with the functional style a bit better, and (b) emphasises that we're definitely falling back to the zero_frac = zero_frac * momentum interpretation of the syntax. (Gosh, why does Python have two different meanings for the same syntax?)
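
(A tiny illustration of the two meanings, for anyone reading along: NumPy's `*=` mutates the array in place, whereas JAX arrays are immutable, so `*=` simply rebinds the name.)

```python
import numpy as np
import jax.numpy as jnp

a = np.ones(3)
b = a
a *= 2
print(b)  # [2. 2. 2.] -- NumPy mutated the shared buffer in place

x = jnp.ones(3)
y = x
x *= 2
print(y)  # [1. 1. 1.] -- JAX arrays are immutable; `x` was rebound to a new array
```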

Author

Makes sense, done.


batch_mean, batch_var = jax.vmap(_stats)(x)
running_mean, running_var = state.get(self.state_index)
momentum = self.momentum
running_mean = (1 - momentum) * batch_mean + momentum * running_mean
running_var = (1 - momentum) * batch_var + momentum * running_var
Owner

These don't appear to be used on the batch branch. I think the lines here can be reorganised to keep each approach only using the things it needs.

Author

These are used by the batch branch when we're in inference mode, so they still need to be computed and stored.

Comment on lines 195 to 203
warmup_count = state.get(self.count_index)
warmup_count = jnp.minimum(warmup_count + 1, self.warmup_period)
state = state.set(self.count_index, warmup_count)

warmup_frac = warmup_count / self.warmup_period
norm_mean = zero_frac * batch_mean + running_mean
norm_mean = (1.0 - warmup_frac) * batch_mean + warmup_frac * norm_mean
norm_var = zero_frac * batch_var + running_var
norm_var = (1.0 - warmup_frac) * batch_var + warmup_frac * norm_var
Owner

I'm definitely going to have to sit down and grok what's going on here more carefully! As above it would be good to have some comments / docstrings / references / etc. describing what each approach is meant to do.

(Cf. something like the MultiheadAttention docstring for an example of how to use LaTeX, if it'd be helpful.)

Author

Added some commentary and tried to make it a bit cleaner.

But overall, batch mode should follow the cited paper. Ema follows the prior behavior, but changes the initialization of the running stats and adds interpolation so that it can stay stable during training.
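
A self-contained sketch of the ema bookkeeping being discussed here (names simplified from the diff; illustrative only, not the PR's exact code). The variance follows the same pattern, with the debias coefficient from the hunk below.

```python
import jax.numpy as jnp

def ema_update(batch_mean, running_mean, zero_frac, step,
               momentum=0.99, warmup_period=1000):
    # Ordinary exponential moving average, initialised at zero.
    running_mean = (1 - momentum) * batch_mean + momentum * running_mean
    # Fraction of the running average still sitting at its zero initialisation.
    zero_frac = zero_frac * momentum
    # Fill the still-zero fraction with the batch statistic rather than
    # renormalising (the "smooth start" idea from issue #659).
    filled_mean = zero_frac * batch_mean + running_mean
    # Linear warmup: early on, lean directly on the batch statistic.
    warmup_frac = jnp.minimum(step + 1, warmup_period) / warmup_period
    norm_mean = (1 - warmup_frac) * batch_mean + warmup_frac * filled_mean
    return norm_mean, running_mean, zero_frac
```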

debias_coef = axis_size / jnp.maximum(axis_size - 1, self.eps)
running_var = (
    1 - momentum
) * debias_coef * batch_var + momentum * running_var
Author

I neglected to use the unbiased variance, so I've corrected that here.
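
For context on the coefficient: with a batch axis of size $n$, the batch variance computed with denominator $n$ is biased, since

$$\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2,$$

so scaling by $n/(n-1)$ gives an unbiased estimate; the `jnp.maximum(axis_size - 1, self.eps)` just guards against division by zero when the batch axis has size 1.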

@@ -202,8 +259,15 @@ def _stats(y):
    norm_var = zero_frac * batch_var + running_var
    norm_var = (1.0 - warmup_frac) * batch_var + warmup_frac * norm_var
else:
    axis_size = jax.lax.psum(jnp.array(1.0), self.axis_name)
Author

I'm using this to get the length of the "batch" axis, but I'm not sure it's the best / correct way.

Owner

I think this is the correct way! IIRC psum(1) is actually special-cased for this purpose.
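
A tiny check of that behaviour, independent of this PR (a scalar `psum` of 1 over the named axis comes back as the axis size):

```python
import jax
import jax.numpy as jnp

def axis_size(_):
    # Summing 1 across the named axis yields the number of elements in it.
    return jax.lax.psum(jnp.array(1.0), "batch")

x = jnp.zeros((8, 3))
print(jax.vmap(axis_size, axis_name="batch")(x))  # eight copies of 8.0
```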

Comment on lines 135 to 138
- `approach`: The approach to use for the running statistics. If `approach=None`
  a warning will be raised and approach will default to `"batch"`. During
  training, `"batch"` only uses batch statistics while `"ema"` uses the running
  statistics.
Owner

So continuing from my previous comment -- probably the default should be ema if approach=None.

Owner

@patrick-kidger left a comment

Okay! Sorry for taking so long to get back around to reviewing this.

Lmk once you're happy that the previous behaviour is replicated by default, and I'll sit down with a pen and paper and satisfy myself that the calculations all look reasonable!

@andrewdipper
Author

All good - I got caught up in other things myself!

From my tests the replication is now exact. Doing so added another approach that is very similar to "ema", but that seemed like the most reasonable way to organize it. Let me know if anything isn't clear.
