Notebook 07: Cannot save model checkpoint for FoodVision Big #449

Open
mrdbourke opened this issue Sep 12, 2022 · 3 comments

@mrdbourke

Getting an error when training FoodVision Big:

# Fit the model with callbacks
history_101_food_classes_feature_extract = model.fit(train_data, 
                                                     epochs=3,
                                                     steps_per_epoch=len(train_data),
                                                     validation_data=test_data,
                                                     validation_steps=int(0.15 * len(test_data)),
                                                     callbacks=[create_tensorboard_callback("training_logs", 
                                                                                            "efficientnetb0_101_classes_all_data_feature_extract"),
                                                                model_checkpoint])
>>>WARNING:tensorflow:Can save best model only with val_acc available, skipping.

Looks like it's an issue with the model_checkpoint callback.

This causes the assertion for the cloned model later on to fail:

# Evaluate cloned model with loaded weights (should be same score as trained model)
results_cloned_model_with_loaded_weights = cloned_model.evaluate(test_data)
>>> ---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_1443486/1110829135.py in <module>
      1 # Loaded checkpoint weights should return very similar results to checkpoint weights prior to saving
      2 import numpy as np
----> 3 assert np.isclose(results_feature_extract_model, results_cloned_model_with_loaded_weights).all() # check if all elements in array are close

AssertionError: 

Need to update the model checkpoint to make sure it can save a model whilst training.
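
The warning text names `val_acc`, which suggests the callback was created with `monitor="val_acc"`, while a model compiled with `metrics=["accuracy"]` logs the key `val_accuracy` in TensorFlow 2.x. The callback definition isn't shown in this issue, so treat the following as a minimal sketch under that assumption. `pick_monitor` and `make_checkpoint` are hypothetical helper names, not part of the notebook:

```python
def pick_monitor(logged_keys, candidates=("val_accuracy", "val_acc")):
    """Return the first candidate metric name that Keras actually logged."""
    for name in candidates:
        if name in logged_keys:
            return name
    raise KeyError(f"None of {candidates} found in logged metrics: {sorted(logged_keys)}")


def make_checkpoint(checkpoint_path, monitor="val_accuracy"):
    """Build a best-only, weights-only checkpoint callback (hypothetical helper)."""
    # Imported lazily so pick_monitor() can be used without TensorFlow installed
    import tensorflow as tf

    # Assumption: the original callback used these options with monitor="val_acc".
    # Newer Keras releases may also expect a specific filename suffix for weights files.
    return tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path,
        monitor=monitor,          # must match a key Keras logs during validation
        save_best_only=True,
        save_weights_only=True,
        verbose=0,
    )
```

One way to check which name to use is to fit for a single epoch and pass `history.history.keys()` to `pick_monitor` before building the callback.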

@mrdbourke

Also getting this in Google Colab:

[Screenshot of the error in Google Colab, 2022-09-12 (image not included)]

@mrdbourke

After troubleshooting this for a while, it seems there may be something up with the tf.keras.models.clone_model function.

What exactly, I'm not sure.

It could be due to the use of tf.keras.applications.efficientnet models (which are notorious for errors across TensorFlow versions).

In saying that, a fix I've found to demonstrate the "cloning" and loading of weights is to create a copy of the model by using the exact same code to create it:

import tensorflow as tf
from tensorflow.keras import layers

# Create a function to recreate the original model
def create_model():
  # Create base model
  input_shape = (224, 224, 3)
  base_model = tf.keras.applications.efficientnet.EfficientNetB0(include_top=False)
  base_model.trainable = False # freeze base model layers

  # Create Functional model 
  inputs = layers.Input(shape=input_shape, name="input_layer")
  # Note: EfficientNetBX models have rescaling built-in but if your model didn't you could have a layer like below
  # x = layers.Rescaling(1./255)(inputs)
  x = base_model(inputs, training=False) # set base_model to inference mode only
  x = layers.GlobalAveragePooling2D(name="pooling_layer")(x)
  x = layers.Dense(len(class_names))(x) # want one output neuron per class 
  # Separate activation of output layer so we can output float32 activations
  outputs = layers.Activation("softmax", dtype=tf.float32, name="softmax_float32")(x) 
  model = tf.keras.Model(inputs, outputs)
  
  return model

# Create and compile a new version of the original model (new weights)
created_model = create_model()
created_model.compile(loss="sparse_categorical_crossentropy",
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=["accuracy"])

# Load the saved weights
created_model.load_weights(checkpoint_path)

# Evaluate the model with loaded weights
results_created_model_with_loaded_weights = created_model.evaluate(test_data)

# Compare results with original model
import numpy as np
assert np.isclose(results_feature_extract_model, results_created_model_with_loaded_weights).all(), "Loaded weights results are not close to original model."  # check if all elements in array are close

In short, instead of using tf.keras.models.clone_model to compare results, create a new instance of the same model with the same code and load the saved weights into it.
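
When the assertion fails, the bare `AssertionError` doesn't say which value diverged. Below is a small sketch that reports mismatches per metric. It assumes `model.evaluate` returned `[loss, accuracy]` (as it does for a model compiled with `metrics=["accuracy"]`). It uses the standard library's `math.isclose` as a stand-in for `np.isclose`, with tolerances that roughly mirror NumPy's defaults but are not computed identically. `compare_results` is a hypothetical helper name:

```python
import math


def compare_results(original, reloaded, names=("loss", "accuracy"),
                    rel_tol=1e-5, abs_tol=1e-8):
    """Return (name, original_value, reloaded_value) for each metric that is not close."""
    mismatches = []
    for name, a, b in zip(names, original, reloaded):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((name, a, b))
    return mismatches
```

For example, `compare_results(results_feature_extract_model, results_created_model_with_loaded_weights)` returns an empty list when the loaded weights reproduce the original scores, and otherwise lists the metrics that differ.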

@mrdbourke

Continuing this here: #550

In short, it looks like TensorFlow 2.13+ (available via tf-nightly as of May 2023) fixes most of the issues discussed.
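
To decide whether the workaround above is still needed, a notebook could check the installed version first. This is a rough sketch and not part of the original notebooks: `version_at_least` is a hypothetical helper that only compares the major and minor numbers of a version string such as `tf.__version__`.

```python
import re


def version_at_least(version, minimum=(2, 13)):
    """Return True if the major.minor part of `version` is >= `minimum`."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)  # keep leading digits, drop suffixes like "rc0"
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:  # tolerate strings like "2"
        parts.append(0)
    return tuple(parts) >= tuple(minimum)
```

In a notebook this would be called as `version_at_least(tf.__version__)` after importing TensorFlow.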
