Notebook 07: Cannot save model checkpoint for FoodVision Big #449

mrdbourke · 2022-09-12T06:08:57Z

Getting an error when training FoodVision Big:

# Fit the model with callbacks
history_101_food_classes_feature_extract = model.fit(train_data, 
                                                     epochs=3,
                                                     steps_per_epoch=len(train_data),
                                                     validation_data=test_data,
                                                     validation_steps=int(0.15 * len(test_data)),
                                                     callbacks=[create_tensorboard_callback("training_logs", 
                                                                                            "efficientnetb0_101_classes_all_data_feature_extract"),
                                                                model_checkpoint])

>>>WARNING:tensorflow:Can save best model only with val_acc available, skipping.

Looks like it's an issue with the model_checkpoint callback.
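
The warning text suggests the callback is monitoring a metric named val_acc that never appears in the epoch logs. When a model is compiled with metrics=["accuracy"], TensorFlow 2.x typically logs the validation metric as val_accuracy, so if model_checkpoint was created with monitor="val_acc" (an assumption, since its definition is not shown here), changing it to monitor="val_accuracy" is the usual fix. The sketch below is a simplified imitation of that lookup in plain Python, not Keras's actual code; should_save and the log values are made up for illustration.

```python
# Simplified imitation of how a "save best only" checkpoint looks up its
# monitored metric at the end of an epoch. Not the real Keras implementation.
def should_save(logs, monitor, best, mode="max"):
    """Return (save, new_best); skip saving if the monitored key is missing."""
    current = logs.get(monitor)
    if current is None:
        # This is the situation that produces the warning seen above
        print(f"Can save best model only with {monitor} available, skipping.")
        return False, best
    improved = current > best if mode == "max" else current < best
    return improved, (current if improved else best)

# Hypothetical epoch logs for a model compiled with metrics=["accuracy"]
epoch_logs = {"loss": 1.2, "accuracy": 0.68, "val_loss": 1.1, "val_accuracy": 0.71}

print(should_save(epoch_logs, "val_acc", best=float("-inf")))       # key missing, nothing saved
print(should_save(epoch_logs, "val_accuracy", best=float("-inf")))  # key present, would save
```

If the monitored key is never found, no checkpoint is written, so any later load_weights call has nothing valid to load.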

This causes the assertion for the cloned model later on to fail:

# Evaluate cloned model with loaded weights (should be same score as trained model)
results_cloned_model_with_loaded_weights = cloned_model.evaluate(test_data)

>>> ---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_1443486/1110829135.py in <module>
      1 # Loaded checkpoint weights should return very similar results to checkpoint weights prior to saving
      2 import numpy as np
----> 3 assert np.isclose(results_feature_extract_model, results_cloned_model_with_loaded_weights).all() # check if all elements in array are close

AssertionError:

Need to update the model checkpoint to make sure it can save a model whilst training.
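
When the assertion fails, it can help to see how far apart the two evaluate() results are before digging into the checkpoint. A minimal sketch, assuming both results are [loss, accuracy] lists; the numbers and the compare_results helper are hypothetical, not values from this notebook:

```python
import numpy as np

def compare_results(original, reloaded, rtol=1e-05, atol=1e-08):
    """Return (all_close, absolute differences) for two evaluate() results."""
    original = np.asarray(original, dtype=float)
    reloaded = np.asarray(reloaded, dtype=float)
    diffs = np.abs(original - reloaded)
    # Same check as the notebook's assertion, using np.isclose's default tolerances
    all_close = bool(np.isclose(original, reloaded, rtol=rtol, atol=atol).all())
    return all_close, diffs

# Made-up [loss, accuracy] pairs for illustration only
trained = [1.0881, 0.7065]
stale = [4.6142, 0.0101]

all_close, diffs = compare_results(trained, stale)
print(all_close, diffs)
```

A large gap in both loss and accuracy, as in the made-up values here, would point to weights that were never saved or loaded rather than to small numerical drift.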


mrdbourke · 2022-09-12T06:13:33Z

Also getting this in Google Colab:

mrdbourke · 2023-05-18T05:23:21Z

After troubleshooting this for a while, it seems there may be something up with the tf.keras.clone_model method.

What exactly, I'm not sure.

It could be due to the use of tf.keras.applications.efficientnet models (which are notorious for errors across TensorFlow versions).

In saying that, a fix I've found to demonstrate the "cloning" and loading of weights is to create a copy of the model by using the exact same code to create it:

# Create a function to recreate the original model
def create_model():
  # Create base model
  input_shape = (224, 224, 3)
  base_model = tf.keras.applications.efficientnet.EfficientNetB0(include_top=False)
  base_model.trainable = False # freeze base model layers

  # Create Functional model 
  inputs = layers.Input(shape=input_shape, name="input_layer")
  # Note: EfficientNetBX models have rescaling built-in but if your model didn't, you could add a layer like the one below
  # x = layers.Rescaling(1./255)(inputs)
  x = base_model(inputs, training=False) # set base_model to inference mode only
  x = layers.GlobalAveragePooling2D(name="pooling_layer")(x)
  x = layers.Dense(len(class_names))(x) # want one output neuron per class 
  # Separate activation of output layer so we can output float32 activations
  outputs = layers.Activation("softmax", dtype=tf.float32, name="softmax_float32")(x) 
  model = tf.keras.Model(inputs, outputs)
  
  return model

# Create and compile a new version of the original model (new weights)
created_model = create_model()
created_model.compile(loss="sparse_categorical_crossentropy",
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=["accuracy"])

# Load the saved weights
created_model.load_weights(checkpoint_path)

# Evaluate the model with loaded weights
results_created_model_with_loaded_weights = created_model.evaluate(test_data)

# Compare results with original model
import numpy as np
assert np.isclose(results_feature_extract_model, results_created_model_with_loaded_weights).all(), "Loaded weights results are not close to original model."  # check if all elements in array are close

In short, rather than using tf.keras.clone_model to check the saved weights, create a new instance of the same model with the same code and load the checkpoint weights into it.

mrdbourke · 2023-05-19T01:12:35Z

Continuing this here: #550

In short, it looks like TensorFlow 2.13+ (available via tf-nightly as of May 2023) fixes most of the issues discussed.
