Last Updated on July 3, 2022
Deep learning models can take hours, days, or even weeks to train.
If the run is stopped unexpectedly, you can lose a lot of work.
In this post, you will discover how to checkpoint your deep learning models during training in Python using the Keras library.
Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and the Python source code files for all examples.
Let's get started.
- Jun/2016: First published
- Update Mar/2017: Updated for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
- Update Mar/2018: Added alternate link to download the dataset.
- Update Sep/2019: Updated for Keras 2.2.5 API.
- Update Oct/2019: Updated for Keras 2.3.0 API.
- Update Jul/2022: Updated for TensorFlow 2.x API and mention of EarlyStopping.

How to Checkpoint Deep Learning Models in Keras
Photo by saragoldsmith, some rights reserved.
Checkpointing Neural Network Models
Application checkpointing is a fault tolerance technique for long-running processes.
It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly or used as the starting point for a new run, picking up where it left off.
When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is or used as the basis for ongoing training.
The Keras library provides a checkpointing capability via a callback API.
The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how the file should be named, and under what circumstances to make a checkpoint of the model.
The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement by maximizing or minimizing the score. Finally, the filename you use to store the weights can include variables like the epoch number or metric.
The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.
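As a minimal, self-contained sketch of this API (the tiny model, random data, and filename are placeholders for illustration, not the example used later in this post):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# placeholder data: 100 samples with 8 inputs and a binary label
X = np.random.rand(100, 8)
Y = np.random.randint(0, 2, size=100)
# placeholder model
model = Sequential()
model.add(Dense(4, input_shape=(8,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# the filename template may embed the epoch number and the monitored metric
checkpoint = ModelCheckpoint('sketch-{epoch:02d}-{val_accuracy:.2f}.hdf5',
                             monitor='val_accuracy',  # metric to watch
                             mode='max',              # an improvement is a larger value
                             save_best_only=True,     # only write when the metric improves
                             verbose=1)
model.fit(X, Y, validation_split=0.33, epochs=5, callbacks=[checkpoint], verbose=0)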
Note: You may need to install the h5py library (e.g., via pip) to output network weights in HDF5 format.
Checkpoint Neural Network Model Improvements
A good use of checkpointing is to output the model weights each time an improvement is observed during training.
The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the pima-indians-diabetes.csv file is in your working directory.
You can download the dataset from here:
The example uses 33% of the data for validation.
Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (monitor='val_accuracy' and mode='max'). The weights are stored in a file that includes the epoch and score in the filename (weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5).
# Checkpoint the weights when validation accuracy improves
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np
import tensorflow as tf
# fix the random seed for reproducibility
seed = 42
tf.random.set_seed(seed)
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# checkpoint: save whenever validation accuracy improves
filepath = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example produces the following output (truncated for brevity).
...
Epoch 00134: val_accuracy did not improve
Epoch 00135: val_accuracy did not improve
Epoch 00136: val_accuracy did not improve
Epoch 00137: val_accuracy did not improve
Epoch 00138: val_accuracy did not improve
Epoch 00139: val_accuracy did not improve
Epoch 00140: val_accuracy improved from 0.83465 to 0.83858, saving model to weights-improvement-140-0.84.hdf5
Epoch 00141: val_accuracy did not improve
Epoch 00142: val_accuracy did not improve
Epoch 00143: val_accuracy did not improve
Epoch 00144: val_accuracy did not improve
Epoch 00145: val_accuracy did not improve
Epoch 00146: val_accuracy improved from 0.83858 to 0.84252, saving model to weights-improvement-146-0.84.hdf5
Epoch 00147: val_accuracy did not improve
Epoch 00148: val_accuracy improved from 0.84252 to 0.84252, saving model to weights-improvement-148-0.84.hdf5
Epoch 00149: val_accuracy did not improve
You will see a number of files in your working directory containing the network weights in HDF5 format. For example:
...
weights-improvement-53-0.76.hdf5
weights-improvement-71-0.76.hdf5
weights-improvement-77-0.78.hdf5
weights-improvement-99-0.78.hdf5
This is a very simple checkpointing strategy.
It may create a lot of unnecessary checkpoint files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure you have a snapshot of the best model discovered during your run.
Checkpoint Best Neural Network Model Only
A simpler checkpoint strategy is to save the model weights to the same file, if and only if the validation accuracy improves.
This can be done easily using the same code from above and changing the output filename to be fixed (not include score or epoch information).
In this case, model weights are written to the file "weights.best.hdf5" only if the classification accuracy of the model on the validation dataset improves over the best seen so far.
# Checkpoint the weights for best model on validation accuracy
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# checkpoint: overwrite a single file whenever validation accuracy improves
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running this example provides the following output (truncated for brevity).
...
Epoch 00139: val_accuracy improved from 0.79134 to 0.79134, saving model to weights.best.hdf5
Epoch 00140: val_accuracy did not improve
Epoch 00141: val_accuracy did not improve
Epoch 00142: val_accuracy did not improve
Epoch 00143: val_accuracy did not improve
Epoch 00144: val_accuracy improved from 0.79134 to 0.79528, saving model to weights.best.hdf5
Epoch 00145: val_accuracy improved from 0.79528 to 0.79528, saving model to weights.best.hdf5
Epoch 00146: val_accuracy did not improve
Epoch 00147: val_accuracy did not improve
Epoch 00148: val_accuracy did not improve
Epoch 00149: val_accuracy did not improve
You should see the weight file in your local directory.
This is a handy checkpoint strategy to always use during your experiments.
It will ensure that your best model is saved for the run for you to use later if you wish. It avoids needing to include code to manually track and serialize the best model when training.
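For contrast, below is a minimal sketch of the bookkeeping that save_best_only spares you from writing by hand. The BestWeightsTracker class name is hypothetical, not part of Keras:

import numpy as np
from tensorflow.keras.callbacks import Callback

class BestWeightsTracker(Callback):
    # hypothetical helper: keep the best weights in memory instead of on disk
    def __init__(self):
        super().__init__()
        self.best = -np.inf
        self.best_weights = None

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get('val_accuracy')
        if current is not None and current > self.best:
            self.best = current
            self.best_weights = self.model.get_weights()

After training with such a callback, model.set_weights(tracker.best_weights) would restore the best epoch, but ModelCheckpoint achieves the same result with less code, and its on-disk copy survives a crash.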
Use EarlyStopping Together with Checkpoint
In the examples above, we tried to fit our model with 150 epochs. In reality, it is not easy to tell how many epochs we need to train our model. One way to address this problem is to overestimate the number of epochs. But this may take significant time. After all, if we are checkpointing the best model only, we may find that over the several thousand epochs we run, we already achieved the best model in the first hundred epochs, and no more checkpoints are made afterward.
It is quite common to use the ModelCheckpoint callback together with EarlyStopping. It helps to stop the training once we do not see the metric improve for several epochs. The example below adds the callback es to stop the training early once the validation accuracy has not improved for 5 consecutive epochs:
# Checkpoint the weights for best model on validation accuracy, with early stopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import numpy as np
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# checkpoint the best model and stop early if validation accuracy stalls
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
es = EarlyStopping(monitor='val_accuracy', patience=5)
callbacks_list = [checkpoint, es]
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running this example provides the following output:
Epoch 1: val_accuracy improved from -inf to 0.51969, saving model to weights.best.hdf5
Epoch 2: val_accuracy did not improve from 0.51969
Epoch 3: val_accuracy improved from 0.51969 to 0.54724, saving model to weights.best.hdf5
Epoch 4: val_accuracy improved from 0.54724 to 0.61417, saving model to weights.best.hdf5
Epoch 5: val_accuracy did not improve from 0.61417
Epoch 6: val_accuracy did not improve from 0.61417
Epoch 7: val_accuracy improved from 0.61417 to 0.66142, saving model to weights.best.hdf5
Epoch 8: val_accuracy did not improve from 0.66142
Epoch 9: val_accuracy did not improve from 0.66142
Epoch 10: val_accuracy improved from 0.66142 to 0.68504, saving model to weights.best.hdf5
Epoch 11: val_accuracy did not improve from 0.68504
Epoch 12: val_accuracy did not improve from 0.68504
Epoch 13: val_accuracy did not improve from 0.68504
Epoch 14: val_accuracy did not improve from 0.68504
Epoch 15: val_accuracy improved from 0.68504 to 0.69685, saving model to weights.best.hdf5
Epoch 16: val_accuracy improved from 0.69685 to 0.71260, saving model to weights.best.hdf5
Epoch 17: val_accuracy improved from 0.71260 to 0.72047, saving model to weights.best.hdf5
Epoch 18: val_accuracy did not improve from 0.72047
Epoch 19: val_accuracy did not improve from 0.72047
Epoch 20: val_accuracy did not improve from 0.72047
Epoch 21: val_accuracy did not improve from 0.72047
Epoch 22: val_accuracy did not improve from 0.72047
This training process stopped after epoch 22 because no better accuracy was achieved in the last 5 epochs.
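As a side note, the EarlyStopping callback in TensorFlow 2.x also accepts a restore_best_weights argument. Below is a small variation on the script above (model, X, Y, and checkpoint are assumed to be the same as in that example):

from tensorflow.keras.callbacks import EarlyStopping
# restore_best_weights rolls the in-memory model back to its best epoch when
# training stops; it complements, but does not replace, the copy on disk
es = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10,
          callbacks=[checkpoint, es], verbose=0)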
Loading a Checkpointed Neural Network Model
Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model.
The checkpoint only includes the model weights. It assumes you know the network structure. This, too, can be serialized to file in JSON or YAML format.
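For example, here is a short sketch of saving the structure as JSON and rebuilding it later (the model variable and the model.json filename are assumptions for illustration):

from tensorflow.keras.models import model_from_json

# save the architecture only; the weights live in the checkpoint file
with open('model.json', 'w') as f:
    f.write(model.to_json())

# later: rebuild the structure, then load the checkpointed weights into it
with open('model.json') as f:
    restored_model = model_from_json(f.read())
restored_model.load_weights('weights.best.hdf5')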
In the example below, the model structure is known, and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.
The model is then used to make predictions on the entire dataset.
# How to load and use weights from a checkpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
# create model (the structure must match the checkpointed model)
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# load weights
model.load_weights("weights.best.hdf5")
# Compile model (required to make predictions)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Created model and loaded weights from file")
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# estimate accuracy on whole dataset using loaded weights
scores = model.evaluate(X, Y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example produces the following output.
Created model and loaded weights from file
acc: 77.73%
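The loaded model can also be used for individual predictions rather than evaluation. Below is a short sketch continuing from the script above, turning the sigmoid outputs into class labels (the 0.5 threshold is the conventional choice for a sigmoid output):

# continuing from the loading script above
probabilities = model.predict(X, verbose=0)
predictions = (probabilities > 0.5).astype('int32')
print(predictions[:5])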
Summary
In this post, you discovered the importance of checkpointing deep learning models for long training runs.
You learned two checkpointing strategies that you can use on your next deep learning project:
- Checkpoint Model Improvements.
- Checkpoint Best Model Only.
You also learned how to load a checkpointed model and make predictions.
Do you have any questions about checkpointing deep learning models or about this post? Ask your questions in the comments, and I will do my best to answer.