TensorFlow Keras Distributed with Kale¶
This section will guide you through creating and managing a TFJob
CR on
Kubeflow, with Kale and the Kubeflow Training Operator.
For this guide, we leverage the interactive environment of JupyterLab, but this
is completely optional. We do it to demonstrate how you can monitor such a job
using Kale’s TFJob
client.
Note
Currently, Kale supports the TensorFlow Keras API for creating and launching TensorFlow distributed jobs. In future versions, we plan to support the TensorFlow Estimator API as well.
Overview
What You’ll Need¶
- An Arrikto EKF or MiniKF deployment.
- The Kale TensorFlow Docker image.
Procedure¶
Create a new notebook server using the Kale TensorFlow Docker image. The image will have the following naming scheme:
gcr.io/arrikto/jupyter-kale-gpu-tf-py38:<IMAGE_TAG>Note
The
<IMAGE_TAG>
varies based on the MiniKF or EKF release.This is not the default Kale image, so you must carefully choose it from the dropdown menu.
Note
If you want to have access to a GPU device, you must specifically request one or more from the Jupyter Web App UI. For this user guide, access to a GPU device is not required.
Increase the size of the workspace volume to 10GB:
Connect to the server, open a terminal, and install the
tensorflow-datasets
package:$ pip3 install --user tensorflow-datasetsCreate a new Jupyter notebook (that is, an IPYNB file) using the JupyterLab UI:
Copy and paste the import statements on the top code cell. Then run it:
This is how your notebook cell will look like:
Create a function which downloads the MNIST dataset. Copy and paste the following code snippet to a new code cell. Then run it:
This is how your notebook cell will look like:
This cell creates a function which downloads the MNIST dataset. You will use the training split, scale it and create batches of 64 samples.
Define a function that creates, compiles, and returns a
tf.keras
Model. Copy and paste the following code snippet to a new code cell. Then run it:This is how your notebook cell will look like:
This cell defines a function that creates a standard Convolutional Neural Network architecture. It sets
adam
as the trainining process optimizer, the loss function tosparse_categorical_corssentropy
, and the accuracy metric. You will train this model architecture on the MNIST dataset in a distributed manner.Create and submit a
TFJob
CR. Copy and paste the following code snippet to a new code cell. Then run it:This is how your notebook cell will look like:
At a high level, the
distribute
function follows this process:- Save several assets to a local folder, including the training dataset and the model definition function.
- Snapshot the volumes mounted to the notebook server.
- Hydrate new PVCs starting from the snapshots of the previous step.
- Create and submit a
TFJob
CR. All the workers mount the newly created PVCs as RWX.
Upon submission of the CR, the Training Operator creates the two processes you requested with the
number_of_processes
argument. By default, each process requests to consume a GPU device. These Pods run a Kale entrypoint which- Looks for the assets saved during the preparation phase in the local FS (backed by one of the RWX PVCs), and loads them into memory.
- Creates a suitable TensorFlow distribution strategy.
- Instantiates and compiles the model inside the scope of this strategy.
- Calls the model’s
fit
function to start the training process.
Note
The TF Keras
fit
function accepts a number of different arguments. To view the list of arguments that thefit
function accepts see the TF Keras documentation. You can pass any of those arguments in thefit_kwargs
argument of thedistribute
function, as a Python dictionary. Kale will make sure to pass it down the line. However, you cannot pass values forx
, andy
arguments. Kale expects you to use thetrain_data_fn
argument to provide a function that creates the dataset. Then, Kale knows how to populate thex
andy
arguments.Note
If you want to distribute your model across multiple CPU cores, you can set the
cuda
argument toFalse
. By default, Kale will launch two processes (the minimum number of processes required by theTFJob
CR), on two different GPU devices.Monitor the
TFJob
CR. Copy and paste the following code snippet to a new code cell. Then run it:This is how your notebook cell will look like:
Note
In this step you monitor the state of the Job. The state can be in one of the following states:
Created
,Running
,Failed
, orSucceeded
. This call blocks until the training process finishes. To continue with the next step and view the logs, you can stop the interpreter by pressing the stop button in the notebook UI. Instead of the above, you can call theget_job_status
function of the client with no arguments. The function will return immediately, reporting back the current status of the Job.Stream the logs of the master process. Copy and paste the following code snippet to a new code cell. Then run it:
This is how your notebook cell will look like:
Note
In this step you view the logs of the Pod-running worker with index
0
. You can view the logs of any other worker as well, however, in most cases, they are identical. This call blocks until the training process finishes. If you want to continue executing other notebook cells, you can stop the interpreter by pressing the stop button in the notebook UI.Optional
When the training process completes, you can delete the
TFJob
CR. Copy and paste the following code snippet to a new code cell. Then run it:This is how your notebook cell will look like:
Important
After the completion of the training process, the controller will not remove the resources it creates. If you do not want to leave stale resources, you have to manually delete the CR using the above command.
Summary¶
You have successfully run a TF Keras distributed process using Kale and the Kubeflow Training Operator.
What’s Next¶
The next step is to create a TF Keras distributed KFP step using the Kale SDK.