diff --git a/19_training_and_deploying_at_scale.ipynb b/19_training_and_deploying_at_scale.ipynb
index c22a86c..ead10a4 100644
--- a/19_training_and_deploying_at_scale.ipynb
+++ b/19_training_and_deploying_at_scale.ipynb
@@ -2663,6 +2663,23 @@
    "# Extra Material – Distributed Keras Tuner on Vertex AI"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "Instead of using Vertex AI's hyperparameter tuning service, you can use [Keras Tuner](https://keras.io/keras_tuner/) (introduced in Chapter 10) and run it on Vertex AI VMs. Keras Tuner provides a simple way to scale hyperparameter search by distributing it across multiple machines: you just set three environment variables on each machine, then run your regular Keras Tuner code there, using the exact same script on all machines. One of the machines acts as the chief, and the others act as workers. Each worker asks the chief, which acts as the oracle, which hyperparameter values to try, then the worker trains the model using these hyperparameter values, and finally it reports the model's performance back to the chief, which can then decide which hyperparameter values the worker should try next.\n",
+   "\n",
+   "The three environment variables you need to set on each machine are:\n",
+   "\n",
+   "* `KERASTUNER_TUNER_ID`: equal to `\"chief\"` on the chief machine, or a unique identifier on each worker machine, such as `\"worker0\"`, `\"worker1\"`, etc.\n",
+   "* `KERASTUNER_ORACLE_IP`: the IP address or hostname of the chief machine. The chief itself should generally use `\"0.0.0.0\"` to listen on every IP address on the machine.\n",
+   "* `KERASTUNER_ORACLE_PORT`: the TCP port that the chief will be listening on.\n",
+   "\n",
+   "You can use distributed Keras Tuner on any set of machines. If you want to run it on Vertex AI machines, you can spawn a regular training job and just modify the training script to set the three environment variables properly before using Keras Tuner.\n",
+   "\n",
+   "For example, the script below starts by parsing the `TF_CONFIG` environment variable, which is automatically set by Vertex AI, just like earlier. It finds the address of the task of type `\"chief\"` and extracts its IP address or hostname and its TCP port. It then defines the tuner ID as the task type followed by the task index, for example `\"worker0\"`. If the tuner ID is `\"chief0\"`, it changes it to `\"chief\"` and sets the IP to `\"0.0.0.0\"`, which makes it listen on all IPv4 addresses on its machine. Then it defines the environment variables for Keras Tuner. Next, the script creates a tuner, just like in Chapter 10, then it runs the search, and finally it saves the best model to the location given by Vertex AI:"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 94,
@@ -2689,7 +2706,11 @@
    "if tuner_id == \"chief0\":\n",
    "    tuner_id = \"chief\"\n",
    "    chief_ip = \"0.0.0.0\"\n",
-   "    # extra code – shows one way to start a worker on the chief machine\n",
+   "    # extra code – since the chief doesn't work much, you can optimize compute\n",
+   "    # resources by running a worker on the same machine. To do this, you can\n",
+   "    # just make the chief start another process, after tweaking the TF_CONFIG\n",
+   "    # environment variable to set the task type to \"worker\" and the task index\n",
+   "    # to a unique value. Uncomment the next few lines to give this a try:\n",
    "    # import subprocess\n",
    "    # import sys\n",
    "    # tf_config[\"task\"][\"type\"] = \"workerX\"  # the worker on the chief's machine\n",
@@ -2753,6 +2774,13 @@
    "best_model.save(os.getenv(\"AIP_MODEL_DIR\"))"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "Note that Vertex AI automatically mounts the `/gcs` directory to GCS, using the open source [GCS Fuse adapter](https://cloud.google.com/storage/docs/gcs-fuse). This gives us a shared directory across the workers and the chief, which is required by Keras Tuner. Also note that we set the distribution strategy to a `MirroredStrategy`: this will allow each worker to use all the GPUs on its machine, if there's more than one.\n"
+  ]
+ },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -2773,6 +2801,13 @@
    "    f.write(script.replace(\"/gcs/my_bucket/\", f\"/gcs/{bucket_name}/\"))"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "Now all we need to do is start a custom training job based on this script, exactly like in the previous section. Don't forget to add `keras-tuner` to the list of `requirements`:"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 96,
@@ -2865,6 +2900,13 @@
    ")"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "And we have a model!"
+  ]
+ },
  {
   "cell_type": "markdown",
   "metadata": {},
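The TF_CONFIG parsing and environment-variable setup that the diff's markdown cell describes can be sketched as a standalone snippet. This is a minimal illustration, not the notebook's full script: the `TF_CONFIG` value below is hand-crafted to mimic what Vertex AI would set on a worker VM, and the cluster addresses are made up:

```python
import json
import os

# Hypothetical TF_CONFIG, mimicking what Vertex AI sets on each worker VM.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"chief": ["10.0.0.2:2222"], "worker": ["10.0.0.3:2222"]},
    "task": {"type": "worker", "index": 0},
})

# Find the chief's address and split it into IP/hostname and TCP port.
tf_config = json.loads(os.environ["TF_CONFIG"])
chief_ip, chief_port = tf_config["cluster"]["chief"][0].rsplit(":", 1)

# The tuner ID is the task type followed by the task index, e.g. "worker0".
tuner_id = tf_config["task"]["type"] + str(tf_config["task"]["index"])
if tuner_id == "chief0":
    tuner_id = "chief"
    chief_ip = "0.0.0.0"  # listen on all IPv4 addresses on this machine

# The three environment variables Keras Tuner uses for distributed search:
os.environ["KERASTUNER_TUNER_ID"] = tuner_id
os.environ["KERASTUNER_ORACLE_IP"] = chief_ip
os.environ["KERASTUNER_ORACLE_PORT"] = chief_port

print(tuner_id)  # worker0
```

On the chief machine the same code would instead produce the ID `"chief"` and the IP `"0.0.0.0"`, so the exact same script can run everywhere.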
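The "run a worker on the chief's machine" trick described in the code comment can also be sketched in isolation. In the real notebook the chief would relaunch the training script itself via `subprocess`; here, as a self-contained illustration with a hand-crafted `TF_CONFIG`, the child is just a tiny inline program that prints its task type, to show that the tweaked environment is passed through:

```python
import json
import os
import subprocess
import sys

# Hypothetical TF_CONFIG for the chief, mimicking what Vertex AI would set.
tf_config = {
    "cluster": {"chief": ["10.0.0.2:2222"], "worker": ["10.0.0.3:2222"]},
    "task": {"type": "chief", "index": 0},
}

# Re-tag the task type so the child process acts as an extra worker
# ("workerX" keeps its tuner ID distinct from the real workers').
tf_config["task"]["type"] = "workerX"
env = dict(os.environ, TF_CONFIG=json.dumps(tf_config))

# In the notebook, the chief would start the training script here instead.
child = subprocess.run(
    [sys.executable, "-c",
     "import json, os; "
     "print(json.loads(os.environ['TF_CONFIG'])['task']['type'])"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip())  # workerX
```

Since the chief's oracle process is mostly idle, this lets the chief's VM contribute its GPUs to the search instead of sitting unused.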