{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chapter 13: Loading and Preprocessing Data with TensorFlow**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_This notebook contains all the sample code and solutions to the exercises in chapter 13._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://colab.research.google.com/github/ageron/handson-ml3/blob/main/13_loading_and_preprocessing_data.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml3/blob/main/13_loading_and_preprocessing_data.ipynb\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" /></a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This project requires Python 3.8 or above:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"assert sys.version_info >= (3, 8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It also requires Scikit-Learn ≥ 1.0.1:"
]
},
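{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"from packaging import version\n",
"\n",
"# parse the versions before comparing them: plain string comparison is lexicographic\n",
"assert version.parse(sklearn.__version__) >= version.parse(\"1.0.1\")"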
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And TensorFlow ≥ 2.7:"
]
},
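{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"from packaging import version\n",
"\n",
"# as plain strings, \"2.10.0\" would sort before \"2.7.0\", so compare parsed versions\n",
"assert version.parse(tf.__version__) >= version.parse(\"2.7.0\")"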
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The tf.data API"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"X = tf.range(10) # any data tensor\n",
"dataset = tf.data.Dataset.from_tensor_slices(X)\n",
"dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"X_nested = {\"a\": ([1, 2, 3], [4, 5, 6]), \"b\": [7, 8, 9]}\n",
"dataset = tf.data.Dataset.from_tensor_slices(X_nested)\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chaining Transformations"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))\n",
"dataset = dataset.repeat(3).batch(7)\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.map(lambda x: x * 2) # x is a batch\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.filter(lambda x: tf.reduce_sum(x) > 50)\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"for item in dataset.take(2):\n",
" print(item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Shuffling the Data"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"dataset = tf.data.Dataset.range(10).repeat(2)\n",
"dataset = dataset.shuffle(buffer_size=4, seed=42).batch(7)\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interleaving lines from multiple files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by loading and preparing the California housing dataset. We first load it, then split it into a training set, a validation set and a test set:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# extra code: fetches and splits the California housing dataset\n",
"\n",
"from sklearn.datasets import fetch_california_housing\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"housing = fetch_california_housing()\n",
"X_train_full, X_test, y_train_full, y_test = train_test_split(\n",
"    housing.data, housing.target.reshape(-1, 1), random_state=42)\n",
"X_train, X_valid, y_train, y_valid = train_test_split(\n",
"    X_train_full, y_train_full, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, let's start by splitting the housing dataset and saving it to 20 CSV files:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# extra code: splits each dataset into parts and saves every part to its own CSV file\n",
"\n",
"import numpy as np\n",
"from pathlib import Path\n",
"\n",
"def save_to_csv_files(data, name_prefix, header=None, n_parts=10):\n",
"    housing_dir = Path() / \"datasets\" / \"housing\"\n",
"    housing_dir.mkdir(parents=True, exist_ok=True)\n",
"    filename_format = \"my_{}_{:02d}.csv\"\n",
"\n",
"    filepaths = []\n",
"    m = len(data)\n",
"    chunks = np.array_split(np.arange(m), n_parts)\n",
"    for file_idx, row_indices in enumerate(chunks):\n",
"        part_csv = housing_dir / filename_format.format(name_prefix, file_idx)\n",
"        filepaths.append(str(part_csv))\n",
"        with open(part_csv, \"w\") as f:\n",
"            if header is not None:\n",
"                f.write(header)\n",
"                f.write(\"\\n\")\n",
"            for row_idx in row_indices:\n",
"                f.write(\",\".join([repr(col) for col in data[row_idx]]))\n",
"                f.write(\"\\n\")\n",
"    return filepaths\n",
"\n",
"train_data = np.c_[X_train, y_train]\n",
"valid_data = np.c_[X_valid, y_valid]\n",
"test_data = np.c_[X_test, y_test]\n",
"header_cols = housing.feature_names + [\"MedianHouseValue\"]\n",
"header = \",\".join(header_cols)\n",
"\n",
"train_filepaths = save_to_csv_files(train_data, \"train\", header, n_parts=20)\n",
"valid_filepaths = save_to_csv_files(valid_data, \"valid\", header, n_parts=10)\n",
"test_filepaths = save_to_csv_files(test_data, \"test\", header, n_parts=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, now let's take a peek at the first few lines of one of these CSV files:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"print(\"\".join(open(train_filepaths[0]).readlines()[:4]))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"train_filepaths"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Building an Input Pipeline**"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows that the file paths are shuffled\n",
"for filepath in filepath_dataset:\n",
" print(filepath)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"n_readers = 5\n",
"dataset = filepath_dataset.interleave(\n",
" lambda filepath: tf.data.TextLineDataset(filepath).skip(1),\n",
" cycle_length=n_readers)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"for line in dataset.take(5):\n",
" print(line)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing the Data"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# extra code compute the mean and standard deviation of each feature\n",
"\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"scaler = StandardScaler()\n",
"scaler.fit(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"X_mean, X_std = scaler.mean_, scaler.scale_  # extra code\n",
"n_inputs = 8\n",
"\n",
"def parse_csv_line(line):\n",
"    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]\n",
"    fields = tf.io.decode_csv(line, record_defaults=defs)\n",
"    return tf.stack(fields[:-1]), tf.stack(fields[-1:])\n",
"\n",
"def preprocess(line):\n",
"    x, y = parse_csv_line(line)\n",
"    return (x - X_mean) / X_std, y"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Putting Everything Together + Prefetching"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def csv_reader_dataset(filepaths, n_readers=5, n_read_threads=None,\n",
"                       n_parse_threads=5, shuffle_buffer_size=10_000, seed=42,\n",
"                       batch_size=32):\n",
"    dataset = tf.data.Dataset.list_files(filepaths, seed=seed)\n",
"    dataset = dataset.interleave(\n",
"        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),\n",
"        cycle_length=n_readers, num_parallel_calls=n_read_threads)\n",
"    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)\n",
"    dataset = dataset.shuffle(shuffle_buffer_size, seed=seed)\n",
"    return dataset.batch(batch_size).prefetch(1)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# extra code show the first couple of batches produced by the dataset\n",
"\n",
"example_set = csv_reader_dataset(train_filepaths, batch_size=3)\n",
"for X_batch, y_batch in example_set.take(2):\n",
" print(\"X =\", X_batch)\n",
" print(\"y =\", y_batch)\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a short description of each method in the `Dataset` class:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# extra code: lists all methods of the tf.data.Dataset class\n",
"for m in dir(tf.data.Dataset):\n",
"    if not (m.startswith(\"_\") or m.endswith(\"_\")):\n",
"        func = getattr(tf.data.Dataset, m)\n",
"        if getattr(func, \"__doc__\", None):  # skip members without a docstring\n",
"            print(\"● {:21s}{}\".format(m + \"()\", func.__doc__.split(\"\\n\")[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the Dataset with Keras"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"train_set = csv_reader_dataset(train_filepaths)\n",
"valid_set = csv_reader_dataset(valid_filepaths)\n",
"test_set = csv_reader_dataset(test_filepaths)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# extra code for reproducibility\n",
"tf.keras.backend.clear_session()\n",
"tf.random.set_seed(42)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"model = tf.keras.Sequential([\n",
" tf.keras.layers.Dense(30, activation=\"relu\", kernel_initializer=\"he_normal\",\n",
" input_shape=X_train.shape[1:]),\n",
" tf.keras.layers.Dense(1),\n",
"])\n",
"model.compile(loss=\"mse\", optimizer=\"sgd\")\n",
"model.fit(train_set, validation_data=valid_set, epochs=5)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"test_mse = model.evaluate(test_set)\n",
"new_set = test_set.take(3)  # pretend we have 3 new batches (take() counts batches here)\n",
"y_pred = model.predict(new_set)  # or you could just pass a NumPy array"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# extra code: defines the optimizer and loss function for training\n",
"optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)\n",
"loss_fn = tf.keras.losses.mean_squared_error\n",
"\n",
"n_epochs = 5\n",
"for epoch in range(n_epochs):\n",
"    for X_batch, y_batch in train_set:\n",
"        # extra code: perform one Gradient Descent step\n",
"        # as explained in Chapter 12\n",
"        print(\"\\rEpoch {}/{}\".format(epoch + 1, n_epochs), end=\"\")\n",
"        with tf.GradientTape() as tape:\n",
"            y_pred = model(X_batch)\n",
"            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))\n",
"            loss = tf.add_n([main_loss] + model.losses)\n",
"        gradients = tape.gradient(loss, model.trainable_variables)\n",
"        optimizer.apply_gradients(zip(gradients, model.trainable_variables))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"@tf.function\n",
"def train_one_epoch(model, optimizer, loss_fn, train_set):\n",
"    for X_batch, y_batch in train_set:\n",
"        with tf.GradientTape() as tape:\n",
"            y_pred = model(X_batch)\n",
"            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))\n",
"            loss = tf.add_n([main_loss] + model.losses)\n",
"        gradients = tape.gradient(loss, model.trainable_variables)\n",
"        optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n",
"\n",
"optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)\n",
"loss_fn = tf.keras.losses.mean_squared_error\n",
"for epoch in range(n_epochs):\n",
"    print(\"\\rEpoch {}/{}\".format(epoch + 1, n_epochs), end=\"\")\n",
"    train_one_epoch(model, optimizer, loss_fn, train_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The TFRecord Format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A TFRecord file is just a list of binary records. You can create one using a `tf.io.TFRecordWriter`:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"with tf.io.TFRecordWriter(\"my_data.tfrecord\") as f:\n",
" f.write(b\"This is the first record\")\n",
" f.write(b\"And this is the second record\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you can read it using a `tf.data.TFRecordDataset`:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"filepaths = [\"my_data.tfrecord\"]\n",
"dataset = tf.data.TFRecordDataset(filepaths)\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can read multiple TFRecord files with just one `TFRecordDataset`. By default it will read them one at a time, but if you set `num_parallel_reads=3`, it will read 3 at a time in parallel and interleave their records:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# extra code: shows how to read multiple files in parallel and interleave them\n",
"\n",
"filepaths = [\"my_test_{}.tfrecord\".format(i) for i in range(5)]\n",
"for i, filepath in enumerate(filepaths):\n",
"    with tf.io.TFRecordWriter(filepath) as f:\n",
"        for j in range(3):\n",
"            f.write(\"File {} record {}\".format(i, j).encode(\"utf-8\"))\n",
"\n",
"dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)\n",
"for item in dataset:\n",
"    print(item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compressed TFRecord Files"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"options = tf.io.TFRecordOptions(compression_type=\"GZIP\")\n",
"with tf.io.TFRecordWriter(\"my_compressed.tfrecord\", options) as f:\n",
" f.write(b\"Compress, compress, compress!\")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"dataset = tf.data.TFRecordDataset([\"my_compressed.tfrecord\"],\n",
" compression_type=\"GZIP\")"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows that the data is decompressed correctly\n",
"for item in dataset:\n",
" print(item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Brief Introduction to Protocol Buffers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this section you need to [install protobuf](https://developers.google.com/protocol-buffers/docs/downloads). In general you will not have to do so when using TensorFlow, as it comes with functions to create and parse protocol buffers of type `tf.train.Example`, which are generally sufficient. However, in this section we will learn about protocol buffers by creating our own simple protobuf definition, so we need the protobuf compiler (`protoc`): we will use it to compile the protobuf definition to a Python module that we can then use in our code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First let's write a simple protobuf definition:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"%%writefile person.proto\n",
"syntax = \"proto3\";\n",
"message Person {\n",
" string name = 1;\n",
" int32 id = 2;\n",
" repeated string email = 3;\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's compile it (the `--descriptor_set_out` and `--include_imports` options are only required for the `tf.io.decode_proto()` example below):"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"!protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"%ls person*"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"from person_pb2 import Person # import the generated access class\n",
"\n",
"person = Person(name=\"Al\", id=123, email=[\"a@b.com\"]) # create a Person\n",
"print(person) # display the Person"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"person.name # read a field"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"person.name = \"Alice\" # modify a field"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"person.email[0] # repeated fields can be accessed like arrays"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"person.email.append(\"c@d.com\") # add an email address"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"serialized = person.SerializeToString() # serialize person to a byte string\n",
"serialized"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"person2 = Person() # create a new Person\n",
"person2.ParseFromString(serialized) # parse the byte string (27 bytes long)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"person == person2 # now they are equal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom protobuf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In rare cases, you may want to parse a custom protobuf (like the one we just created) in TensorFlow. For this you can use the `tf.io.decode_proto()` function:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows how to use the tf.io.decode_proto() function\n",
"\n",
"person_tf = tf.io.decode_proto(\n",
" bytes=serialized,\n",
" message_type=\"Person\",\n",
" field_names=[\"name\", \"id\", \"email\"],\n",
" output_types=[tf.string, tf.int32, tf.string],\n",
" descriptor_source=\"person.desc\")\n",
"\n",
"person_tf.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details, see the [`tf.io.decode_proto()`](https://www.tensorflow.org/api_docs/python/tf/io/decode_proto) documentation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TensorFlow Protobufs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the definition of the `tf.train.Example` protobuf:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```proto\n",
"syntax = \"proto3\";\n",
"\n",
"message BytesList { repeated bytes value = 1; }\n",
"message FloatList { repeated float value = 1 [packed = true]; }\n",
"message Int64List { repeated int64 value = 1 [packed = true]; }\n",
"message Feature {\n",
" oneof kind {\n",
" BytesList bytes_list = 1;\n",
" FloatList float_list = 2;\n",
" Int64List int64_list = 3;\n",
" }\n",
"};\n",
"message Features { map<string, Feature> feature = 1; };\n",
"message Example { Features features = 1; };\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.train import BytesList, FloatList, Int64List\n",
"from tensorflow.train import Feature, Features, Example\n",
"\n",
"person_example = Example(\n",
" features=Features(\n",
" feature={\n",
" \"name\": Feature(bytes_list=BytesList(value=[b\"Alice\"])),\n",
" \"id\": Feature(int64_list=Int64List(value=[123])),\n",
" \"emails\": Feature(bytes_list=BytesList(value=[b\"a@b.com\",\n",
" b\"c@d.com\"]))\n",
" }))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"with tf.io.TFRecordWriter(\"my_contacts.tfrecord\") as f:\n",
"    for _ in range(5):\n",
"        f.write(person_example.SerializeToString())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading and Parsing Examples"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"feature_description = {\n",
" \"name\": tf.io.FixedLenFeature([], tf.string, default_value=\"\"),\n",
" \"id\": tf.io.FixedLenFeature([], tf.int64, default_value=0),\n",
" \"emails\": tf.io.VarLenFeature(tf.string),\n",
"}\n",
"\n",
"def parse(serialized_example):\n",
" return tf.io.parse_single_example(serialized_example, feature_description)\n",
"\n",
"dataset = tf.data.TFRecordDataset([\"my_contacts.tfrecord\"]).map(parse)\n",
"for parsed_example in dataset:\n",
" print(parsed_example)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"tf.sparse.to_dense(parsed_example[\"emails\"], default_value=b\"\")"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"parsed_example[\"emails\"].values"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"def parse(serialized_examples):\n",
" return tf.io.parse_example(serialized_examples, feature_description)\n",
"\n",
"dataset = tf.data.TFRecordDataset([\"my_contacts.tfrecord\"]).batch(2).map(parse)\n",
"for parsed_examples in dataset:\n",
" print(parsed_examples) # two examples at a time"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"parsed_examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra Material: Storing Images and Tensors in TFRecords"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load and display an example image:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import load_sample_images\n",
"\n",
"img = load_sample_images()[\"images\"][0]\n",
"plt.imshow(img)\n",
"plt.axis(\"off\")\n",
"plt.title(\"Original Image\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's create an `Example` protobuf containing the image encoded as JPEG:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"data = tf.io.encode_jpeg(img)\n",
"example_with_image = Example(features=Features(feature={\n",
" \"image\": Feature(bytes_list=BytesList(value=[data.numpy()]))}))\n",
"serialized_example = example_with_image.SerializeToString()\n",
"with tf.io.TFRecordWriter(\"my_image.tfrecord\") as f:\n",
" f.write(serialized_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's create a tf.data pipeline that will read this TFRecord file, parse each `Example` protobuf (in this case just one), and parse and display the image that the example contains:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"feature_description = { \"image\": tf.io.VarLenFeature(tf.string) }\n",
"\n",
"def parse(serialized_example):\n",
" example_with_image = tf.io.parse_single_example(serialized_example,\n",
" feature_description)\n",
" return tf.io.decode_jpeg(example_with_image[\"image\"].values[0])\n",
" # or you can use tf.io.decode_image() instead\n",
"\n",
"dataset = tf.data.TFRecordDataset(\"my_image.tfrecord\").map(parse)\n",
"for image in dataset:\n",
" plt.imshow(image)\n",
" plt.axis(\"off\")\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, `tf.io.decode_image()` supports the BMP, GIF, JPEG and PNG formats, so it can replace `tf.io.decode_jpeg()` in the `parse()` function above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tensors can be serialized and parsed easily using `tf.io.serialize_tensor()` and `tf.io.parse_tensor()`:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"tensor = tf.constant([[0., 1.], [2., 3.], [4., 5.]])\n",
"serialized = tf.io.serialize_tensor(tensor)\n",
"serialized"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"tf.io.parse_tensor(serialized, out_type=tf.float32)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"sparse_tensor = parsed_example[\"emails\"]\n",
"serialized_sparse = tf.io.serialize_sparse(sparse_tensor)\n",
"serialized_sparse"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"BytesList(value=serialized_sparse.numpy())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling Lists of Lists Using the `SequenceExample` Protobuf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```proto\n",
"syntax = \"proto3\";\n",
"\n",
"message FeatureList { repeated Feature feature = 1; };\n",
"message FeatureLists { map<string, FeatureList> feature_list = 1; };\n",
"message SequenceExample {\n",
" Features context = 1;\n",
" FeatureLists feature_lists = 2;\n",
"};\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.train import FeatureList, FeatureLists, SequenceExample\n",
"\n",
"context = Features(feature={\n",
" \"author_id\": Feature(int64_list=Int64List(value=[123])),\n",
" \"title\": Feature(bytes_list=BytesList(value=[b\"A\", b\"desert\", b\"place\", b\".\"])),\n",
" \"pub_date\": Feature(int64_list=Int64List(value=[1623, 12, 25]))\n",
"})\n",
"\n",
"content = [[\"When\", \"shall\", \"we\", \"three\", \"meet\", \"again\", \"?\"],\n",
" [\"In\", \"thunder\", \",\", \"lightning\", \",\", \"or\", \"in\", \"rain\", \"?\"]]\n",
"comments = [[\"When\", \"the\", \"hurlyburly\", \"'s\", \"done\", \".\"],\n",
" [\"When\", \"the\", \"battle\", \"'s\", \"lost\", \"and\", \"won\", \".\"]]\n",
"\n",
"def words_to_feature(words):\n",
" return Feature(bytes_list=BytesList(value=[word.encode(\"utf-8\")\n",
" for word in words]))\n",
"\n",
"content_features = [words_to_feature(sentence) for sentence in content]\n",
"comments_features = [words_to_feature(comment) for comment in comments]\n",
" \n",
"sequence_example = SequenceExample(\n",
" context=context,\n",
" feature_lists=FeatureLists(feature_list={\n",
" \"content\": FeatureList(feature=content_features),\n",
" \"comments\": FeatureList(feature=comments_features)\n",
" }))"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"sequence_example"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"serialized_sequence_example = sequence_example.SerializeToString()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"context_feature_descriptions = {\n",
" \"author_id\": tf.io.FixedLenFeature([], tf.int64, default_value=0),\n",
" \"title\": tf.io.VarLenFeature(tf.string),\n",
" \"pub_date\": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),\n",
"}\n",
"sequence_feature_descriptions = {\n",
" \"content\": tf.io.VarLenFeature(tf.string),\n",
" \"comments\": tf.io.VarLenFeature(tf.string),\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(\n",
" serialized_sequence_example, context_feature_descriptions,\n",
" sequence_feature_descriptions)\n",
"parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists[\"content\"])"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"parsed_context"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"parsed_context[\"title\"].values"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"parsed_feature_lists"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"print(tf.RaggedTensor.from_sparse(parsed_feature_lists[\"content\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Keras Preprocessing Layers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `Normalization` Layer"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"tf.random.set_seed(42) # extra code ensures reproducibility\n",
"norm_layer = tf.keras.layers.Normalization()\n",
"model = tf.keras.models.Sequential([\n",
" norm_layer,\n",
" tf.keras.layers.Dense(1)\n",
"])\n",
"model.compile(loss=\"mse\", optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3))\n",
"norm_layer.adapt(X_train) # computes the mean and variance of every feature\n",
"model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=5)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"norm_layer = tf.keras.layers.Normalization()\n",
"norm_layer.adapt(X_train)\n",
"X_train_scaled = norm_layer(X_train)\n",
"X_valid_scaled = norm_layer(X_valid)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"tf.random.set_seed(42) # extra code ensures reproducibility\n",
"model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])\n",
"model.compile(loss=\"mse\", optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3))\n",
"model.fit(X_train_scaled, y_train, epochs=5,\n",
" validation_data=(X_valid_scaled, y_valid))"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"final_model = tf.keras.Sequential([norm_layer, model])\n",
"X_new = X_test[:3] # pretend we have a few new instances (unscaled)\n",
"y_pred = final_model(X_new) # preprocesses the data and makes predictions"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"y_pred"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"# extra code creates a dataset to demo applying the norm_layer using map()\n",
"dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(5)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.map(lambda X, y: (norm_layer(X), y))"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"list(dataset.take(1)) # extra code shows the first batch"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"class MyNormalization(tf.keras.layers.Layer):\n",
"    def adapt(self, X):\n",
"        self.mean_ = np.mean(X, axis=0, keepdims=True)\n",
"        self.std_ = np.std(X, axis=0, keepdims=True)\n",
"\n",
"    def call(self, inputs):\n",
"        eps = tf.keras.backend.epsilon()  # a small smoothing term\n",
"        return (inputs - self.mean_) / (self.std_ + eps)"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"my_norm_layer = MyNormalization()\n",
"my_norm_layer.adapt(X_train)\n",
"X_train_scaled = my_norm_layer(X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `Discretization` Layer"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"age = tf.constant([[10.], [93.], [57.], [18.], [37.], [5.]])\n",
"discretize_layer = tf.keras.layers.Discretization(bin_boundaries=[18., 50.])\n",
"age_categories = discretize_layer(age)\n",
"age_categories"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"discretize_layer = tf.keras.layers.Discretization(num_bins=3)\n",
"discretize_layer.adapt(age)\n",
"age_categories = discretize_layer(age)\n",
"age_categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `CategoryEncoding` Layer"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3)\n",
"onehot_layer(age_categories)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"two_age_categories = np.array([[1, 0], [2, 2], [2, 0]])\n",
"onehot_layer(two_age_categories)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3, output_mode=\"count\")\n",
"onehot_layer(two_age_categories)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [],
"source": [
"onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3 + 3)\n",
"onehot_layer(two_age_categories + [0, 3]) # adds 3 to the second feature"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows another way to one-hot encode each feature separately\n",
"onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3,\n",
" output_mode=\"one_hot\")\n",
"tf.keras.layers.concatenate([onehot_layer(cat)\n",
" for cat in tf.transpose(two_age_categories)])"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows another way to do this, using tf.one_hot() and Flatten\n",
"tf.keras.layers.Flatten()(tf.one_hot(two_age_categories, depth=3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `StringLookup` Layer"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"cities = [\"Auckland\", \"Paris\", \"Paris\", \"San Francisco\"]\n",
"str_lookup_layer = tf.keras.layers.StringLookup()\n",
"str_lookup_layer.adapt(cities)\n",
"str_lookup_layer([[\"Paris\"], [\"Auckland\"], [\"Auckland\"], [\"Montreal\"]])"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"str_lookup_layer = tf.keras.layers.StringLookup(num_oov_indices=5)\n",
"str_lookup_layer.adapt(cities)\n",
"str_lookup_layer([[\"Paris\"], [\"Auckland\"], [\"Foo\"], [\"Bar\"], [\"Baz\"]])"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"str_lookup_layer = tf.keras.layers.StringLookup(output_mode=\"one_hot\")\n",
"str_lookup_layer.adapt(cities)\n",
"str_lookup_layer([[\"Paris\"], [\"Auckland\"], [\"Auckland\"], [\"Montreal\"]])"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows an example using the IntegerLookup layer\n",
"ids = [123, 456, 789]\n",
"int_lookup_layer = tf.keras.layers.IntegerLookup()\n",
"int_lookup_layer.adapt(ids)\n",
"int_lookup_layer([[123], [456], [123], [111]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `Hashing` Layer"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"hashing_layer = tf.keras.layers.Hashing(num_bins=10)\n",
"hashing_layer([[\"Paris\"], [\"Tokyo\"], [\"Auckland\"], [\"Montreal\"]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encoding Categorical Features Using Embeddings"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"tf.random.set_seed(42)\n",
"embedding_layer = tf.keras.layers.Embedding(input_dim=5, output_dim=2)\n",
"embedding_layer(np.array([2, 4, 2]))"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"tf.random.set_seed(42)\n",
"ocean_prox = [\"<1H OCEAN\", \"INLAND\", \"NEAR OCEAN\", \"NEAR BAY\", \"ISLAND\"]\n",
"str_lookup_layer = tf.keras.layers.StringLookup()\n",
"str_lookup_layer.adapt(ocean_prox)\n",
"lookup_and_embed = tf.keras.Sequential([\n",
" str_lookup_layer,\n",
" tf.keras.layers.Embedding(input_dim=str_lookup_layer.vocabulary_size(),\n",
" output_dim=2)\n",
"])\n",
"lookup_and_embed(np.array([[\"<1H OCEAN\"], [\"ISLAND\"], [\"<1H OCEAN\"]]))"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [],
"source": [
"# extra code sets seeds and generates fake random data\n",
"# (feel free to load the real dataset if you prefer)\n",
"tf.random.set_seed(42)\n",
"np.random.seed(42)\n",
"X_train_num = np.random.rand(10_000, 8)\n",
"X_train_cat = np.random.choice(ocean_prox, size=10_000)\n",
"y_train = np.random.rand(10_000, 1)\n",
"X_valid_num = np.random.rand(2_000, 8)\n",
"X_valid_cat = np.random.choice(ocean_prox, size=2_000)\n",
"y_valid = np.random.rand(2_000, 1)\n",
"\n",
"num_input = tf.keras.layers.Input(shape=[8], name=\"num\")\n",
"cat_input = tf.keras.layers.Input(shape=[], dtype=tf.string, name=\"cat\")\n",
"cat_embeddings = lookup_and_embed(cat_input) \n",
"encoded_inputs = tf.keras.layers.concatenate([num_input, cat_embeddings])\n",
"outputs = tf.keras.layers.Dense(1)(encoded_inputs)\n",
"model = tf.keras.models.Model(inputs=[num_input, cat_input], outputs=[outputs])\n",
"model.compile(loss=\"mse\", optimizer=\"sgd\")\n",
"history = model.fit((X_train_num, X_train_cat), y_train, epochs=5,\n",
" validation_data=((X_valid_num, X_valid_cat), y_valid))"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows that the model can also be trained using a tf.data.Dataset\n",
"train_set = tf.data.Dataset.from_tensor_slices(\n",
" ((X_train_num, X_train_cat), y_train)).batch(32)\n",
"valid_set = tf.data.Dataset.from_tensor_slices(\n",
" ((X_valid_num, X_valid_cat), y_valid)).batch(32)\n",
"history = model.fit(train_set, epochs=5,\n",
" validation_data=valid_set)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"# extra code shows that the dataset can contain dictionaries\n",
"train_set = tf.data.Dataset.from_tensor_slices(\n",
" ({\"num\": X_train_num, \"cat\": X_train_cat}, y_train)).batch(32)\n",
"valid_set = tf.data.Dataset.from_tensor_slices(\n",
" ({\"num\": X_valid_num, \"cat\": X_valid_cat}, y_valid)).batch(32)\n",
"history = model.fit(train_set, epochs=5, validation_data=valid_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"source": [
"train_data = [\"To be\", \"!(to be)\", \"That's the question\", \"Be, be, be.\"]\n",
"text_vec_layer = tf.keras.layers.TextVectorization()\n",
"text_vec_layer.adapt(train_data)\n",
"text_vec_layer([\"Be good!\", \"Question: be or be?\"])"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [],
"source": [
"text_vec_layer = tf.keras.layers.TextVectorization(ragged=True)\n",
"text_vec_layer.adapt(train_data)\n",
"text_vec_layer([\"Be good!\", \"Question: be or be?\"])"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [],
"source": [
"text_vec_layer = tf.keras.layers.TextVectorization(output_mode=\"tf_idf\")\n",
"text_vec_layer.adapt(train_data)\n",
"text_vec_layer([\"Be good!\", \"Question: be or be?\"])"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"2 * np.log(1 + 4 / (1 + 3))"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"1 * np.log(1 + 4 / (1 + 1))"
]
},
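{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two cells above appear to reproduce by hand the TF-IDF weights of the words \"be\" and \"question\" in the second test sentence: each word count is multiplied by `log(1 + d / (1 + f))`, where `d` is the number of training sentences (4) and `f` is the number of training sentences that contain the word (3 for \"be\", 1 for \"question\"). The sketch below recomputes these two values in plain Python. The `tokenize()` and `tfidf_weight()` helpers are hypothetical names introduced for illustration, and the tokenization (lowercasing, stripping ASCII punctuation, splitting on whitespace) only approximates the layer's default standardization:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"import numpy as np\n",
"\n",
"def tokenize(sentence):\n",
"    # hypothetical helper: lowercase, strip ASCII punctuation, split on whitespace\n",
"    # (an approximation of the layer's default standardization, not its actual code)\n",
"    table = str.maketrans(\"\", \"\", string.punctuation)\n",
"    return sentence.lower().translate(table).split()\n",
"\n",
"def tfidf_weight(word, sentence, corpus):\n",
"    d = len(corpus)  # total number of training sentences\n",
"    f = sum(word in tokenize(doc) for doc in corpus)  # sentences containing the word\n",
"    count = tokenize(sentence).count(word)  # occurrences in the encoded sentence\n",
"    return count * np.log(1 + d / (1 + f))\n",
"\n",
"corpus = [\"To be\", \"!(to be)\", \"That's the question\", \"Be, be, be.\"]\n",
"for word in [\"be\", \"question\"]:\n",
"    print(word, tfidf_weight(word, \"Question: be or be?\", corpus))"
]
},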
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Using Pretrained Language Model Components"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow_hub as hub\n",
"\n",
"hub_layer = hub.KerasLayer(\"https://tfhub.dev/google/nnlm-en-dim50/2\")\n",
"sentence_embeddings = hub_layer(tf.constant([\"To be\", \"Not to be\"]))\n",
"sentence_embeddings.numpy().round(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Image Preprocessing Layers"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_sample_images\n",
"\n",
"images = load_sample_images()[\"images\"]\n",
"crop_image_layer = tf.keras.layers.CenterCrop(height=100, width=100)\n",
"cropped_images = crop_image_layer(images)"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"plt.imshow(images[0])\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
"plt.imshow(cropped_images[0])\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TensorFlow Datasets"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow_datasets as tfds\n",
"\n",
"datasets = tfds.load(name=\"mnist\")\n",
"mnist_train, mnist_test = datasets[\"train\"], datasets[\"test\"]"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
"for batch in mnist_train.shuffle(10_000, seed=42).batch(32).prefetch(1):\n",
"    images = batch[\"image\"]\n",
"    labels = batch[\"label\"]\n",
"    # [...] do something with the images and labels"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"mnist_train = mnist_train.shuffle(10_000, seed=42).batch(32)\n",
"mnist_train = mnist_train.map(lambda items: (items[\"image\"], items[\"label\"]))\n",
"mnist_train = mnist_train.prefetch(1)"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
"train_set, valid_set, test_set = tfds.load(\n",
"    name=\"mnist\",\n",
"    split=[\"train[:90%]\", \"train[90%:]\", \"test\"],\n",
"    as_supervised=True\n",
")\n",
"train_set = train_set.shuffle(10_000, seed=42).batch(32).prefetch(1)\n",
"valid_set = valid_set.batch(32).cache()\n",
"test_set = test_set.batch(32).cache()\n",
"tf.random.set_seed(42)\n",
"model = tf.keras.Sequential([\n",
"    tf.keras.layers.Flatten(input_shape=(28, 28)),\n",
"    tf.keras.layers.Dense(10, activation=\"softmax\")\n",
"])\n",
"model.compile(loss=\"sparse_categorical_crossentropy\", optimizer=\"nadam\",\n",
"              metrics=[\"accuracy\"])\n",
"history = model.fit(train_set, validation_data=valid_set, epochs=5)\n",
"test_loss, test_accuracy = model.evaluate(test_set)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Exercises\n",
"\n",
"## 1. to 8.\n",
"1. Ingesting a large dataset and preprocessing it efficiently can be a complex engineering challenge. The Data API makes it fairly simple. It offers many features, including loading data from various sources (such as text or binary files), reading data in parallel from multiple sources, transforming it, interleaving the records, shuffling the data, batching it, and prefetching it.\n",
"2. Splitting a large dataset into multiple files makes it possible to shuffle it at a coarse level before shuffling it at a finer level using a shuffling buffer. It also makes it possible to handle huge datasets that do not fit on a single machine. It's also simpler to manipulate thousands of small files rather than one huge file; for example, it's easier to split the data into multiple subsets. Lastly, if the data is split across multiple files spread across multiple servers, it is possible to download several files from different servers simultaneously, which improves the bandwidth usage.\n",
"3. You can use TensorBoard to visualize profiling data: if the GPU is not fully utilized then your input pipeline is likely to be the bottleneck. You can fix it by making sure it reads and preprocesses the data in multiple threads in parallel, and ensuring it prefetches a few batches. If this is insufficient to get your GPU to 100% usage during training, make sure your preprocessing code is optimized. You can also try saving the dataset into multiple TFRecord files, and if necessary perform some of the preprocessing ahead of time so that it does not need to be done on the fly during training (TF Transform can help with this). If necessary, use a machine with more CPU and RAM, and ensure that the GPU bandwidth is large enough.\n",
"4. A TFRecord file is composed of a sequence of arbitrary binary records: you can store absolutely any binary data you want in each record. However, in practice most TFRecord files contain sequences of serialized protocol buffers. This makes it possible to benefit from the advantages of protocol buffers, such as the fact that they can be read easily across multiple platforms and languages and their definition can be updated later in a backward-compatible way.\n",
"5. The `Example` protobuf format has the advantage that TensorFlow provides some operations to parse it (the `tf.io.parse_*example()` functions) without you having to define your own format. It is sufficiently flexible to represent instances in most datasets. However, if it does not cover your use case, you can define your own protocol buffer, compile it using `protoc` (setting the `--descriptor_set_out` and `--include_imports` arguments to export the protobuf descriptor), and use the `tf.io.decode_proto()` function to parse the serialized protobufs (see the \"Custom protobuf\" section of the notebook for an example). It's more complicated, and it requires deploying the descriptor along with the model, but it can be done.\n",
"6. When using TFRecords, you will generally want to activate compression if the TFRecord files will need to be downloaded by the training script, as compression will make files smaller and thus reduce download time. But if the files are located on the same machine as the training script, it's usually preferable to leave compression off, to avoid wasting CPU for decompression.\n",
"7. Let's look at the pros and cons of each preprocessing option:\n",
"    * If you preprocess the data when creating the data files, the training script will run faster, since it will not have to perform preprocessing on the fly. In some cases, the preprocessed data will also be much smaller than the original data, so you can save some space and speed up downloads. It may also be helpful to materialize the preprocessed data, for example to inspect it or archive it. However, this approach has a few cons. First, it's not easy to experiment with various preprocessing logics if you need to generate a preprocessed dataset for each variant. Second, if you want to perform data augmentation, you have to materialize many variants of your dataset, which will use a large amount of disk space and take a lot of time to generate. Lastly, the trained model will expect preprocessed data, so you will have to add preprocessing code in your application before it calls the model. There's a risk of code duplication and preprocessing mismatch in this case.\n",
"    * If the data is preprocessed with the tf.data pipeline, it's much easier to tweak the preprocessing logic and apply data augmentation. Also, tf.data makes it easy to build highly efficient preprocessing pipelines (e.g., with multithreading and prefetching). However, preprocessing the data this way will slow down training. Moreover, each training instance will be preprocessed once per epoch rather than just once if the data was preprocessed when creating the data files. Well, unless the dataset fits in RAM and you can cache it using the dataset's `cache()` method. Lastly, the trained model will still expect preprocessed data. But if you use preprocessing layers in your tf.data pipeline to handle the preprocessing step, then you can just reuse these layers in your final model (adding them after training), to avoid code duplication and preprocessing mismatch.\n",
"    * If you add preprocessing layers to your model, you will only have to write the preprocessing code once for both training and inference. If your model needs to be deployed to many different platforms, you will not need to write the preprocessing code multiple times. Plus, you will not run the risk of using the wrong preprocessing logic for your model, since it will be part of the model. On the downside, preprocessing the data on the fly during training will slow things down, and each instance will be preprocessed once per epoch.\n",
"8. Let's look at how to encode categorical text features and text:\n",
"    * To encode a categorical feature that has a natural order, such as a movie rating (e.g., \"bad,\" \"average,\" \"good\"), the simplest option is to use ordinal encoding: sort the categories in their natural order and map each category to its rank (e.g., \"bad\" maps to 0, \"average\" maps to 1, and \"good\" maps to 2). However, most categorical features don't have such a natural order. For example, there's no natural order for professions or countries. In this case, you can use one-hot encoding, or embeddings if there are many categories. With Keras, the `StringLookup` layer can be used for ordinal encoding (using the default `output_mode=\"int\"`), or one-hot encoding (using `output_mode=\"one_hot\"`). It can also perform multi-hot encoding (using `output_mode=\"multi_hot\"`) if you want to encode multiple categorical text features together, assuming they share the same categories and it doesn't matter which feature contributed which category. For trainable embeddings, you must first use the `StringLookup` layer to produce an ordinal encoding, then use the `Embedding` layer.\n",
"    * For text, the `TextVectorization` layer is easy to use and it can work well for simple tasks, or you can use TF Text for more advanced features. However, you'll often want to use pretrained language models, which you can obtain using tools like TF Hub or Hugging Face's Transformers library. These last two options are discussed in Chapter 16."
]
},
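{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the encodings described in answer 8, the cell below applies `StringLookup` to the movie rating example. The vocabulary is passed explicitly in its natural order, and `num_oov_indices=0` is an assumption made here so that the indices start at 0 and match the ranks given in the answer (by default, index 0 is reserved for unknown categories). The variable names are illustrative only:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"\n",
"ratings = [\"bad\", \"average\", \"good\"]  # categories listed in their natural order\n",
"data = tf.constant([[\"good\"], [\"bad\"], [\"average\"]])\n",
"\n",
"# ordinal encoding: each category maps to its rank in the vocabulary\n",
"# (assumption: no OOV bucket, so unknown categories are not supported here)\n",
"ordinal_layer = tf.keras.layers.StringLookup(vocabulary=ratings,\n",
"                                             num_oov_indices=0)\n",
"print(ordinal_layer(data))\n",
"\n",
"# one-hot encoding: one column per category\n",
"onehot_layer = tf.keras.layers.StringLookup(vocabulary=ratings,\n",
"                                            num_oov_indices=0,\n",
"                                            output_mode=\"one_hot\")\n",
"print(onehot_layer(data))\n",
"\n",
"# trainable embeddings: ordinal encoding followed by an Embedding layer\n",
"lookup_and_embed_ratings = tf.keras.Sequential([\n",
"    ordinal_layer,\n",
"    tf.keras.layers.Embedding(input_dim=ordinal_layer.vocabulary_size(),\n",
"                              output_dim=2)\n",
"])\n",
"print(lookup_and_embed_ratings(data))"
]
},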
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9.\n",
"### a.\n",
"_Exercise: Load the Fashion MNIST dataset (introduced in Chapter 10); split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized `Example` protobuf with two features: the serialized image (use `tf.io.serialize_tensor()` to serialize each image), and the label. Note: for large images, you could use `tf.io.encode_jpeg()` instead. This would save a lot of space, but it would lose a bit of image quality._"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [],
"source": [
"(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()\n",
"X_valid, X_train = X_train_full[:5000], X_train_full[5000:]\n",
"y_valid, y_train = y_train_full[:5000], y_train_full[5000:]"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [],
"source": [
"tf.keras.backend.clear_session()\n",
"np.random.seed(42)\n",
"tf.random.set_seed(42)"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [],
"source": [
"train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train))\n",
"valid_set = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))\n",
"test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [],
"source": [
"def create_example(image, label):\n",
"    image_data = tf.io.serialize_tensor(image)\n",
"    #image_data = tf.io.encode_jpeg(image[..., np.newaxis])\n",
"    return Example(\n",
"        features=Features(\n",
"            feature={\n",
"                \"image\": Feature(bytes_list=BytesList(value=[image_data.numpy()])),\n",
"                \"label\": Feature(int64_list=Int64List(value=[label])),\n",
"            }))"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [],
"source": [
"for image, label in valid_set.take(1):\n",
"    print(create_example(image, label))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following function saves a given dataset to a set of TFRecord files. The examples are written to the files in a round-robin fashion. To do this, we enumerate all the examples using the `dataset.enumerate()` method, and we compute `index % n_shards` to decide which file to write to. We use the standard `contextlib.ExitStack` class to make sure that all writers are properly closed whether or not an I/O error occurs while writing."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"from contextlib import ExitStack\n",
"\n",
"def write_tfrecords(name, dataset, n_shards=10):\n",
"    paths = [\"{}.tfrecord-{:05d}-of-{:05d}\".format(name, index, n_shards)\n",
"             for index in range(n_shards)]\n",
"    with ExitStack() as stack:\n",
"        writers = [stack.enter_context(tf.io.TFRecordWriter(path))\n",
"                   for path in paths]\n",
"        for index, (image, label) in dataset.enumerate():\n",
"            shard = index % n_shards\n",
"            example = create_example(image, label)\n",
"            writers[shard].write(example.SerializeToString())\n",
"    return paths"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [],
"source": [
"train_filepaths = write_tfrecords(\"my_fashion_mnist.train\", train_set)\n",
"valid_filepaths = write_tfrecords(\"my_fashion_mnist.valid\", valid_set)\n",
"test_filepaths = write_tfrecords(\"my_fashion_mnist.test\", test_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b.\n",
"_Exercise: Then use tf.data to create an efficient dataset for each set. Finally, train a Keras model on these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data._"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [],
"source": [
"def preprocess(tfrecord):\n",
"    feature_descriptions = {\n",
"        \"image\": tf.io.FixedLenFeature([], tf.string, default_value=\"\"),\n",
"        \"label\": tf.io.FixedLenFeature([], tf.int64, default_value=-1)\n",
"    }\n",
"    example = tf.io.parse_single_example(tfrecord, feature_descriptions)\n",
"    image = tf.io.parse_tensor(example[\"image\"], out_type=tf.uint8)\n",
"    #image = tf.io.decode_jpeg(example[\"image\"])\n",
"    image = tf.reshape(image, shape=[28, 28])\n",
"    return image, example[\"label\"]\n",
"\n",
"def mnist_dataset(filepaths, n_read_threads=5, shuffle_buffer_size=None,\n",
"                  n_parse_threads=5, batch_size=32, cache=True):\n",
"    dataset = tf.data.TFRecordDataset(filepaths,\n",
"                                      num_parallel_reads=n_read_threads)\n",
"    if cache:\n",
"        dataset = dataset.cache()\n",
"    if shuffle_buffer_size:\n",
"        dataset = dataset.shuffle(shuffle_buffer_size)\n",
"    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)\n",
"    dataset = dataset.batch(batch_size)\n",
"    return dataset.prefetch(1)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"train_set = mnist_dataset(train_filepaths, shuffle_buffer_size=60000)\n",
"valid_set = mnist_dataset(valid_filepaths)\n",
"test_set = mnist_dataset(test_filepaths)"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"for X, y in train_set.take(1):\n",
"    for i in range(5):\n",
"        plt.subplot(1, 5, i + 1)\n",
"        plt.imshow(X[i].numpy(), cmap=\"binary\")\n",
"        plt.axis(\"off\")\n",
"        plt.title(str(y[i].numpy()))"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [],
"source": [
"tf.keras.backend.clear_session()\n",
"tf.random.set_seed(42)\n",
"np.random.seed(42)\n",
"\n",
"class Standardization(tf.keras.layers.Layer):\n",
"    def adapt(self, data_sample):\n",
"        self.means_ = np.mean(data_sample, axis=0, keepdims=True)\n",
"        self.stds_ = np.std(data_sample, axis=0, keepdims=True)\n",
"    def call(self, inputs):\n",
"        return (inputs - self.means_) / (self.stds_ + tf.keras.backend.epsilon())\n",
"\n",
"standardization = Standardization(input_shape=[28, 28])\n",
"# or use the built-in layer (by default it standardizes per column, not per pixel):\n",
"#standardization = tf.keras.layers.Normalization()\n",
"\n",
"sample_image_batches = train_set.take(100).map(lambda image, label: image)\n",
"sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()),\n",
"                               axis=0).astype(np.float32)\n",
"standardization.adapt(sample_images)\n",
"\n",
"model = tf.keras.Sequential([\n",
"    standardization,\n",
"    tf.keras.layers.Flatten(),\n",
"    tf.keras.layers.Dense(100, activation=\"relu\"),\n",
"    tf.keras.layers.Dense(10, activation=\"softmax\")\n",
"])\n",
"model.compile(loss=\"sparse_categorical_crossentropy\",\n",
"              optimizer=\"nadam\", metrics=[\"accuracy\"])"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# parenthesize the name: \"/\" binds tighter than \"+\", and Path + str raises a TypeError\n",
"logs = Path() / \"my_logs\" / (\"run_\" + datetime.now().strftime(\"%Y%m%d_%H%M%S\"))\n",
"\n",
"tensorboard_cb = tf.keras.callbacks.TensorBoard(\n",
"    log_dir=logs, histogram_freq=1, profile_batch=10)\n",
"\n",
"model.fit(train_set, epochs=5, validation_data=valid_set,\n",
"          callbacks=[tensorboard_cb])"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"%load_ext tensorboard\n",
"%tensorboard --logdir=./my_logs --port=6006"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10.\n",
"_Exercise: In this exercise you will download a dataset, split it, create a `tf.data.Dataset` to load it and preprocess it efficiently, then build and train a binary classification model containing an `Embedding` layer._\n",
"\n",
"### a.\n",
"_Exercise: Download the [Large Movie Review Dataset](https://homl.info/imdb), which contains 50,000 movie reviews from the [Internet Movie Database](https://imdb.com/). The data is organized in two directories, `train` and `test`, each containing a `pos` subdirectory with 12,500 positive reviews and a `neg` subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise._"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"root = \"http://ai.stanford.edu/~amaas/data/sentiment/\"\n",
"filename = \"aclImdb_v1.tar.gz\"\n",
"filepath = tf.keras.utils.get_file(filename, root + filename, extract=True)\n",
"path = Path(filepath).with_name(\"aclImdb\")\n",
"path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define a `tree()` function to view the structure of the `aclImdb` directory:"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [],
"source": [
"def tree(path, level=0, indent=4, max_files=3):\n",
"    if level == 0:\n",
"        print(f\"{path}/\")\n",
"        level += 1\n",
"    sub_paths = sorted(path.iterdir())\n",
"    sub_dirs = [sub_path for sub_path in sub_paths if sub_path.is_dir()]\n",
"    filepaths = [sub_path for sub_path in sub_paths if sub_path not in sub_dirs]\n",
"    indent_str = \" \" * indent * level\n",
"    for sub_dir in sub_dirs:\n",
"        print(f\"{indent_str}{sub_dir.name}/\")\n",
"        tree(sub_dir, level + 1, indent, max_files)  # propagate max_files\n",
"    for filepath in filepaths[:max_files]:\n",
"        print(f\"{indent_str}{filepath.name}\")\n",
"    if len(filepaths) > max_files:\n",
"        print(f\"{indent_str}...\")"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"tree(path)"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [],
"source": [
"def review_paths(dirpath):\n",
"    return [str(path) for path in dirpath.glob(\"*.txt\")]\n",
"\n",
"train_pos = review_paths(path / \"train\" / \"pos\")\n",
"train_neg = review_paths(path / \"train\" / \"neg\")\n",
"test_valid_pos = review_paths(path / \"test\" / \"pos\")\n",
"test_valid_neg = review_paths(path / \"test\" / \"neg\")\n",
"\n",
"len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b.\n",
"_Exercise: Split the test set into a validation set (15,000) and a test set (10,000)._"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [],
"source": [
"np.random.shuffle(test_valid_pos)\n",
"np.random.shuffle(test_valid_neg)  # shuffle the negative reviews as well\n",
"\n",
"test_pos = test_valid_pos[:5000]\n",
"test_neg = test_valid_neg[:5000]\n",
"valid_pos = test_valid_pos[5000:]\n",
"valid_neg = test_valid_neg[5000:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### c.\n",
"_Exercise: Use tf.data to create an efficient dataset for each set._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the dataset fits in memory, we can just load all the data using pure Python code and use `tf.data.Dataset.from_tensor_slices()`:"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [],
"source": [
"def imdb_dataset(filepaths_positive, filepaths_negative):\n",
"    reviews = []\n",
"    labels = []\n",
"    for filepaths, label in ((filepaths_negative, 0), (filepaths_positive, 1)):\n",
"        for filepath in filepaths:\n",
"            # explicit encoding avoids depending on the platform default\n",
"            with open(filepath, encoding=\"utf-8\") as review_file:\n",
"                reviews.append(review_file.read())\n",
"            labels.append(label)\n",
"    return tf.data.Dataset.from_tensor_slices(\n",
"        (tf.constant(reviews), tf.constant(labels)))"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"for X, y in imdb_dataset(train_pos, train_neg).take(3):\n",
"    print(X)\n",
"    print(y)\n",
"    print()"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [],
"source": [
"%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It takes about 17 seconds to load the dataset and go through it 10 times."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But let's pretend the dataset does not fit in memory, just to make things more interesting. Luckily, each review fits on just one line (they use `<br />` to indicate line breaks), so we can read the reviews using a `TextLineDataset`. If they didn't, we would have to preprocess the input files (e.g., converting them to TFRecords). For very large datasets, it would make sense to use a tool like Apache Beam for that."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [],
"source": [
"def imdb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):\n",
"    dataset_neg = tf.data.TextLineDataset(filepaths_negative,\n",
"                                          num_parallel_reads=n_read_threads)\n",
"    dataset_neg = dataset_neg.map(lambda review: (review, 0))\n",
"    dataset_pos = tf.data.TextLineDataset(filepaths_positive,\n",
"                                          num_parallel_reads=n_read_threads)\n",
"    dataset_pos = dataset_pos.map(lambda review: (review, 1))\n",
"    return tf.data.Dataset.concatenate(dataset_pos, dataset_neg)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [],
"source": [
"%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it takes about 33 seconds to go through the dataset 10 times. That's much slower, essentially because the dataset is not cached in RAM, so it must be reloaded at each epoch. If you add `.cache()` just before `.repeat(10)`, you will see that this implementation will be about as fast as the previous one."
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"%timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).cache().repeat(10): pass"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 32\n",
"\n",
"train_set = imdb_dataset(train_pos, train_neg).shuffle(25000).batch(batch_size).prefetch(1)\n",
"valid_set = imdb_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)\n",
"test_set = imdb_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### d.\n",
"_Exercise: Create a binary classification model, using a `TextVectorization` layer to preprocess each review. If the `TextVectorization` layer is not yet available (or if you like a challenge), try to create your own custom preprocessing layer: you can use the functions in the `tf.strings` package, for example `lower()` to make everything lowercase, `regex_replace()` to replace punctuation with spaces, and `split()` to split words on spaces. You should use a lookup table to output word indices, which must be prepared in the `adapt()` method._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first write a function to preprocess the reviews, cropping them to 300 characters, converting them to lower case, then replacing `<br />` and all non-letter characters to spaces, splitting the reviews into words, and finally padding or cropping each review so it ends up with exactly `n_words` tokens:"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"def preprocess(X_batch, n_words=50):\n",
" shape = tf.shape(X_batch) * tf.constant([1, 0]) + tf.constant([0, n_words])\n",
" Z = tf.strings.substr(X_batch, 0, 300)\n",
" Z = tf.strings.lower(Z)\n",
" Z = tf.strings.regex_replace(Z, b\"<br\\\\s*/?>\", b\" \")\n",
" Z = tf.strings.regex_replace(Z, b\"[^a-z]\", b\" \")\n",
" Z = tf.strings.split(Z)\n",
" return Z.to_tensor(shape=shape, default_value=b\"<pad>\")\n",
"\n",
"X_example = tf.constant([\"It's a great, great movie! I loved it.\", \"It was terrible, run away!!!\"])\n",
"preprocess(X_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's write a second utility function that will take a data sample with the same format as the output of the `preprocess()` function, and will output the list of the top `max_size` most frequent words, ensuring that the padding token is first:"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def get_vocabulary(data_sample, max_size=1000):\n",
" preprocessed_reviews = preprocess(data_sample).numpy()\n",
" counter = Counter()\n",
" for words in preprocessed_reviews:\n",
" for word in words:\n",
" if word != b\"<pad>\":\n",
" counter[word] += 1\n",
" return [b\"<pad>\"] + [word for word, count in counter.most_common(max_size)]\n",
"\n",
"get_vocabulary(X_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are ready to create the `TextVectorization` layer. Its constructor just saves the hyperparameters (`max_vocabulary_size` and `n_oov_buckets`). The `adapt()` method computes the vocabulary using the `get_vocabulary()` function, then it builds a `StaticVocabularyTable` (see Chapter 16 for more details). The `call()` method preprocesses the reviews to get a padded list of words for each review, then it uses the `StaticVocabularyTable` to lookup the index of each word in the vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [],
"source": [
"class TextVectorization(tf.keras.layers.Layer):\n",
" def __init__(self, max_vocabulary_size=1000, n_oov_buckets=100, dtype=tf.string, **kwargs):\n",
" super().__init__(dtype=dtype, **kwargs)\n",
" self.max_vocabulary_size = max_vocabulary_size\n",
" self.n_oov_buckets = n_oov_buckets\n",
"\n",
" def adapt(self, data_sample):\n",
" self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)\n",
" words = tf.constant(self.vocab)\n",
" word_ids = tf.range(len(self.vocab), dtype=tf.int64)\n",
" vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)\n",
" self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets)\n",
" \n",
" def call(self, inputs):\n",
" preprocessed_inputs = preprocess(inputs)\n",
" return self.table.lookup(preprocessed_inputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try it on our small `X_example` we defined earlier:"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [],
"source": [
"text_vectorization = TextVectorization()\n",
"\n",
"text_vectorization.adapt(X_example)\n",
"text_vectorization(X_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks good! As you can see, each review was cleaned up and tokenized, then each word was encoded as its index in the vocabulary (all the 0s correspond to the `<pad>` tokens).\n",
"\n",
"Now let's create another `TextVectorization` layer and let's adapt it to the full IMDB training set (if the training set did not fit in RAM, we could just use a smaller sample of the training set by calling `train_set.take(500)`):"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [],
"source": [
"max_vocabulary_size = 1000\n",
"n_oov_buckets = 100\n",
"\n",
"sample_review_batches = train_set.map(lambda review, label: review)\n",
"sample_reviews = np.concatenate(list(sample_review_batches.as_numpy_iterator()),\n",
" axis=0)\n",
"\n",
"text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets,\n",
" input_shape=[])\n",
"text_vectorization.adapt(sample_reviews)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's run it on the same `X_example`, just to make sure the word IDs are larger now, since the vocabulary is bigger:"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [],
"source": [
"text_vectorization(X_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good! Now let's take a look at the first 10 words in the vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [],
"source": [
"text_vectorization.vocab[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are the most common words in the reviews."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now to build our model we will need to encode all these word IDs somehow. One approach is to create bags of words: for each review, and for each word in the vocabulary, we count the number of occurences of that word in the review. For example:"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [],
"source": [
"simple_example = tf.constant([[1, 3, 1, 0, 0], [2, 2, 0, 0, 0]])\n",
"tf.reduce_sum(tf.one_hot(simple_example, 4), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first review has 2 times the word 0, 2 times the word 1, 0 times the word 2, and 1 time the word 3, so its bag-of-words representation is `[2, 2, 0, 1]`. Similarly, the second review has 3 times the word 0, 0 times the word 1, and so on. Let's wrap this logic in a small custom layer, and let's test it. We'll drop the counts for the word 0, since this corresponds to the `<pad>` token, which we don't care about."
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [],
"source": [
"class BagOfWords(tf.keras.layers.Layer):\n",
" def __init__(self, n_tokens, dtype=tf.int32, **kwargs):\n",
" super().__init__(dtype=dtype, **kwargs)\n",
" self.n_tokens = n_tokens\n",
" def call(self, inputs):\n",
" one_hot = tf.one_hot(inputs, self.n_tokens)\n",
" return tf.reduce_sum(one_hot, axis=1)[:, 1:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's test it:"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [],
"source": [
"bag_of_words = BagOfWords(n_tokens=4)\n",
"bag_of_words(simple_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It works fine! Now let's create another `BagOfWord` with the right vocabulary size for our training set:"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [],
"source": [
"n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 for <pad>\n",
"bag_of_words = BagOfWords(n_tokens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're ready to train the model!"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [],
"source": [
"model = tf.keras.Sequential([\n",
" text_vectorization,\n",
" bag_of_words,\n",
" tf.keras.layers.Dense(100, activation=\"relu\"),\n",
" tf.keras.layers.Dense(1, activation=\"sigmoid\"),\n",
"])\n",
"model.compile(loss=\"binary_crossentropy\", optimizer=\"nadam\",\n",
" metrics=[\"accuracy\"])\n",
"model.fit(train_set, epochs=5, validation_data=valid_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get about 73.5% accuracy on the validation set after just the first epoch, but after that the model makes no significant progress. We will do better in Chapter 16. For now the point is just to perform efficient preprocessing using `tf.data` and Keras preprocessing layers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### e.\n",
"_Exercise: Add an `Embedding` layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model._"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute the mean embedding for each review, and multiply it by the square root of the number of words in that review, we will need a little function. For each sentence, this function needs to compute $M \\times \\sqrt N$, where $M$ is the mean of all the word embeddings in the sentence (excluding padding tokens), and $N$ is the number of words in the sentence (also excluding padding tokens). We can rewrite $M$ as $\\dfrac{S}{N}$, where $S$ is the sum of all word embeddings (it does not matter whether or not we include the padding tokens in this sum, since their representation is a zero vector). So the function must return $M \\times \\sqrt N = \\dfrac{S}{N} \\times \\sqrt N = \\dfrac{S}{\\sqrt N \\times \\sqrt N} \\times \\sqrt N= \\dfrac{S}{\\sqrt N}$."
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [],
"source": [
"def compute_mean_embedding(inputs):\n",
" not_pad = tf.math.count_nonzero(inputs, axis=-1)\n",
" n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True) \n",
" sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))\n",
" return tf.reduce_sum(inputs, axis=1) / sqrt_n_words\n",
"\n",
"another_example = tf.constant([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],\n",
" [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])\n",
"compute_mean_embedding(another_example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check that this is correct. The first review contains 2 words (the last token is a zero vector, which represents the `<pad>` token). Let's compute the mean embedding for these 2 words, and multiply the result by the square root of 2:"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [],
"source": [
"tf.reduce_mean(another_example[0:1, :2], axis=1) * tf.sqrt(2.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks good! Now let's check the second review, which contains just one word (we ignore the two padding tokens):"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [],
"source": [
"tf.reduce_mean(another_example[1:2, :1], axis=1) * tf.sqrt(1.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perfect. Now we're ready to train our final model. It's the same as before, except we replaced the `BagOfWords` layer with an `Embedding` layer followed by a `Lambda` layer that calls the `compute_mean_embedding` layer:"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [],
"source": [
"embedding_size = 20\n",
"\n",
"model = tf.keras.Sequential([\n",
" text_vectorization,\n",
" tf.keras.layers.Embedding(input_dim=n_tokens,\n",
" output_dim=embedding_size,\n",
" mask_zero=True), # <pad> tokens => zero vectors\n",
" tf.keras.layers.Lambda(compute_mean_embedding),\n",
" tf.keras.layers.Dense(100, activation=\"relu\"),\n",
" tf.keras.layers.Dense(1, activation=\"sigmoid\"),\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### f.\n",
"_Exercise: Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible._"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [],
"source": [
"model.compile(loss=\"binary_crossentropy\", optimizer=\"nadam\", metrics=[\"accuracy\"])\n",
"model.fit(train_set, epochs=5, validation_data=valid_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is not better using embeddings (but we will do better in Chapter 16). The pipeline looks fast enough (we optimized it earlier)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### g.\n",
"_Exercise: Use TFDS to load the same dataset more easily: `tfds.load(\"imdb_reviews\")`._"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow_datasets as tfds\n",
"\n",
"datasets = tfds.load(name=\"imdb_reviews\")\n",
"train_set, test_set = datasets[\"train\"], datasets[\"test\"]"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [],
"source": [
"for example in train_set.take(1):\n",
" print(example[\"text\"])\n",
" print(example[\"label\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TODO: remove?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that field 4 is interpreted as a string."
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [],
"source": [
"record_defaults=[0, np.nan, tf.constant(np.nan, dtype=tf.float64), \"Hello\", tf.constant([])]\n",
"parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults)\n",
"parsed_fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that all missing fields are replaced with their default value, when provided:"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {},
"outputs": [],
"source": [
"parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)\n",
"parsed_fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 5th field is compulsory (since we provided `tf.constant([])` as the \"default value\"), so we get an exception if we do not provide it:"
]
},
{
"cell_type": "code",
"execution_count": 160,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" parsed_fields = tf.io.decode_csv(',,,,', record_defaults)\n",
"except tf.errors.InvalidArgumentError as ex:\n",
" print(ex)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of fields should match exactly the number of fields in the `record_defaults`:"
]
},
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)\n",
"except tf.errors.InvalidArgumentError as ex:\n",
" print(ex)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"nav_menu": {
"height": "264px",
"width": "369px"
},
"toc": {
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}