Merge branch 'main' of github.com:ageron/handson-ml3

main
Aurélien Geron 2022-05-18 15:46:37 +12:00
commit 830eecfb43
7 changed files with 161 additions and 303 deletions

@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*This notebook contains all the sample code and solutions to the exercices in chapter 2.*"
"*This notebook contains all the sample code and solutions to the exercises in chapter 2.*"
]
},
{
@ -3747,7 +3747,7 @@
"\n",
"preprocessing = make_column_transformer(\n",
" (num_pipeline, make_column_selector(dtype_include=np.number)),\n",
" (cat_pipeline, make_column_selector(dtype_include=np.object)),\n",
" (cat_pipeline, make_column_selector(dtype_include=object)),\n",
")"
]
},
@ -3918,7 +3918,7 @@
" (\"log\", log_pipeline, [\"total_bedrooms\", \"total_rooms\",\n",
" \"population\", \"households\", \"median_income\"]),\n",
" (\"geo\", cluster_simil, [\"latitude\", \"longitude\"]),\n",
" (\"cat\", cat_pipeline, make_column_selector(dtype_include=np.object)),\n",
" (\"cat\", cat_pipeline, make_column_selector(dtype_include=object)),\n",
" ],\n",
" remainder=default_num_pipeline) # one column remaining: housing_median_age"
]
@ -4381,7 +4381,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning:** the following cell make take a few minutes to run:"
"**Warning:** the following cell may take a few minutes to run:"
]
},
{
@ -4660,7 +4660,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning:** the following cell make take a few minutes to run:"
"**Warning:** the following cell may take a few minutes to run:"
]
},
{
@ -5214,7 +5214,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we could use a z-scores rather than t-scores—since the test set is not too small, it won't make a big difference:"
"Alternatively, we could use a z-score rather than a t-score. Since the test set is not too small, it won't make a big difference:"
]
},
{
@ -5234,7 +5234,7 @@
}
],
"source": [
"# extra code computes a confidence interval again using z-score\n",
"# extra code computes a confidence interval again using a z-score\n",
"zscore = stats.norm.ppf((1 + confidence) / 2)\n",
"zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
"np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)"
@ -5746,7 +5746,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than restrict ourselves to k-Nearest Neighbors regressors, let's create a transform that accepts any regressor. For this, we can extend the `MetaEstimatorMixin` and have a required `estimator` argument in the constructor. The `fit()` method must work on a clone of this estimator, and it must also save `feature_names_in_`. The `MetaEstimatorMixin` will ensure that `estimator` is listed as a required parameter, and it will update `get_params()` and `set_params()` to make the estimator's hyperparameters available for tuning. Lastly, we create a `get_feature_names_out()` method: the output column name is the "
"Rather than restrict ourselves to k-Nearest Neighbors regressors, let's create a transformer that accepts any regressor. For this, we can extend the `MetaEstimatorMixin` and have a required `estimator` argument in the constructor. The `fit()` method must work on a clone of this estimator, and it must also save `feature_names_in_`. The `MetaEstimatorMixin` will ensure that `estimator` is listed as a required parameter, and it will update `get_params()` and `set_params()` to make the estimator's hyperparameters available for tuning. Lastly, we create a `get_feature_names_out()` method: the output column name is the ..."
]
},
{
@ -6070,7 +6070,7 @@
" self.scale_ = X.std(axis=0)\n",
" self.n_features_in_ = X.shape[1] # every estimator stores this in fit()\n",
" if hasattr(X_orig, \"columns\"):\n",
" self.feature_names_in_ = np.array(X_orig.columns, dtype=np.object)\n",
" self.feature_names_in_ = np.array(X_orig.columns, dtype=object)\n",
" return self # always return self!\n",
"\n",
" def transform(self, X):\n",
@ -6133,7 +6133,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's ensure we the transformation works as expected:"
"Now let's ensure the transformation works as expected:"
]
},
{
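
A note on the `np.object` to `object` replacements in this file: `np.object` was only an alias for the Python builtin `object`. The alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so the builtin is the spelling that keeps working. The sketch below is not part of the commit. It assumes a hypothetical two-column DataFrame (not the housing data) to show that the column selector and a `feature_names_in_`-style array behave as intended with the builtin.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame standing in for the housing data (assumption).
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6],
    # Force the object dtype explicitly so the example does not depend on
    # how a given pandas version infers the dtype of string columns.
    "ocean_proximity": pd.Series(["NEAR BAY", "INLAND", "NEAR BAY"],
                                 dtype=object),
})

# A column selector is a callable: given a DataFrame, it returns the
# names of the columns whose dtype matches.
num_cols = make_column_selector(dtype_include=np.number)(df)
cat_cols = make_column_selector(dtype_include=object)(df)

# Same idea as the feature_names_in_ line in the diff above.
feature_names = np.array(df.columns, dtype=object)

print(num_cols, cat_cols, feature_names.dtype)
```

Using the builtin works on NumPy versions both before and after the alias was removed, so no version check is needed.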

@ -498,7 +498,7 @@
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.base import clone\n",
"\n",
"skfolds = StratifiedKFold(n_splits=3) # add shuffle=True is the dataset is not\n",
"skfolds = StratifiedKFold(n_splits=3) # add shuffle=True if the dataset is not\n",
" # already shuffled\n",
"for train_index, test_index in skfolds.split(X_train, y_train_5):\n",
" clone_clf = clone(sgd_clf)\n",
@ -1608,7 +1608,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning:** the following two cells make take a few minutes each to run:"
"**Warning:** the following two cells may take a few minutes each to run:"
]
},
{
@ -1950,7 +1950,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell may take a few minutes:"
"**Warning**: the following cell may take a few minutes to run:"
]
},
{
@ -2177,7 +2177,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we tuning the hyperparameters can help. To speed up the search, let's train only on the first 10,000 images:"
"Let's see if tuning the hyperparameters can help. To speed up the search, let's train only on the first 10,000 images:"
]
},
{
@ -2295,7 +2295,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise: _Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. You can use the `shift()` function from the `scipy.ndimage.interpolation` module. For example, `shift(image, [2, 1], cval=0)` shifts the image two pixels down and one pixel to the right. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called _data augmentation_ or _training set expansion_._"
"Exercise: _Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. You can use the `shift()` function from the `scipy.ndimage` module. For example, `shift(image, [2, 1], cval=0)` shifts the image two pixels down and one pixel to the right. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called _data augmentation_ or _training set expansion_._"
]
},
{
@ -2311,7 +2311,7 @@
"metadata": {},
"outputs": [],
"source": [
"from scipy.ndimage.interpolation import shift"
"from scipy.ndimage import shift"
]
},
{
@ -2455,23 +2455,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: the following cell may take a few minutes to run."
"**Warning**: the following cell may take a few minutes to run:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": "0.9763"
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"augmented_accuracy = knn_clf.score(X_test, y_test)"
"augmented_accuracy = knn_clf.score(X_test, y_test)\n",
"augmented_accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By simply augmenting the data, we got a 0.5% accuracy boost. Perhaps this does not sound so impressive, but this actually means that the error rate dropped significantly:"
"By simply augmenting the data, we've got a 0.5% accuracy boost. Perhaps it does not sound so impressive, but it actually means that the error rate dropped significantly:"
]
},
{
@ -2558,7 +2568,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The data is already split into a training set and a test set. However, the test data does *not* contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and upload them to Kaggle to see your final score."
"The data is already split into a training set and a test set. However, the test data does *not* contain the labels: your goal is to train the best model you can on the training data, then make your predictions on the test data and upload them to Kaggle to see your final score."
]
},
{
@ -3275,7 +3285,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we could just build a CSV file with these predictions (respecting the format excepted by Kaggle), then upload it and hope for the best. But wait! We can do better than hope. Why don't we use cross-validation to have an idea of how good our model is?"
"And now we could just build a CSV file with these predictions (respecting the format expected by Kaggle), then upload it and hope for the best. But wait! We can do better than hope. Why don't we use cross-validation to have an idea of how good our model is?"
]
},
{
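
A note on the `scipy.ndimage` import change in this file: `shift()` is exposed directly by `scipy.ndimage`, and importing it from the `scipy.ndimage.interpolation` submodule triggers a deprecation warning in recent SciPy releases. The sketch below is not part of the commit. It assumes a hypothetical blank 28x28 array with a single bright pixel (rather than a real MNIST digit) to illustrate the `shift(image, [2, 1], cval=0)` call mentioned in the exercise text.

```python
import numpy as np
from scipy.ndimage import shift

# Hypothetical 28x28 "image" with one bright pixel (assumption: a stand-in
# for an MNIST digit, which has the same shape).
image = np.zeros((28, 28))
image[10, 10] = 1.0

# Shift two pixels down and one pixel to the right; pixels that enter
# from outside the frame are filled with cval.
shifted = shift(image, [2, 1], cval=0)

# Locate the brightest pixel after the shift: it moved from (10, 10)
# to row 12, column 11.
row, col = np.unravel_index(np.argmax(shifted), shifted.shape)
print(row, col)
```

The shift vector is ordered `[rows, columns]`, which is why the first component moves the image vertically.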

@ -75,7 +75,7 @@ Notes:
- Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
2. Feature selection (optional):
- Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriates:
3. Feature engineering, where appropriate:
- Discretize continuous features.
- Decompose features (e.g., categorical, date/time, etc.).
- Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
@ -104,8 +104,8 @@ Notes:
1. Fine-tune the hyperparameters using cross-validation.
- Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value? Or just drop the rows?).
- Unless there are very few hyperparamter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams ([https://goo.gl/PEFfGr](https://goo.gl/PEFfGr)))
2. Try Ensemble methods. Combining your best models will often perform better than running them invdividually.
- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams ([https://goo.gl/PEFfGr](https://goo.gl/PEFfGr)))
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
> Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.
@ -125,5 +125,5 @@ Notes:
2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
- Beware of slow degradation too: models tend to "rot" as data evolves.
- Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
- Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particulary important for online learning systems.
- Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).

@ -16,7 +16,7 @@ scikit-learn~=1.0.2
# Optional: the XGBoost library is only used in chapter 7
xgboost~=1.5.0
# Optional: the transformers library is only using in chapter 16
# Optional: the transformers library is only used in chapter 16
transformers~=4.16.2
##### TensorFlow-related packages

@ -30,7 +30,7 @@
},
"source": [
"# Table of Contents\n",
" <p><div class=\"lev1\"><a href=\"#Plotting-your-first-graph\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Plotting your first graph</a></div><div class=\"lev1\"><a href=\"#Line-style-and-color\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Line style and color</a></div><div class=\"lev1\"><a href=\"#Saving-a-figure\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Saving a figure</a></div><div class=\"lev1\"><a href=\"#Subplots\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Subplots</a></div><div class=\"lev1\"><a href=\"#Multiple-figures\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Multiple figures</a></div><div class=\"lev1\"><a href=\"#Pyplot's-state-machine:-implicit-vs-explicit\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Pyplot's state machine: implicit <em>vs</em> explicit</a></div><div class=\"lev1\"><a href=\"#Pylab-vs-Pyplot-vs-Matplotlib\"><span class=\"toc-item-num\">7&nbsp;&nbsp;</span>Pylab <em>vs</em> Pyplot <em>vs</em> Matplotlib</a></div><div class=\"lev1\"><a href=\"#Drawing-text\"><span class=\"toc-item-num\">8&nbsp;&nbsp;</span>Drawing text</a></div><div class=\"lev1\"><a href=\"#Legends\"><span class=\"toc-item-num\">9&nbsp;&nbsp;</span>Legends</a></div><div class=\"lev1\"><a href=\"#Non-linear-scales\"><span class=\"toc-item-num\">10&nbsp;&nbsp;</span>Non linear scales</a></div><div class=\"lev1\"><a href=\"#Ticks-and-tickers\"><span class=\"toc-item-num\">11&nbsp;&nbsp;</span>Ticks and tickers</a></div><div class=\"lev1\"><a href=\"#Polar-projection\"><span class=\"toc-item-num\">12&nbsp;&nbsp;</span>Polar projection</a></div><div class=\"lev1\"><a href=\"#3D-projection\"><span class=\"toc-item-num\">13&nbsp;&nbsp;</span>3D projection</a></div><div class=\"lev1\"><a href=\"#Scatter-plot\"><span class=\"toc-item-num\">14&nbsp;&nbsp;</span>Scatter plot</a></div><div class=\"lev1\"><a href=\"#Lines\"><span class=\"toc-item-num\">15&nbsp;&nbsp;</span>Lines</a></div><div class=\"lev1\"><a href=\"#Histograms\"><span 
class=\"toc-item-num\">16&nbsp;&nbsp;</span>Histograms</a></div><div class=\"lev1\"><a href=\"#Images\"><span class=\"toc-item-num\">17&nbsp;&nbsp;</span>Images</a></div><div class=\"lev1\"><a href=\"#Animations\"><span class=\"toc-item-num\">18&nbsp;&nbsp;</span>Animations</a></div><div class=\"lev1\"><a href=\"#Saving-animations-to-video-files\"><span class=\"toc-item-num\">19&nbsp;&nbsp;</span>Saving animations to video files</a></div><div class=\"lev1\"><a href=\"#What-next?\"><span class=\"toc-item-num\">20&nbsp;&nbsp;</span>What next?</a></div>"
" <p><div class=\"lev1\"><a href=\"#Plotting-your-first-graph\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Plotting your first graph</a></div><div class=\"lev1\"><a href=\"#Line-style-and-color\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Line style and color</a></div><div class=\"lev1\"><a href=\"#Saving-a-figure\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Saving a figure</a></div><div class=\"lev1\"><a href=\"#Subplots\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Subplots</a></div><div class=\"lev1\"><a href=\"#Multiple-figures\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Multiple figures</a></div><div class=\"lev1\"><a href=\"#Pyplot's-state-machine:-implicit-vs-explicit\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Pyplot's state machine: implicit <em>vs</em> explicit</a></div><div class=\"lev1\"><a href=\"#Pylab-vs-Pyplot-vs-Matplotlib\"><span class=\"toc-item-num\">7&nbsp;&nbsp;</span>Pylab <em>vs</em> Pyplot <em>vs</em> Matplotlib</a></div><div class=\"lev1\"><a href=\"#Drawing-text\"><span class=\"toc-item-num\">8&nbsp;&nbsp;</span>Drawing text</a></div><div class=\"lev1\"><a href=\"#Legends\"><span class=\"toc-item-num\">9&nbsp;&nbsp;</span>Legends</a></div><div class=\"lev1\"><a href=\"#Non-linear-scales\"><span class=\"toc-item-num\">10&nbsp;&nbsp;</span>Non-linear scales</a></div><div class=\"lev1\"><a href=\"#Ticks-and-tickers\"><span class=\"toc-item-num\">11&nbsp;&nbsp;</span>Ticks and tickers</a></div><div class=\"lev1\"><a href=\"#Polar-projection\"><span class=\"toc-item-num\">12&nbsp;&nbsp;</span>Polar projection</a></div><div class=\"lev1\"><a href=\"#3D-projection\"><span class=\"toc-item-num\">13&nbsp;&nbsp;</span>3D projection</a></div><div class=\"lev1\"><a href=\"#Scatter-plot\"><span class=\"toc-item-num\">14&nbsp;&nbsp;</span>Scatter plot</a></div><div class=\"lev1\"><a href=\"#Lines\"><span class=\"toc-item-num\">15&nbsp;&nbsp;</span>Lines</a></div><div class=\"lev1\"><a href=\"#Histograms\"><span 
class=\"toc-item-num\">16&nbsp;&nbsp;</span>Histograms</a></div><div class=\"lev1\"><a href=\"#Images\"><span class=\"toc-item-num\">17&nbsp;&nbsp;</span>Images</a></div><div class=\"lev1\"><a href=\"#Animations\"><span class=\"toc-item-num\">18&nbsp;&nbsp;</span>Animations</a></div><div class=\"lev1\"><a href=\"#Saving-animations-to-video-files\"><span class=\"toc-item-num\">19&nbsp;&nbsp;</span>Saving animations to video files</a></div><div class=\"lev1\"><a href=\"#What's-next?\"><span class=\"toc-item-num\">20&nbsp;&nbsp;</span>What's next?</a></div>"
]
},
{
@ -102,7 +102,7 @@
"**Note**:\n",
"\n",
"> Matplotlib can output graphs using various backend graphics libraries, such as Tk, wxPython, etc. When running Python using the command line, you may want to specify which backend to use right after importing matplotlib and before plotting anything. For example, to use the Tk backend, run `matplotlib.use(\"TKAgg\")`.\n",
"> However, in a Jupyter notebook, things are easier: importing `import matplotlib.pyplot` automatically registers Jupyter itself as a backend, so the graphs show up directly within the notebook. This used to require running `%matplotlib inline`, so you'll still see it in some notebooks, but it's not needed anymore."
"> However, in a Jupyter notebook, things are easier: importing `import matplotlib.pyplot` automatically registers Jupyter itself as a backend, so the graphs show up directly within the notebook. It used to require running `%matplotlib inline`, so you'll still see it in some notebooks, but it's not needed anymore."
]
},
{
@ -142,7 +142,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The axes automatically match the extent of the data. We would like to give the graph a bit more room, so let's call the `axis` function to change the extent of each axis `[xmin, xmax, ymin, ymax]`."
"The axes automatically match the extent of the data. We would like to give the graph a bit more room, so let's call the `axis` function to change the extent of each axis `[xmin, xmax, ymin, ymax]`."
]
},
{
@ -281,7 +281,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass a 3rd argument to change the line's style and color.\n",
"You can pass the 3rd argument to change the line's style and color.\n",
"For example `\"g--\"` means \"green dashed line\"."
]
},
@ -382,7 +382,7 @@
"metadata": {},
"source": [
"You can also draw simple points instead of lines. Here's an example with green dashes, red dotted line and blue triangles.\n",
"Check out [the documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) for the full list of style & color options."
"Check out [the documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) for the full list of style & color options."
]
},
{
@ -413,7 +413,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot function returns a list of `Line2D` objects (one for each line). You can set extra attributes on these lines, such as the line width, the dash style or the alpha level. See the full list of attributes in [the documentation](http://matplotlib.org/users/pyplot_tutorial.html#controlling-line-properties)."
"The plot function returns a list of `Line2D` objects (one for each line). You can set extra properties on these lines, such as the line width, the dash style or the alpha level. See the full list of properties in [the documentation](https://matplotlib.org/stable/tutorials/introductory/pyplot.html#controlling-line-properties)."
]
},
{
@ -450,7 +450,7 @@
"metadata": {},
"source": [
"# Saving a figure\n",
"Saving a figure to disk is as simple as calling [`savefig`](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.savefig) with the name of the file (or a file object). The available image formats depend on the graphics backend you use."
"Saving a figure to disk is as simple as calling [`savefig`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html) with the name of the file (or a file object). The available image formats depend on the graphics backend you use."
]
},
{
@ -513,7 +513,7 @@
"plt.plot(x, x)\n",
"plt.subplot(2, 2, 2) # 2 rows, 2 columns, 2nd subplot = top right\n",
"plt.plot(x, x**2)\n",
"plt.subplot(2, 2, 3) # 2 rows, 2 columns, 3rd subplot = bottow left\n",
"plt.subplot(2, 2, 3) # 2 rows, 2 columns, 3rd subplot = bottom left\n",
"plt.plot(x, x**3)\n",
"plt.subplot(2, 2, 4) # 2 rows, 2 columns, 4th subplot = bottom right\n",
"plt.plot(x, x**4)\n",
@ -566,7 +566,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need more complex subplot positionning, you can use `subplot2grid` instead of `subplot`. You specify the number of rows and columns in the grid, then your subplot's position in that grid (top-left = (0,0)), and optionally how many rows and/or columns it spans. For example:"
"If you need more complex subplot positioning, you can use `subplot2grid` instead of `subplot`. You specify the number of rows and columns in the grid, then your subplot's position in that grid (top-left = (0,0)), and optionally how many rows and/or columns it spans. For example:"
]
},
{
@ -603,7 +603,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need even more flexibility in subplot positioning, check out the [GridSpec documentation](http://matplotlib.org/users/gridspec.html)"
"If you need even more flexibility in subplot positioning, check out the [corresponding matplotlib tutorial](https://matplotlib.org/stable/tutorials/intermediate/arranging_axes.html)."
]
},
{
@ -781,7 +781,7 @@
"\n",
"Pyplot provides a number of tools to plot graphs, including the state-machine interface to the underlying object-oriented plotting library.\n",
"\n",
"Pylab is a convenience module that imports matplotlib.pyplot and NumPy in a single name space. You will find many examples using pylab, but it is no longer recommended (because *explicit* imports are better than *implicit* ones)."
"Pylab is a convenience module that imports matplotlib.pyplot and NumPy within a single namespace. You will find many examples using pylab, but it is now [strongly discouraged](https://matplotlib.org/stable/api/index.html#module-pylab) (because *explicit* imports are better than *implicit* ones)."
]
},
{
@ -789,7 +789,7 @@
"metadata": {},
"source": [
"# Drawing text\n",
"You can call `text` to add text at any location in the graph. Just specify the horizontal and vertical coordinates and the text, and optionally some extra attributes. Any text in matplotlib may contain TeX equation expressions, see [the documentation](http://matplotlib.org/users/mathtext.html) for more details."
"You can call `text` to add text at any location in the graph. Just specify the horizontal and vertical coordinates and the text, and optionally some extra arguments. Any text in matplotlib may contain TeX equation expressions, see [the documentation](https://matplotlib.org/stable/tutorials/text/mathtext.html) for more details."
]
},
{
@ -832,9 +832,9 @@
"source": [
"* Note: `ha` is an alias for `horizontalalignment`\n",
"\n",
"For more text properties, visit [the documentation](http://matplotlib.org/users/text_props.html#text-properties).\n",
"For more text properties, visit [the documentation](https://matplotlib.org/stable/tutorials/text/text_props.html).\n",
"\n",
"It is quite frequent to annotate elements of a graph, such as the beautiful point above. The `annotate` function makes this easy: just indicate the location of the point of interest, and the position of the text, plus optionally some extra attributes for the text and the arrow."
"Every so often it is required to annotate elements of a graph, such as the beautiful point above. The `annotate` function makes it easy: just indicate the location of the point of interest, and the position of the text, plus optionally some extra arguments for the text and the arrow."
]
},
{
@ -867,7 +867,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also add a bounding box around your text by using the `bbox` attribute:"
"You can also add a bounding box around your text by using the `bbox` argument:"
]
},
{
@ -906,7 +906,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Just for fun, if you want an [xkcd](http://xkcd.com)-style plot, just draw within a `with plt.xkcd()` section:"
"Just for fun, if you want an [xkcd](https://xkcd.com)-style plot, just draw within a `with plt.xkcd()` section:"
]
},
{
@ -979,8 +979,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Non linear scales\n",
"Matplotlib supports non linear scales, such as logarithmic or logit scales."
"# Non-linear scales\n",
"Matplotlib supports non-linear scales, such as logarithmic or logit scales."
]
},
{
@ -1076,10 +1076,9 @@
"metadata": {},
"source": [
"# Ticks and tickers\n",
"The axes have little marks called \"ticks\". To be precise, \"ticks\" are the *locations* of the marks (eg. (-1, 0, 1)), \"tick lines\" are the small lines drawn at those locations, \"tick labels\" are the labels drawn next to the tick lines, and \"tickers\" are objects that are capable of deciding where to place ticks. The default tickers typically do a pretty good job at placing ~5 to 8 ticks at a reasonable distance from one another.\n",
"The axes have little marks called \"ticks\". To be precise, \"ticks\" are the *locations* of the marks (e.g. (-1, 0, 1)), \"tick lines\" are the small lines drawn at those locations, \"tick labels\" are the labels drawn next to the tick lines, and \"tickers\" are objects that are capable of deciding where to place ticks. The default tickers typically do a pretty good job at placing ~5 to 8 ticks at a reasonable distance from one another.\n",
"\n",
"But sometimes you need more control (eg. there are too many tick labels on the logit graph above). Fortunately, matplotlib gives you full control over ticks. You can even activate minor ticks.\n",
"\n"
"But sometimes you need more control (e.g. there are too many tick labels on the logit graph above). Fortunately, matplotlib gives you full control over ticks. You can even activate minor ticks.\n"
]
},
{
@ -1122,10 +1121,8 @@
"ax.xaxis.set_ticks([-2, 0, 1, 2])\n",
"ax.yaxis.set_ticks(np.arange(-5, 5, 1))\n",
"ax.yaxis.set_ticklabels([\"min\", -4, -3, -2, -1, 0, 1, 2, 3, \"max\"])\n",
"plt.title(\"Manual ticks and tick labels\\n(plus minor ticks) on the y-axis\")\n",
"\n",
"\n",
"plt.grid(True)\n",
"plt.title(\"Manual ticks and tick labels\\n(plus minor ticks) on the y-axis\")\n",
"\n",
"plt.show()"
]
@ -1135,7 +1132,7 @@
"metadata": {},
"source": [
"# Polar projection\n",
"Drawing a polar graph is as easy as setting the `projection` attribute to `\"polar\"` when creating the subplot."
"Drawing a polar graph is as easy as setting the `projection` argument to `\"polar\"` when creating the subplot."
]
},
{
@ -1172,7 +1169,7 @@
"source": [
"# 3D projection\n",
"\n",
"Plotting 3D graphs is quite straightforward. You need to import `Axes3D`, which registers the `\"3d\"` projection. Then create a subplot setting the `projection` to `\"3d\"`. This returns an `Axes3DSubplot` object, which you can use to call `plot_surface`, giving x, y, and z coordinates, plus optional attributes."
"Plotting 3D graphs is quite straightforward: when creating a subplot, set the `projection` to `\"3d\"`. It returns a 3D axes object, which you can use to call `plot_surface`, providing x, y, and z coordinates, plus other optional arguments. For more information on generating 3D plots, check out the [matplotlib tutorial](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html)."
]
},
{
@ -1196,8 +1193,6 @@
}
],
"source": [
"from mpl_toolkits.mplot3d import Axes3D\n",
"\n",
"x = np.linspace(-5, 5, 50)\n",
"y = np.linspace(-5, 5, 50)\n",
"X, Y = np.meshgrid(x, y)\n",
@ -1318,7 +1313,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And as usual there are a number of other attributes you can set, such as the fill and edge colors and the alpha level."
"And as usual there are a number of other arguments you can provide, such as the fill and edge colors and the alpha level."
]
},
{
@ -1381,11 +1376,11 @@
}
],
"source": [
"def plot_line(axis, slope, intercept, **kargs):\n",
"def plot_line(axis, slope, intercept, **kwargs):\n",
" xmin, xmax = axis.get_xlim()\n",
" plt.plot([xmin, xmax],\n",
" [xmin*slope+intercept, xmax*slope+intercept],\n",
" **kargs)\n",
" **kwargs)\n",
"\n",
"x = np.random.randn(1000)\n",
"y = 0.5*x + 5 + np.random.randn(1000) * 2\n",
@ -1486,7 +1481,7 @@
"# Images\n",
"Reading, generating and plotting images in matplotlib is quite straightforward.\n",
"\n",
"To read an image, just import the `matplotlib.image` module, and call its `imread` function, passing it the file name (or file object). This returns the image data, as a NumPy array. Let's try this with the `my_square_function.png` image we saved earlier."
"To read an image, just import the `matplotlib.image` module, and call its `imread` function, passing it the file name (or file object). It returns the image data, as a NumPy array. Let's try it with the `my_square_function.png` image we saved earlier."
]
},
{
@ -1513,7 +1508,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have loaded a 288x432 image. Each pixel is represented by a 4-element array: red, green, blue, and alpha levels, stored as 32-bit floats between 0 and 1. Now all we need to do is to call `imshow`:"
"We have loaded a 288x432 image. Each pixel is represented by a 4-element array: red, green, blue, and alpha levels, stored as 32-bit floats between 0 and 1. Now all we need to do is to call `imshow`:"
]
},
{
@ -1546,13 +1541,6 @@
"Tadaaa! You may want to hide the axes when you are displaying an image:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, `imread()` uses the Python Image Library (PIL), and Matplotlib's documentation now recommends using PIL directly:"
]
},
{
"cell_type": "code",
"execution_count": 38,
@ -1577,6 +1565,13 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, `imread()` uses the Python Image Library (PIL), and Matplotlib's documentation now recommends using PIL directly:"
]
},
{
"cell_type": "code",
"execution_count": 39,
@ -1677,7 +1672,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we did not provide RGB levels, the `imshow` function automatically maps values to a color gradient. By default, the color gradient goes from blue (for low values) to yellow (for high values), but you can select another color map. For example:"
"As we did not provide RGB levels, the `imshow` function automatically maps values to a color gradient. By default, the color gradient goes from blue (for low values) to yellow (for high values), but you can select another color map. For example:"
]
},
{
@ -1743,7 +1738,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the `img` array is just quite small (20x30), when the `imshow` function displays it, it grows the image to the figure's size. Imagine stretching the original image, leaving blanks between the original pixels. How does imshow fill the blanks? Well, by default, it just colors each blank pixel using the color of the nearest non-blank pixel. This technique can lead to pixelated images. If you prefer, you can use a different interpolation method, such as [bilinear interpolation](https://en.wikipedia.org/wiki/Bilinear_interpolation) to fill the blank pixels. This leads to blurry edges, which many be nicer in some cases:"
"Since the `img` array is just quite small (20x30), when the `imshow` function displays it, it grows the image to the figure's size. Imagine stretching the original image, leaving blanks between the original pixels. How does imshow fill the blanks? Well, by default, it just colors each blank pixel using the color of the nearest non-blank pixel. This technique can lead to pixelated images. If you prefer, you can use a different interpolation method, such as [bilinear interpolation](https://en.wikipedia.org/wiki/Bilinear_interpolation) to fill the blank pixels. This leads to blurry edges, which may be nicer in some cases:"
]
},
{
@ -1906,7 +1901,7 @@
"metadata": {},
"source": [
"# Saving animations to video files\n",
"Matplotlib relies on 3rd-party libraries to write videos such as [FFMPEG](https://www.ffmpeg.org/) or [ImageMagick](https://imagemagick.org/). In this example we will be using FFMPEG so be sure to install it first. To save the animation to the GIF format, you would need ImageMagick."
"Matplotlib relies on 3rd-party libraries to write videos such as [FFMPEG](https://www.ffmpeg.org/) or [ImageMagick](https://imagemagick.org/). In the following example we will be using FFMPEG so be sure to install it first. To save the animation to the GIF format, you would need ImageMagick."
]
},
{
@ -1924,8 +1919,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What next?\n",
"Now you know all the basics of matplotlib, but there are many more options available. The best way to learn more, is to visit the [gallery](http://matplotlib.org/gallery.html), look at the images, choose a plot that you are interested in, then just copy the code in a Jupyter notebook and play around with it."
"# What's next?\n",
"Now you know all the basics of matplotlib, but there are many more options available. The best way to learn more, is to visit the [gallery](https://matplotlib.org/stable/gallery/index.html), look at the images, choose a plot that you are interested in, then just copy the code in a Jupyter notebook and play around with it."
]
}
],

View File

@ -84,7 +84,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's just as easy to create a 2D array (ie. a matrix) by providing a tuple with the desired number of rows and columns. For example, here's a 3x4 matrix:"
"It's just as easy to create a 2D array (i.e. a matrix) by providing a tuple with the desired number of rows and columns. For example, here's a 3x4 matrix:"
]
},
{
@ -122,7 +122,7 @@
"* An array's list of axis lengths is called the **shape** of the array.\n",
" * For example, the above matrix's shape is `(3, 4)`.\n",
" * The rank is equal to the shape's length.\n",
"* The **size** of an array is the total number of elements, which is the product of all axis lengths (eg. 3*4=12)"
"* The **size** of an array is the total number of elements, which is the product of all axis lengths (e.g. 3*4=12)"
]
},
{
@ -275,7 +275,7 @@
"metadata": {},
"source": [
"## `np.ones`\n",
"Many other NumPy functions create `ndarrays`.\n",
"Many other NumPy functions create `ndarray`s.\n",
"\n",
"Here's a 3x4 matrix full of ones:"
]
@ -368,7 +368,7 @@
"metadata": {},
"source": [
"## np.array\n",
"Of course you can initialize an `ndarray` using a regular python array. Just call the `array` function:"
"Of course, you can initialize an `ndarray` using a regular python array. Just call the `array` function:"
]
},
{
@ -453,7 +453,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course you can provide a step parameter:"
"Of course, you can provide a step parameter:"
]
},
{
@ -480,7 +480,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"However, when dealing with floats, the exact number of elements in the array is not always predictible. For example, consider this:"
"However, when dealing with floats, the exact number of elements in the array is not always predictable. For example, consider this:"
]
},
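A minimal sketch of the point made in this cell, assuming only that NumPy is installed (the variable names are ours, not the notebook's). With a float step, rounding errors decide whether the last value makes it into the array, whereas `np.linspace` takes the number of elements directly:

```python
import numpy as np

start, stop = 0, 5 / 3

# With a float step, rounding errors decide whether the last value is included,
# so the length may change when the step changes by a tiny amount.
for step in (1 / 3, 0.333333333, 0.333333334):
    values = np.arange(start, stop, step)
    print(f"step={step!r}: {len(values)} elements")

# np.linspace takes the number of elements as an argument,
# so the length of the result is known in advance.
evenly_spaced = np.linspace(start, stop, 6)
print(len(evenly_spaced), evenly_spaced)
```

When the number of points matters more than the exact step, `np.linspace` is usually the safer choice.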
{
@ -679,7 +679,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"NumPy first creates three `ndarrays` (one per dimension), each of shape `(3, 2, 10)`. Each array has values equal to the coordinate along a specific axis. For example, all elements in the `z` array are equal to their z-coordinate:\n",
"NumPy first creates three `ndarray`s (one per dimension), each of shape `(3, 2, 10)`. Each array has values equal to the coordinate along a specific axis. For example, all elements in the `z` array are equal to their z-coordinate:\n",
"\n",
" [[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]\n",
@ -770,7 +770,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Available data types include `int8`, `int16`, `int32`, `int64`, `uint8`|`16`|`32`|`64`, `float16`|`32`|`64` and `complex64`|`128`. Check out [the documentation](http://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html) for the full list.\n",
"Available data types include signed `int8`, `int16`, `int32`, `int64`, unsigned `uint8`|`16`|`32`|`64`, `float16`|`32`|`64` and `complex64`|`128`. Check out the documentation for the [basic types](https://numpy.org/doc/stable/user/basics.types.html) and [sized aliases](https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases) for the full list.\n",
"\n",
"## `itemsize`\n",
"The `itemsize` attribute returns the size (in bytes) of each item:"
@ -850,7 +850,7 @@
}
],
"source": [
"if (hasattr(f.data, \"tobytes\")):\n",
"if hasattr(f.data, \"tobytes\"):\n",
" data_bytes = f.data.tobytes() # python 3\n",
"else:\n",
" data_bytes = memoryview(f.data).tobytes() # python 2\n",
@ -862,7 +862,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Several `ndarrays` can share the same data buffer, meaning that modifying one will also modify the others. We will see an example in a minute."
"Several `ndarray`s can share the same data buffer, meaning that modifying one will also modify the others. We will see an example in a minute."
]
},
{
@ -1333,7 +1333,7 @@
"metadata": {},
"source": [
"Broadcasting rules are used in many NumPy operations, not just arithmetic operations, as we will see below.\n",
"For more details about broadcasting, check out [the documentation](https://docs.scipy.org/doc/numpy-dev/user/basics.broadcasting.html)."
"For more details about broadcasting, check out [the documentation](https://numpy.org/doc/stable/user/basics.broadcasting.html)."
]
},
{
@ -1384,7 +1384,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `int16` is required to represent all *possible* `int8` and `uint8` values (from -128 to 255), even though in this case a uint8 would have sufficed."
"Note that `int16` is required to represent all *possible* `int8` and `uint8` values (from -128 to 255), even though in this case a `uint8` would have sufficed."
]
},
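The upcasting rule can be checked with a small sketch (the arrays below are illustrative, not the ones used in the notebook). Adding an `int8` array to a `uint8` array yields an `int16` array, since that is the smallest type able to hold every possible value of both inputs:

```python
import numpy as np

signed = np.array([-128, 0, 127], dtype=np.int8)
unsigned = np.array([0, 100, 255], dtype=np.uint8)

# Neither int8 nor uint8 can represent both -128 and 255,
# so NumPy upcasts the result to int16.
total = signed + unsigned
print(total.dtype)
print(total)
```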
{
@ -2258,15 +2258,15 @@
],
"source": [
"a[3] = 4000\n",
"another_slice # similary, modifying the original array does not affect the slice copy"
"another_slice # similarly, modifying the original array does not affect the slice copy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-dimensional arrays\n",
"Multi-dimensional arrays can be accessed in a similar way by providing an index or slice for each axis, separated by commas:"
"## Multidimensional arrays\n",
"Multidimensional arrays can be accessed in a similar way by providing an index or slice for each axis, separated by commas:"
]
},
{
@ -3119,7 +3119,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This was possible because q1, q2 and q3 all have the same shape (except for the vertical axis, but that's ok since we are stacking on that axis).\n",
"It was possible because q1, q2 and q3 all have the same shape (except for the vertical axis, but that's ok since we are stacking on that axis).\n",
"\n",
"## `hstack`\n",
"We can also stack arrays horizontally using `hstack`:"
@ -3172,7 +3172,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is possible because q1 and q3 both have 3 rows. But since q2 has 4 rows, it cannot be stacked horizontally with q1 and q3:"
"It is possible because q1 and q3 both have 3 rows. But since q2 has 4 rows, it cannot be stacked horizontally with q1 and q3:"
]
},
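Here is a short sketch of the stacking constraints, using placeholder arrays whose shapes are assumed to match the description above (3, 4 and 3 rows, 4 columns each):

```python
import numpy as np

q1 = np.full((3, 4), 1.0)
q2 = np.full((4, 4), 2.0)
q3 = np.full((3, 4), 3.0)

# Vertical stacking only requires the same number of columns.
print(np.vstack((q1, q2, q3)).shape)

# Horizontal stacking requires the same number of rows.
print(np.hstack((q1, q3)).shape)

# q2 has 4 rows, so stacking it horizontally with q1 and q3 fails.
try:
    np.hstack((q1, q2, q3))
    hstack_failed = False
except ValueError as error:
    hstack_failed = True
    print("ValueError:", error)
```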
{
@ -3691,7 +3691,7 @@
"metadata": {},
"source": [
"# Linear algebra\n",
"NumPy 2D arrays can be used to represent matrices efficiently in python. We will just quickly go through some of the main matrix operations available. For more details about Linear Algebra, vectors and matrics, go through the [Linear Algebra tutorial](math_linear_algebra.ipynb).\n",
"NumPy 2D arrays can be used to represent matrices efficiently in python. We will just quickly go through some of the main matrix operations available. For more details about Linear Algebra, vectors and matrices, go through the [Linear Algebra tutorial](math_linear_algebra.ipynb).\n",
"\n",
"## Matrix transpose\n",
"The `T` attribute is equivalent to calling `transpose()` when the rank is ≥2:"
@ -3927,7 +3927,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Caution**: as mentionned previously, `n1*n2` is *not* a matric multiplication, it is an elementwise product (also called a [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)))."
"**Caution**: as mentioned previously, `n1*n2` is *not* a matrix multiplication, it is an elementwise product (also called a [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)))."
]
},
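To make the caution concrete, here is a minimal sketch with two small matrices (the values are ours, chosen so the results are easy to check by hand). The `*` operator multiplies matching entries, while `@` performs the actual matrix multiplication:

```python
import numpy as np

n1 = np.array([[1, 2], [3, 4]])
n2 = np.array([[5, 6], [7, 8]])

elementwise = n1 * n2     # Hadamard product: multiplies matching entries
matrix_product = n1 @ n2  # matrix multiplication (same as n1.dot(n2))

print(elementwise)
print(matrix_product)
```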
{
@ -4019,7 +4019,7 @@
"metadata": {},
"source": [
"## Identity matrix\n",
"The product of a matrix by its inverse returns the identiy matrix (with small floating point errors):"
"The product of a matrix by its inverse returns the identity matrix (with small floating point errors):"
]
},
{
@ -4048,7 +4048,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can create an identity matrix of size NxN by calling `eye`:"
"You can create an identity matrix of size NxN by calling `eye(N)` function:"
]
},
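A small sketch tying the two statements together, assuming an arbitrary invertible 2x2 matrix of our own choosing. Because of floating point errors, the product is compared to `np.eye(2)` with `np.allclose` rather than `==`:

```python
import numpy as np

m = np.array([[1.0, 2.0], [3.0, 4.0]])

# The product of a matrix and its inverse is the identity matrix,
# up to small floating point errors.
product = m @ np.linalg.inv(m)
print(product)

# np.eye(N) builds the exact NxN identity matrix to compare against.
identity = np.eye(2)
print(np.allclose(product, identity))
```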
{
@ -4637,7 +4637,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, both `X` and `Y` are 768x1024 arrays, and all values in `X` correspond to the horizontal coordinate, while all values in `Y` correspond to the the vertical coordinate.\n",
"As you can see, both `X` and `Y` are 768x1024 arrays, and all values in `X` correspond to the horizontal coordinate, while all values in `Y` correspond to the vertical coordinate.\n",
"\n",
"Now we can simply compute the result using array operations:"
]
@ -4678,7 +4678,6 @@
],
"source": [
"import matplotlib.pyplot as plt\n",
"import matplotlib.cm as cm\n",
"\n",
"fig = plt.figure(1, figsize=(7, 6))\n",
"plt.imshow(data, cmap=\"hot\")\n",
@ -4733,7 +4732,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Done! Since the file name contains no file extension was provided, NumPy automatically added `.npy`. Let's take a peek at the file content:"
"Done! Since the file name contains no file extension, NumPy automatically added `.npy`. Let's take a peek at the file content:"
]
},
{
@ -5031,8 +5030,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What next?\n",
"Now you know all the fundamentals of NumPy, but there are many more options available. The best way to learn more is to experiment with NumPy, and go through the excellent [reference documentation](http://docs.scipy.org/doc/numpy/reference/index.html) to find more functions and features you may be interested in."
"# What's next?\n",
"Now you know all the fundamentals of NumPy, but there are many more options available. The best way to learn more is to experiment with NumPy, and go through the excellent [reference documentation](https://numpy.org/doc/stable/reference/index.html) to find more functions and features you may be interested in."
]
}
],

View File

@ -54,7 +54,7 @@
"metadata": {},
"source": [
"# `Series` objects\n",
"The `pandas` library contains these useful data structures:\n",
"The `pandas` library contains the following useful data structures:\n",
"* `Series` objects, that we will discuss now. A `Series` object is 1D array, similar to a column in a spreadsheet (with a column name and row labels).\n",
"* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).\n",
"* `Panel` objects. You can see a `Panel` as a dictionary of `DataFrame`s. These are less used, so we will not discuss them here."
@ -224,7 +224,7 @@
"metadata": {},
"source": [
"## Index labels\n",
"Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the rank of the item in the `Series` (starting at `0`) but you can also set the index labels manually:"
"Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the rank of the item in the `Series` (starting from `0`) but you can also set the index labels manually:"
]
},
{
@ -441,7 +441,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh look! The first element has index label `2`. The element with index label `0` is absent from the slice:"
"Oh, look! The first element has index label `2`. The element with index label `0` is absent from the slice:"
]
},
{
@ -603,7 +603,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value. (ie. Not-a-Number means *missing*).\n",
"The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value (i.e. Not-a-Number means *missing*).\n",
"\n",
"Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items. But if you forget to set the right index labels, you can have surprising results:"
]
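To make the alignment rule concrete, here is a minimal sketch with two hypothetical `Series` (`left` and `right` are our own names and values, not the notebook's `s2` and `s3`). Only the label present on both sides gets a numeric result:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["alice", "bob", "charles"])
right = pd.Series([10, 20], index=["bob", "colin"])

# Items are matched by index label, not by position.
# Labels found on only one side produce NaN.
total = left + right
print(total)
```

Here only `"bob"` exists in both `Series`, so it is the only item with a real value in the result.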
@ -745,7 +745,6 @@
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]\n",
"s7 = pd.Series(temperatures, name=\"Temperature\")\n",
@ -757,7 +756,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) section of pandas' documentation, and look at the example code."
"There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) section of pandas' documentation, and look at the example code."
]
},
{
@ -772,7 +771,7 @@
"* it can handle timezones.\n",
"\n",
"## Time range\n",
"Let's start by creating a time series using `pd.date_range()`. This returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm."
"Let's start by creating a time series using `pd.date_range()`. It returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm."
]
},
{
@ -905,7 +904,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The resampling operation is actually a deferred operation, which is why we did not get a `Series` object, but a `DatetimeIndexResampler` object instead. To actually perform the resampling operation, we can simply call the `mean()` method: Pandas will compute the mean of every pair of consecutive hours:"
"The resampling operation is actually a deferred operation, which is why we did not get a `Series` object, but a `DatetimeIndexResampler` object instead. To actually perform the resampling operation, we can simply call the `mean()` method. Pandas will compute the mean of every pair of consecutive hours:"
]
},
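The deferred behavior can be seen in a short sketch. The data below is made up and starts at midnight so that the 2-hour bins are easy to follow. The frequency strings `"h"` and `"2h"` are an assumption on our side: some older pandas versions may expect `"H"` and `"2H"` instead.

```python
import pandas as pd

hours = pd.date_range("2016-10-29 00:00", periods=4, freq="h")
temperatures = pd.Series([1.0, 2.0, 3.0, 4.0], index=hours)

# resample() alone does not compute anything: it returns a resampler object.
resampler = temperatures.resample("2h")
print(type(resampler).__name__)

# Calling an aggregation method such as mean() performs the actual computation.
means = resampler.mean()
print(means)
```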
{
@ -1020,7 +1019,7 @@
"metadata": {},
"source": [
"## Upsampling and interpolation\n",
"This was an example of downsampling. We can also upsample (ie. increase the frequency), but this creates holes in our data:"
"It was an example of downsampling. We can also upsample (i.e. increase the frequency), but it will create holes in our data:"
]
},
{
@ -1122,7 +1121,7 @@
"metadata": {},
"source": [
"## Timezones\n",
"By default datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize()` method:"
"By default, datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize()` method:"
]
},
{
@ -1162,7 +1161,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `-04:00` is now appended to all the datetimes. This means that these datetimes refer to [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) - 4 hours.\n",
"Note that `-04:00` is now appended to all the datetimes. It means that these datetimes refer to [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) - 4 hours.\n",
"\n",
"We can convert these datetimes to Paris time like this:"
]
@ -1273,7 +1272,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fortunately using the `ambiguous` argument we can tell pandas to infer the right DST (Daylight Saving Time) based on the order of the ambiguous timestamps:"
"Fortunately, by using the `ambiguous` argument we can tell pandas to infer the right DST (Daylight Saving Time) based on the order of the ambiguous timestamps:"
]
},
{
@ -1457,7 +1456,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course we can create a `Series` with a `PeriodIndex`:"
"Of course, we can create a `Series` with a `PeriodIndex`:"
]
},
{
@ -1485,7 +1484,7 @@
}
],
"source": [
"quarterly_revenue = pd.Series([300, 320, 290, 390, 320, 360, 310, 410], index = quarters)\n",
"quarterly_revenue = pd.Series([300, 320, 290, 390, 320, 360, 310, 410], index=quarters)\n",
"quarterly_revenue"
]
},
@ -1514,7 +1513,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert periods to timestamps by calling `to_timestamp`. By default this will give us the first day of each period, but by setting `how` and `freq`, we can get the last hour of each period:"
"We can convert periods to timestamps by calling `to_timestamp`. By default, it will give us the first day of each period, but by setting `how` and `freq`, we can get the last hour of each period:"
]
},
{
@ -1585,7 +1584,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas also provides many other time-related functions that we recommend you check out in the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html). To whet your appetite, here is one way to get the last business day of each month in 2016, at 9am:"
"Pandas also provides many other time-related functions that we recommend you check out in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). To whet your appetite, here is one way to get the last business day of each month in 2016, at 9am:"
]
},
{
@ -1998,7 +1997,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To specify missing values, you can either use `np.nan` or NumPy's masked arrays:"
"To specify missing values, you can use either `np.nan` or NumPy's masked arrays:"
]
},
{
@ -2072,7 +2071,7 @@
}
],
"source": [
"masked_array = np.ma.asarray(values, dtype=np.object)\n",
"masked_array = np.ma.asarray(values, dtype=object)\n",
"masked_array[(0, 2), (1, 2)] = np.ma.masked\n",
"d3 = pd.DataFrame(\n",
" masked_array,\n",
@ -2158,7 +2157,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to create a `DataFrame` with a dictionary (or list) of dictionaries (or list):"
"It is also possible to create a `DataFrame` with a dictionary (or list) of dictionaries (or lists):"
]
},
{
@ -2233,9 +2232,9 @@
],
"source": [
"people = pd.DataFrame({\n",
" \"birthyear\": {\"alice\":1985, \"bob\": 1984, \"charles\": 1992},\n",
" \"hobby\": {\"alice\":\"Biking\", \"bob\": \"Dancing\"},\n",
" \"weight\": {\"alice\":68, \"bob\": 83, \"charles\": 112},\n",
" \"birthyear\": {\"alice\": 1985, \"bob\": 1984, \"charles\": 1992},\n",
" \"hobby\": {\"alice\": \"Biking\", \"bob\": \"Dancing\"},\n",
" \"weight\": {\"alice\": 68, \"bob\": 83, \"charles\": 112},\n",
" \"children\": {\"bob\": 3, \"charles\": 0}\n",
"})\n",
"people"
@ -2333,13 +2332,13 @@
"d5 = pd.DataFrame(\n",
" {\n",
" (\"public\", \"birthyear\"):\n",
" {(\"Paris\",\"alice\"):1985, (\"Paris\",\"bob\"): 1984, (\"London\",\"charles\"): 1992},\n",
" {(\"Paris\",\"alice\"): 1985, (\"Paris\",\"bob\"): 1984, (\"London\",\"charles\"): 1992},\n",
" (\"public\", \"hobby\"):\n",
" {(\"Paris\",\"alice\"):\"Biking\", (\"Paris\",\"bob\"): \"Dancing\"},\n",
" {(\"Paris\",\"alice\"): \"Biking\", (\"Paris\",\"bob\"): \"Dancing\"},\n",
" (\"private\", \"weight\"):\n",
" {(\"Paris\",\"alice\"):68, (\"Paris\",\"bob\"): 83, (\"London\",\"charles\"): 112},\n",
" {(\"Paris\",\"alice\"): 68, (\"Paris\",\"bob\"): 83, (\"London\",\"charles\"): 112},\n",
" (\"private\", \"children\"):\n",
" {(\"Paris\", \"alice\"):np.nan, (\"Paris\",\"bob\"): 3, (\"London\",\"charles\"): 0}\n",
" {(\"Paris\", \"alice\"): np.nan, (\"Paris\",\"bob\"): 3, (\"London\",\"charles\"): 0}\n",
" }\n",
")\n",
"d5"
@ -2839,7 +2838,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (eg. there was no `bob` in `London`).\n",
"Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (e.g. there was no `bob` in `London`).\n",
"\n",
"Calling `unstack()` will do the reverse, once again creating many `NaN` values."
]
@ -3108,7 +3107,7 @@
"metadata": {},
"source": [
"## Most methods return modified copies\n",
"As you may have noticed, the `stack()` and `unstack()` methods do not modify the object they apply to. Instead, they work on a copy and return that copy. This is true of most methods in pandas."
"As you may have noticed, the `stack()` and `unstack()` methods do not modify the object they are called on. Instead, they work on a copy and return that copy. This is true of most methods in pandas."
]
},
{
@ -3479,7 +3478,7 @@
"metadata": {},
"source": [
"## Adding and removing columns\n",
"You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following work fine:"
"You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following works fine:"
]
},
{
@ -3662,7 +3661,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When you add a new colum, it must have the same number of rows. Missing rows are filled with NaN, and extra rows are ignored:"
"When you add a new column, it must have the same number of rows. Missing rows are filled with NaN, and extra rows are ignored:"
]
},
{
@ -3740,7 +3739,7 @@
}
],
"source": [
"people[\"pets\"] = pd.Series({\"bob\": 0, \"charles\": 5, \"eugene\":1}) # alice is missing, eugene is ignored\n",
"people[\"pets\"] = pd.Series({\"bob\": 0, \"charles\": 5, \"eugene\": 1}) # alice is missing, eugene is ignored\n",
"people"
]
},
@ -4077,7 +4076,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assigment calls, but it does not work because the `people` object is not actually modified by the first assignment:"
"Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assignment calls, but it does not work because the `people` object is not actually modified by the first assignment:"
]
},
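One way around this, sketched here on a tiny made-up `DataFrame` (the column names and values are illustrative), is to pass callables to `assign()`. Each callable receives the intermediate `DataFrame`, so a later step can use a column created by an earlier one:

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83], "height": [172, 181]},
    index=["alice", "bob"],
)

# assign() returns a new DataFrame: `people` itself gets no "bmi" column.
with_bmi = people.assign(bmi=people["weight"] / (people["height"] / 100) ** 2)
print("bmi" in people.columns)

# A callable is evaluated on the intermediate DataFrame,
# so the second assign() can refer to the "bmi" column.
result = (
    people
    .assign(bmi=lambda df: df["weight"] / (df["height"] / 100) ** 2)
    .assign(overweight=lambda df: df["bmi"] > 25)
)
print(result)
```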
{
@ -4220,7 +4219,7 @@
"metadata": {},
"source": [
"## Evaluating an expression\n",
"A great feature supported by pandas is expression evaluation. This relies on the `numexpr` library which must be installed."
"A great feature supported by pandas is expression evaluation. It relies on the `numexpr` library which must be installed."
]
},
{
@ -4523,7 +4522,7 @@
"metadata": {},
"source": [
"## Sorting a `DataFrame`\n",
"You can sort a `DataFrame` by calling its `sort_index` method. By default it sorts the rows by their index label, in ascending order, but let's reverse the order:"
"You can sort a `DataFrame` by calling its `sort_index` method. By default, it sorts the rows by their index label, in ascending order, but let's reverse the order:"
]
},
{
@ -4854,7 +4853,8 @@
}
],
"source": [
"people.plot(kind = \"line\", x = \"body_mass_index\", y = [\"height\", \"weight\"])\n",
"people.sort_values(by=\"body_mass_index\", inplace=True)\n",
"people.plot(kind=\"line\", x=\"body_mass_index\", y=[\"height\", \"weight\"])\n",
"plt.show()"
]
},
@ -4884,7 +4884,7 @@
}
],
"source": [
"people.plot(kind = \"scatter\", x = \"height\", y = \"weight\", s=[40, 120, 200])\n",
"people.plot(kind=\"scatter\", x=\"height\", y=\"weight\", s=[40, 120, 200])\n",
"plt.show()"
]
},
@ -4892,7 +4892,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, there are way too many options to list here: the best option is to scroll through the [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code."
"Again, there are way too many options to list here: the best option is to scroll through the [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code."
]
},
{
@ -4900,7 +4900,7 @@
"metadata": {},
"source": [
"## Operations on `DataFrame`s\n",
"Although `DataFrame`s do not try to mimick NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:"
"Although `DataFrame`s do not try to mimic NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:"
]
},
{
@ -4977,8 +4977,8 @@
}
],
"source": [
"grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])\n",
"grades = pd.DataFrame(grades_array, columns=[\"sep\", \"oct\", \"nov\"], index=[\"alice\",\"bob\",\"charles\",\"darwin\"])\n",
"grades_array = np.array([[8, 8, 9], [10, 9, 9], [4, 8, 2], [9, 10, 10]])\n",
"grades = pd.DataFrame(grades_array, columns=[\"sep\", \"oct\", \"nov\"], index=[\"alice\", \"bob\", \"charles\", \"darwin\"])\n",
"grades"
]
},
@ -5322,7 +5322,7 @@
}
],
"source": [
"(grades > 5).all(axis = 1)"
"(grades > 5).all(axis=1)"
]
},
{
@ -5353,7 +5353,7 @@
}
],
"source": [
"(grades == 10).any(axis = 1)"
"(grades == 10).any(axis=1)"
]
},
{
@ -5692,8 +5692,8 @@
}
],
"source": [
"bonus_array = np.array([[0,np.nan,2],[np.nan,1,0],[0, 1, 0], [3, 3, 0]])\n",
"bonus_points = pd.DataFrame(bonus_array, columns=[\"oct\", \"nov\", \"dec\"], index=[\"bob\",\"colin\", \"darwin\", \"charles\"])\n",
"bonus_array = np.array([[0, np.nan, 2], [np.nan, 1, 0], [0, 1, 0], [3, 3, 0]])\n",
"bonus_points = pd.DataFrame(bonus_array, columns=[\"oct\", \"nov\", \"dec\"], index=[\"bob\", \"colin\", \"darwin\", \"charles\"])\n",
"bonus_points"
]
},
@ -5798,7 +5798,7 @@
"## Handling missing data\n",
"Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.\n",
" \n",
"Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values by a any value using the `fillna()` method:"
"Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values by any value using the `fillna()` method:"
]
},
{
@ -6466,7 +6466,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well I guess some teachers probably do). So let's call the `dropna()` method to get rid of rows that are full of `NaN`s:"
"There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well, I guess some teachers probably do). So let's call the `dropna()` method to get rid of rows that are full of `NaN`s:"
]
},
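As a rough illustration of the difference between dropping rows that are *full* of `NaN`s and dropping rows that contain *any* `NaN`, here is a sketch on a small made-up table (not the notebook's `grades` data):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"oct": [8.0, np.nan, 9.0], "nov": [np.nan, np.nan, 10.0]},
    index=["alice", "colin", "darwin"],
)

# how="all" only drops rows in which every value is NaN (here: colin).
only_empty_rows_dropped = scores.dropna(how="all")
print(only_empty_rows_dropped)

# The default (how="any") drops rows containing at least one NaN.
any_nan_rows_dropped = scores.dropna()
print(any_nan_rows_dropped)
```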
{
@ -7263,7 +7263,7 @@
}
],
"source": [
"pd.pivot_table(more_grades, index=\"name\", values=[\"grade\",\"bonus\"], aggfunc=np.max)"
"pd.pivot_table(more_grades, index=\"name\", values=[\"grade\", \"bonus\"], aggfunc=np.max)"
]
},
{
@ -9246,7 +9246,7 @@
"much_data = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))\n",
"large_df = pd.DataFrame(much_data, columns=list(\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"))\n",
"large_df[large_df % 16 == 0] = np.nan\n",
"large_df.insert(3,\"some_text\", \"Blabla\")\n",
"large_df.insert(3, \"some_text\", \"Blabla\")\n",
"large_df"
]
},
@ -9463,7 +9463,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course there's also a `tail()` function to view the bottom 5 rows. You can pass the number of rows you want:"
"Of course, there's also a `tail()` function to view the bottom 5 rows. You can pass the number of rows you want:"
]
},
{
@ -9594,7 +9594,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `info()` method prints out a summary of each columns contents:"
"The `info()` method prints out a summary of each column's contents:"
]
},
{
@ -10041,7 +10041,7 @@
"source": [
"my_df = pd.DataFrame(\n",
" [[\"Biking\", 68.5, 1985, np.nan], [\"Dancing\", 83.1, 1984, 3]], \n",
" columns=[\"hobby\",\"weight\",\"birthyear\",\"children\"],\n",
" columns=[\"hobby\", \"weight\", \"birthyear\", \"children\"],\n",
" index=[\"alice\", \"bob\"]\n",
")\n",
"my_df"
@ -10239,7 +10239,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you might guess, there are similar `read_json`, `read_html`, `read_excel` functions as well. We can also read data straight from the Internet. For example, let's load the top 1,000 U.S. cities from github:"
"As you might guess, there are similar `read_json`, `read_html`, `read_excel` functions as well. We can also read data straight from the Internet. For example, let's load the top 1,000 U.S. cities from GitHub:"
]
},
{
@ -10351,7 +10351,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are more options available, in particular regarding datetime format. Check out the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for more details."
"There are more options available, in particular regarding datetime format. Check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for more details."
]
},
{
@ -10361,7 +10361,7 @@
"# Combining `DataFrame`s\n",
"\n",
"## SQL-like joins\n",
"One powerful feature of pandas is it's ability to perform SQL-like joins on `DataFrame`s. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple simple `DataFrame`s:"
"One powerful feature of pandas is its ability to perform SQL-like joins on `DataFrame`s. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple of simple `DataFrame`s:"
]
},
{
@ -10761,7 +10761,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course `LEFT OUTER JOIN` is also available by setting `how=\"left\"`: only the cities present in the left `DataFrame` end up in the result. Similarly, with `how=\"right\"` only cities in the right `DataFrame` appear in the result. For example:"
"Of course, `LEFT OUTER JOIN` is also available by setting `how=\"left\"`: only the cities present in the left `DataFrame` end up in the result. Similarly, with `how=\"right\"` only cities in the right `DataFrame` appear in the result. For example:"
]
},
{
@ -11101,7 +11101,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows having the same index (eg. 3). Pandas handles this rather gracefully:"
"Note that this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows having the same index (e.g. 3). Pandas handles this rather gracefully:"
]
},
{
@ -11573,7 +11573,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case it really does not make much sense because the indices do not align well (eg. Cleveland and San Francisco end up on the same row, because they shared the index label `3`). So let's reindex the `DataFrame`s by city name before concatenating:"
"In this case it really does not make much sense because the indices do not align well (e.g. Cleveland and San Francisco end up on the same row, because they shared the index label `3`). So let's reindex the `DataFrame`s by city name before concatenating:"
]
},
{
@ -11690,152 +11690,6 @@
"This looks a lot like a `FULL OUTER JOIN`, except that the `state` columns were not renamed to `state_x` and `state_y`, and the `city` column is now the index."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `append()` method is a useful shorthand for concatenating `DataFrame`s vertically:"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>lat</th>\n",
" <th>lng</th>\n",
" <th>population</th>\n",
" <th>state</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>San Francisco</td>\n",
" <td>37.781334</td>\n",
" <td>-122.416728</td>\n",
" <td>NaN</td>\n",
" <td>CA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>New York</td>\n",
" <td>40.705649</td>\n",
" <td>-74.008344</td>\n",
" <td>NaN</td>\n",
" <td>NY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Miami</td>\n",
" <td>25.791100</td>\n",
" <td>-80.320733</td>\n",
" <td>NaN</td>\n",
" <td>FL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Cleveland</td>\n",
" <td>41.473508</td>\n",
" <td>-81.739791</td>\n",
" <td>NaN</td>\n",
" <td>OH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Salt Lake City</td>\n",
" <td>40.755851</td>\n",
" <td>-111.896657</td>\n",
" <td>NaN</td>\n",
" <td>UT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>San Francisco</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>808976.0</td>\n",
" <td>California</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>New York</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>8363710.0</td>\n",
" <td>New-York</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Miami</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>413201.0</td>\n",
" <td>Florida</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Houston</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2242193.0</td>\n",
" <td>Texas</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" city lat lng population state\n",
"0 San Francisco 37.781334 -122.416728 NaN CA\n",
"1 New York 40.705649 -74.008344 NaN NY\n",
"2 Miami 25.791100 -80.320733 NaN FL\n",
"3 Cleveland 41.473508 -81.739791 NaN OH\n",
"4 Salt Lake City 40.755851 -111.896657 NaN UT\n",
"3 San Francisco NaN NaN 808976.0 California\n",
"4 New York NaN NaN 8363710.0 New-York\n",
"5 Miami NaN NaN 413201.0 Florida\n",
"6 Houston NaN NaN 2242193.0 Texas"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"city_loc.append(city_pop)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As always in pandas, the `append()` method does *not* actually modify `city_loc`: it works on a copy and returns the modified copy."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -12149,8 +12003,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What next?\n",
"As you probably noticed by now, pandas is quite a large library with *many* features. Although we went through the most important features, there is still a lot to discover. Probably the best way to learn more is to get your hands dirty with some real-life data. It is also a good idea to go through pandas' excellent [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html), in particular the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html)."
"# What's next?\n",
"As you probably noticed by now, pandas is quite a large library with *many* features. Although we went through the most important features, there is still a lot to discover. Probably the best way to learn more is to get your hands dirty with some real-life data. It is also a good idea to go through pandas' excellent [documentation](https://pandas.pydata.org/pandas-docs/stable/index.html), in particular the [Cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html)."
]
},
{