**Chapter 1 – The Machine Learning landscape**

_This notebook contains the code example from chapter 1, as well as all the code used to generate `lifesat.csv` and some of the chapter's figures._

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-ml2/blob/master/01_the_machine_learning_landscape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml2/blob/master/01_the_machine_learning_landscape.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Code example 1-1

In [1]:
# Python ≥3.7 is required
import sys
assert sys.version_info >= (3, 7)

In [2]:
import numpy as np

# Make this notebook's output stable across runs
np.random.seed(42)

In [3]:
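# Scikit-Learn ≥1.0 is required
import sklearn
from packaging import version

# Compare parsed versions: plain string comparison is fragile (e.g., "1.10" < "1.2")
assert version.parse(sklearn.__version__) >= version.parse("1.0")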

In [4]:
# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib as mpl

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [5]:
# Download the data
from pathlib import Path
import urllib.request

datapath = Path() / "datasets" / "lifesat"
datapath.mkdir(parents=True, exist_ok=True)

root = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
filename = "lifesat.csv"
if not (datapath / filename).is_file():
    print("Downloading", filename)
    url = root + "datasets/lifesat/" + filename
    urllib.request.urlretrieve(url, datapath / filename)

In [6]:
# Code example
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
lifesat = pd.read_csv(Path() / "datasets" / "lifesat" / "lifesat.csv")
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat[["Life satisfaction"]].values

# Visualize the data
lifesat.plot(kind='scatter',
             x="GDP per capita (USD)", y='Life satisfaction')
plt.axis([23_500, 62_500, 4, 9])
plt.grid(True)
plt.show()

# Select a linear model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[37_655.2]]  # Cyprus' GDP per capita in 2020
print(model.predict(X_new)) # outputs [[6.30165767]]

Replacing the Linear Regression model with k-Nearest Neighbors regression (in this example, k = 3) in the previous code is as simple as replacing these two lines:

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```

with these two:

```python
import sklearn.neighbors
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
```

In [7]:
# Select a 3-Nearest Neighbors regression model
import sklearn.neighbors

model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
print(model.predict(X_new)) # outputs [[6.33333333]]
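
To make it clearer what the 3-Nearest Neighbors regressor computes, here is a minimal sketch on made-up numbers (the `X_toy` and `y_toy` values are invented for illustration and are not taken from `lifesat.csv`). With Scikit-Learn's default uniform weights, the prediction is simply the mean target of the 3 closest training instances:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up): GDP per capita -> life satisfaction
X_toy = np.array([[20_000.0], [30_000.0], [35_000.0], [40_000.0], [60_000.0]])
y_toy = np.array([5.0, 6.0, 6.2, 6.8, 7.5])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_toy, y_toy)

x_query = np.array([[37_000.0]])
sk_pred = knn.predict(x_query)[0]

# Manual version: find the 3 closest training points and average their targets
distances = np.abs(X_toy[:, 0] - x_query[0, 0])
nearest = np.argsort(distances)[:3]
manual_pred = y_toy[nearest].mean()

print(sk_pred, manual_pred)
```

Both values should agree: the three nearest points have targets 6.0, 6.2 and 6.8, whose mean is about 6.33.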


# Note: you can safely ignore the rest of this notebook; it just generates many of the figures in chapter 1.

Create a function to save the figures:

In [8]:
# Where to save the figures
IMAGES_PATH = Path() / "images" / "fundamentals"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Load and prepare Life satisfaction data

To create `lifesat.csv`, I downloaded the Better Life Index (BLI) data from [OECD's website](http://stats.oecd.org/index.aspx?DataSetCode=BLI) (to get the Life Satisfaction for each country), and World Bank GDP per capita data from [OurWorldInData.org](https://ourworldindata.org/grapher/gdp-per-capita-worldbank). The BLI data is in `datasets/lifesat/oecd_bli.csv` (data from 2020), and the GDP per capita data is in `datasets/lifesat/gdp_per_capita.csv` (data up to 2020).

If you want to grab the latest versions, please feel free to do so. However, there may be some changes (e.g., different column names, or different countries with missing data), so be prepared to tweak the code.
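
If you do download newer files, one way to catch renamed columns early is to check for the columns the code later in this notebook reads from the OECD file (`Country`, `Indicator`, `INEQUALITY` and `Value`). The sketch below is only an illustration: `missing_columns()` is a hypothetical helper, and the toy DataFrame (including its `Obs Value` column) is made up to simulate a renamed column.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Columns that the data preparation code reads from the OECD file
OECD_REQUIRED = ["Country", "Indicator", "INEQUALITY", "Value"]

# Made-up frame standing in for a newer download where "Value" was renamed
toy_bli = pd.DataFrame({
    "Country": ["France"],
    "Indicator": ["Life satisfaction"],
    "INEQUALITY": ["TOT"],
    "Obs Value": [6.5],
})

missing = missing_columns(toy_bli, OECD_REQUIRED)
print("Missing columns:", missing)
```

An empty list means the expected columns are all present; anything else tells you which names to update.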

In [9]:
for filename in ("oecd_bli.csv", "gdp_per_capita.csv"):
    if not (datapath / filename).is_file():
        print("Downloading", filename)
        url = root + "datasets/lifesat/" + filename
        urllib.request.urlretrieve(url, datapath / filename)

In [10]:
oecd_bli = pd.read_csv(datapath / "oecd_bli.csv")
gdp_per_capita = pd.read_csv(datapath / "gdp_per_capita.csv")

This function just merges the OECD's life satisfaction data and the World Bank's GDP per capita data:

In [11]:
def prepare_country_stats(oecd_bli, gdp_per_capita):
    gdp_year = 2020
    gdppc = "GDP per capita (USD)"
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita = gdp_per_capita[gdp_per_capita["Year"] == gdp_year]
    gdp_per_capita = gdp_per_capita.drop(["Code", "Year"], axis=1)
    gdp_per_capita.columns = ["Country", gdppc]
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by=gdppc, inplace=True)
    return full_country_stats[[gdppc, 'Life satisfaction']]
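
The two key steps in this function are the pivot (long format to one column per indicator) and the index-based merge. Here is a small sketch of both on made-up data; the country names `A`, `B`, `D`, the indicator values and the `MN` inequality code are placeholders chosen for illustration only.

```python
import pandas as pd

# Made-up long-format data mimicking the layout of oecd_bli.csv
toy_bli = pd.DataFrame({
    "Country":    ["A", "A", "A", "B", "B"],
    "Indicator":  ["Life satisfaction", "Life satisfaction", "Employment rate",
                   "Life satisfaction", "Employment rate"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "TOT"],
    "Value":      [6.0, 6.1, 70.0, 7.0, 65.0],
})
# Made-up GDP data; country "D" has no life satisfaction value
toy_gdp = pd.DataFrame({"GDP per capita (USD)": [50_000.0, 30_000.0, 40_000.0]},
                       index=pd.Index(["A", "B", "D"], name="Country"))

# Keep one row per (country, indicator); duplicates would make pivot() fail
toy_bli = toy_bli[toy_bli["INEQUALITY"] == "TOT"]
wide = toy_bli.pivot(index="Country", columns="Indicator", values="Value")

# pd.merge() defaults to an inner join, so "D" is dropped
merged = pd.merge(left=wide, right=toy_gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita (USD)")
print(merged[["GDP per capita (USD)", "Life satisfaction"]])
```

Because the merge is an inner join on the index, only countries present in both tables survive, which is why some countries may silently disappear if the two sources spell their names differently.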

In [12]:
full_country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
full_country_stats.to_csv(datapath / "lifesat.csv")

To illustrate the risk of overfitting, I use only part of the data in most figures (all countries with a GDP per capita between `min_gdp` and `max_gdp`). Later in the chapter I reveal the missing countries, and show that they don't follow the same linear trend at all.

In [13]:
gdppc = "GDP per capita (USD)"
min_gdp = 23_500
max_gdp = 62_500
country_stats = full_country_stats[(full_country_stats[gdppc] >= min_gdp) &
                                   (full_country_stats[gdppc] <= max_gdp)]
country_stats.head()
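
To see why a restricted range can be misleading, consider the following sketch on purely synthetic data (it is not the OECD data, and the saturating relationship is an assumption made up for this illustration). Life satisfaction rises linearly with GDP per capita up to 60,000, then levels off. A line fitted only on the lower range extrapolates badly:

```python
import numpy as np

# Synthetic "world": satisfaction rises with GDP, then levels off at 60,000
gdp = np.arange(25_000, 110_001, 5_000, dtype=float)
life_sat = 5 + 4e-5 * np.minimum(gdp, 60_000)

# Fit a line on the restricted range only, then on all the data
in_range = gdp <= 62_500
slope_part, intercept_part = np.polyfit(gdp[in_range], life_sat[in_range], deg=1)
slope_full, intercept_full = np.polyfit(gdp, life_sat, deg=1)

rich = 110_000
pred_part = intercept_part + slope_part * rich
print(f"partial fit: slope = {slope_part:.2e}, prediction at {rich:,} = {pred_part:.2f}")
print(f"full fit:    slope = {slope_full:.2e}")
print(f"true value at {rich:,} = {life_sat[-1]:.2f}")
```

On this toy data the partial fit predicts about 9.4 at a GDP per capita of 110,000, whereas the synthetic true value is 7.4.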

In [14]:
country_stats.plot(kind='scatter', figsize=(5,3),
                   x="GDP per capita (USD)", y='Life satisfaction')

min_life_sat = 4
max_life_sat = 9

plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])
position_text = {
    "Hungary": (28_000, 4.2),
    "France": (40_000, 5),
    "New Zealand": (30_000, 8),
    "Australia": (50_000, 5.5),
    "United States": (59_000, 5.5),
    "Denmark": (46_000, 8.5)
}

for country, pos_text in position_text.items():
    pos_data_x, pos_data_y = country_stats[["GDP per capita (USD)",
                                            "Life satisfaction"]].loc[country]
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', width=0.5, shrink=0.2,
                            headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "ro")

plt.grid(True)

save_fig('money_happy_scatterplot')
plt.show()

In [15]:
highlighted_countries = country_stats.loc[list(position_text.keys())]
highlighted_countries[["Life satisfaction"]].sort_values(by="Life satisfaction")

In [16]:
import numpy as np

country_stats.plot(kind='scatter', figsize=(5,3),
                   x="GDP per capita (USD)", y='Life satisfaction')
plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])

X = np.linspace(min_gdp, max_gdp, 1000)

w1, w2 = 4.2, 0
plt.plot(X, w1 + w2 * 1e-5 * X, "r")
plt.text(40_000, 4.9, fr"$\theta_0 = {w1}$",
         fontsize=14, color="r")
plt.text(40_000, 4.4, fr"$\theta_1 = {w2}$",
         fontsize=14, color="r")

w1, w2 = 10, -9
plt.plot(X, w1 + w2 * 1e-5 * X, "g")
plt.text(26_000, 8.5, fr"$\theta_0 = {w1}$",
         fontsize=14, color="g")
plt.text(26_000, 8.0, fr"$\theta_1 = {w2} \times 10^{{-5}}$",
         fontsize=14, color="g")

w1, w2 = 3, 8
plt.plot(X, w1 + w2 * 1e-5 * X, "b")
plt.text(48_000, 8.5, fr"$\theta_0 = {w1}$",
         fontsize=14, color="b")
plt.text(48_000, 8.0, fr"$\theta_1 = {w2} \times 10^{{-5}}$",
         fontsize=14, color="b")
plt.grid(True)

save_fig('tweaking_model_params_plot')
plt.show()

In [17]:
from sklearn import linear_model

X_sample = country_stats[["GDP per capita (USD)"]].values
y_sample = country_stats[["Life satisfaction"]].values

lin1 = linear_model.LinearRegression()
lin1.fit(X_sample, y_sample)

t0, t1 = lin1.intercept_[0], lin1.coef_[0][0]
t0, t1
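
For a single feature, the least-squares parameters have a simple closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept makes the line pass through the means. The sketch below checks this against `LinearRegression` on made-up points (the values are invented for illustration, not taken from `lifesat.csv`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points, for illustration only
x = np.array([25_000.0, 30_000.0, 40_000.0, 50_000.0, 60_000.0])
y = np.array([5.6, 5.9, 6.4, 7.1, 7.3])

# Closed-form simple linear regression
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()

lin = LinearRegression().fit(x.reshape(-1, 1), y)
print(theta0, lin.intercept_)
print(theta1, lin.coef_[0])
```

Here `y` is 1D, so `intercept_` is a scalar and `coef_` has shape `(1,)`. The cell above passes a 2D `y`, which is why it reads `lin1.intercept_[0]` and `lin1.coef_[0][0]` instead.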

In [18]:
country_stats.plot(kind='scatter', figsize=(5,3),
                   x="GDP per capita (USD)", y='Life satisfaction')
plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])

X = np.linspace(min_gdp, max_gdp, 1000)
plt.plot(X, t0 + t1 * X, "b")

plt.text(max_gdp - 20_000, min_life_sat + 1.5,
         fr"$\theta_0 = {t0:.2f}$",
         fontsize=14, color="b")
plt.text(max_gdp - 20_000, min_life_sat + 1,
         fr"$\theta_1 = {t1 * 1e5:.2f} \times 10^{{-5}}$",
         fontsize=14, color="b")
plt.grid(True)

save_fig('best_fit_model_plot')
plt.show()

In [19]:
gdp_year = 2020
gdp_per_capita_clean = gdp_per_capita[gdp_per_capita["Year"] == gdp_year]
gdp_per_capita_clean = gdp_per_capita_clean.drop(["Code", "Year"], axis=1)
gdp_per_capita_clean.columns = ["Country", "GDP per capita (USD)"]
gdp_per_capita_clean.set_index("Country", inplace=True)

In [20]:
cyprus_gdp_per_capita = gdp_per_capita_clean.loc["Cyprus"]["GDP per capita (USD)"]
print(cyprus_gdp_per_capita)
cyprus_predicted_life_satisfaction = lin1.predict([[cyprus_gdp_per_capita]])[0, 0]
cyprus_predicted_life_satisfaction

In [21]:
country_stats.plot(kind='scatter', figsize=(5,3),
                   x="GDP per capita (USD)", y='Life satisfaction')

X = np.linspace(min_gdp, max_gdp, 1000)
plt.plot(X, t0 + t1 * X, "b")

plt.text(min_gdp + 15_000, max_life_sat - 1.5,
         fr"$\theta_0 = {t0:.2f}$",
         fontsize=14, color="b")
plt.text(min_gdp + 15_000, max_life_sat - 1,
         fr"$\theta_1 = {t1 * 1e5:.2f} \times 10^{{-5}}$",
         fontsize=14, color="b")

plt.plot([cyprus_gdp_per_capita, cyprus_gdp_per_capita],
         [min_life_sat, cyprus_predicted_life_satisfaction], "r--")
plt.text(cyprus_gdp_per_capita + 1000, 5.0,
         fr"Prediction = {cyprus_predicted_life_satisfaction:.2f}",
         fontsize=14, color="r")
plt.plot(cyprus_gdp_per_capita, cyprus_predicted_life_satisfaction, "ro")

plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])
plt.grid(True)

save_fig('cyprus_prediction_plot')
plt.show()

In [22]:
missing_data = full_country_stats[(full_country_stats[gdppc] < min_gdp) |
                                  (full_country_stats[gdppc] > max_gdp)]
missing_data

In [23]:
position_text2 = {
    "South Africa": (20_000, 4.2),
    "Colombia": (6_000, 8.2),
    "Brazil": (18_000, 7.8),
    "Mexico": (24_000, 7.4),
    "Chile": (30_000, 7.0),
    "Norway": (60_000, 6.2),
    "Switzerland": (65_000, 5.7),
    "Ireland": (80_000, 5.5),
    "Luxembourg": (100_000, 5.0),
}

In [24]:
full_country_stats.plot(kind='scatter', figsize=(8,3),
                        x="GDP per capita (USD)", y='Life satisfaction')

for country, pos_text in position_text2.items():
    pos_data_x, pos_data_y = missing_data.loc[country]
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', width=0.5, shrink=0.1,
                            headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "rs")

X = np.linspace(0, 115_000, 1000)
plt.plot(X, t0 + t1 * X, "b:")

lin_reg_full = linear_model.LinearRegression()
Xfull = np.c_[full_country_stats["GDP per capita (USD)"]]
yfull = np.c_[full_country_stats["Life satisfaction"]]
lin_reg_full.fit(Xfull, yfull)

t0full, t1full = lin_reg_full.intercept_[0], lin_reg_full.coef_[0][0]
X = np.linspace(0, 115_000, 1000)
plt.plot(X, t0full + t1full * X, "k")

plt.axis([0, 115_000, min_life_sat, max_life_sat])
plt.grid(True)

save_fig('representative_training_data_scatterplot')
plt.show()

In [25]:
from sklearn import preprocessing
from sklearn import pipeline

full_country_stats.plot(kind='scatter', figsize=(8,3),
                        x="GDP per capita (USD)", y='Life satisfaction')
plt.axis([0, 115_000, min_life_sat, max_life_sat])

poly = preprocessing.PolynomialFeatures(degree=10, include_bias=False)
scaler = preprocessing.StandardScaler()
lin_reg2 = linear_model.LinearRegression()

pipeline_reg = pipeline.Pipeline([
    ('poly', poly),
    ('scal', scaler),
    ('lin', lin_reg2)])
pipeline_reg.fit(Xfull, yfull)
curve = pipeline_reg.predict(X[:, np.newaxis])
plt.plot(X, curve)
plt.grid(True)

save_fig('overfitting_model_plot')
plt.show()
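
A high-degree polynomial can always fit the training data at least as well as a straight line, but that says nothing about new data. The following sketch compares training and test errors on synthetic data (a linear trend plus Gaussian noise, generated just for this illustration; the sizes, noise level and seed are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

def make_data(n):
    # Synthetic data: a linear trend plus noise (made up for illustration)
    x = rng.uniform(0, 1, size=(n, 1))
    y = 5 + 2 * x[:, 0] + rng.normal(scale=0.3, size=n)
    return x, y

X_train, y_train = make_data(15)
X_test, y_test = make_data(200)

models = {
    "linear": LinearRegression(),
    "degree 10": make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                               StandardScaler(),
                               LinearRegression()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[name] = (train_mse, test_mse)
    print(f"{name:>9}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The degree 10 model's training error cannot be higher than the linear model's, since it contains the linear model as a special case. Its test error is typically worse on small noisy datasets like this one, which is the signature of overfitting.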

In [26]:
w_countries = [c for c in full_country_stats.index if "W" in c.upper()]
full_country_stats.loc[w_countries]["Life satisfaction"]

In [27]:
all_w_countries = [c for c in gdp_per_capita_clean.index if "W" in c.upper()]
gdp_per_capita_clean.loc[all_w_countries].sort_values(by=gdppc)

In [28]:
plt.figure(figsize=(8,3))

plt.xlabel("GDP per capita (USD)")
plt.ylabel('Life satisfaction')

plt.plot(list(country_stats["GDP per capita (USD)"]),
         list(country_stats["Life satisfaction"]), "bo")
plt.plot(list(missing_data["GDP per capita (USD)"]),
         list(missing_data["Life satisfaction"]), "rs")

X = np.linspace(0, 115_000, 1000)
plt.plot(X, t0full + t1full * X, "r--", label="Linear model on all data")
plt.plot(X, t0 + t1*X, "b:", label="Linear model on partial data")

ridge = linear_model.Ridge(alpha=10**9.5)
Xsample = country_stats[["GDP per capita (USD)"]]
ysample = country_stats[["Life satisfaction"]]
ridge.fit(Xsample, ysample)
t0ridge, t1ridge = ridge.intercept_[0], ridge.coef_[0][0]
plt.plot(X, t0ridge + t1ridge * X, "b", label="Regularized linear model on partial data")

plt.legend(loc="lower right")
plt.axis([0, 115_000, min_life_sat, max_life_sat])
plt.grid(True)

save_fig('ridge_model_plot')
plt.show()
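
To get a feel for what the regularization hyperparameter does, here is a sketch on made-up points (invented for illustration). With a single feature and an unpenalized intercept, the Ridge slope equals `s_xy / (s_xx + alpha)`, where `s_xx` and `s_xy` are the sums of squared deviations and cross deviations, so increasing `alpha` shrinks the slope toward zero:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up points, for illustration only
x_toy = np.array([25_000.0, 30_000.0, 40_000.0, 50_000.0, 60_000.0])
y_toy = np.array([5.6, 5.9, 6.4, 7.1, 7.3])
X_toy = x_toy.reshape(-1, 1)

s_xx = np.sum((x_toy - x_toy.mean()) ** 2)
s_xy = np.sum((x_toy - x_toy.mean()) * (y_toy - y_toy.mean()))

alphas = [1.0, 1e8, 10**9.5, 1e11]
ridge_slopes = []
closed_form_slopes = []
for alpha in alphas:
    ridge_toy = Ridge(alpha=alpha).fit(X_toy, y_toy)
    ridge_slopes.append(ridge_toy.coef_[0])
    closed_form_slopes.append(s_xy / (s_xx + alpha))
    print(f"alpha = {alpha:.3g}: slope = {ridge_slopes[-1]:.3e}, "
          f"closed form = {closed_form_slopes[-1]:.3e}")
```

Since the GDP feature is not scaled, `s_xx` is very large (about 8.2e8 on this toy data), so `alpha` only has a visible effect once it reaches a comparable magnitude. This is presumably why the plot above uses a value as large as `10**9.5`.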