Add slides and exercises from third block

This commit is contained in:
Michael Schären 2026-06-05 15:51:04 +02:00
parent 1e4841948c
commit d402fc8b4e
18 changed files with 246933 additions and 0 deletions

File diff suppressed because it is too large


@ -0,0 +1,520 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Quality\n",
"## Author: Dr. Yves Staudt\n",
"In this unit we describe the quality of the dataset, focusing on a few specific points. We objectively describe the size of the dataset and the number of missing values, and we add a brief subjective assessment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Packages "
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T11:41:51.610970269Z",
"start_time": "2026-06-04T11:41:51.593419401Z"
}
},
"source": [
"import pandas as pd"
],
"outputs": [],
"execution_count": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:23:49.110409815Z",
"start_time": "2026-06-04T12:23:49.061592247Z"
}
},
"source": "df = pd.read_csv(\"../data/lego/lego_daten_tidy.csv\")",
"outputs": [],
"execution_count": 38
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:23:50.188281347Z",
"start_time": "2026-06-04T12:23:50.136528015Z"
}
},
"source": [
"df.head()"
],
"outputs": [
{
"data": {
"text/plain": [
" name price \\\n",
"0 LEGO Harry Potter Schloss Hogwarts™ mit Schlos... 179.00 \n",
"1 LEGO Disney - Kamera Hommage an Walt Disney ... 119.00 \n",
"2 LEGO Technic - Audi RS Q e-tron (42160) 159.00 \n",
"3 LEGO Technic - Lamborghini Huracán Tecnica (42... 59.95 \n",
"4 LEGO City - Personen-Schnellzug (60337) 169.00 \n",
"\n",
" link_picture Produkttyp \\\n",
"0 https://www.fcw.ch/api/ProcessRequest/191538/B... Bausatz \n",
"1 https://www.fcw.ch/api/ProcessRequest/191541/B... Bausatz \n",
"2 https://www.fcw.ch/api/ProcessRequest/190646/B... Bausatz \n",
"3 https://www.fcw.ch/api/ProcessRequest/190611/B... Bausatz \n",
"4 https://www.fcw.ch/api/ProcessRequest/138905/B... Bausatz \n",
"\n",
" Vorgeschlagenes Geschlecht Empfohlenes Alter in Jahren (mind.) \\\n",
"0 Junge/Mädchen 18.0 \n",
"1 Junge/Mädchen 18.0 \n",
"2 Junge/Mädchen 10.0 \n",
"3 Junge/Mädchen 9.0 \n",
"4 Junge/Mädchen 7.0 \n",
"\n",
" Empfohlenes Alter in Jahren (max.) Anzahl Teile Sound-Effekte \\\n",
"0 99.0 2660.0 \n",
"1 99.0 811.0 \n",
"2 99.0 914.0 \n",
"3 99.0 806.0 \n",
"4 99.0 764.0 \n",
"\n",
" Produktfarbe ... Anzahl Produkte pro Versandkarton \\\n",
"0 Mehrfarbig ... NaN \n",
"1 Mehrfarbig ... NaN \n",
"2 Mehrfarbig ... NaN \n",
"3 Mehrfarbig ... NaN \n",
"4 Mehrfarbig ... NaN \n",
"\n",
" Hauptkarton GTIN (EAN/UPC) Lagenanzahl pro Palette \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Produkte pro Palettenlage Fernbedienung erforderlich \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Warnung vor Erstickungsgefahr Fernbedienung enthalten \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Akkus/Batterien enthalten null \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" LegoCategory \n",
"0 LEGO Harry Potter Schloss Hogwarts™ mit Schlos... \n",
"1 LEGO Disney \n",
"2 LEGO Technic \n",
"3 LEGO Technic \n",
"4 LEGO City \n",
"\n",
"[5 rows x 47 columns]"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>price</th>\n",
" <th>link_picture</th>\n",
" <th>Produkttyp</th>\n",
" <th>Vorgeschlagenes Geschlecht</th>\n",
" <th>Empfohlenes Alter in Jahren (mind.)</th>\n",
" <th>Empfohlenes Alter in Jahren (max.)</th>\n",
" <th>Anzahl Teile</th>\n",
" <th>Sound-Effekte</th>\n",
" <th>Produktfarbe</th>\n",
" <th>...</th>\n",
" <th>Anzahl Produkte pro Versandkarton</th>\n",
" <th>Hauptkarton GTIN (EAN/UPC)</th>\n",
" <th>Lagenanzahl pro Palette</th>\n",
" <th>Produkte pro Palettenlage</th>\n",
" <th>Fernbedienung erforderlich</th>\n",
" <th>Warnung vor Erstickungsgefahr</th>\n",
" <th>Fernbedienung enthalten</th>\n",
" <th>Akkus/Batterien enthalten</th>\n",
" <th>null</th>\n",
" <th>LegoCategory</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>LEGO Harry Potter Schloss Hogwarts™ mit Schlos...</td>\n",
" <td>179.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/191538/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>18.0</td>\n",
" <td>99.0</td>\n",
" <td>2660.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Harry Potter Schloss Hogwarts™ mit Schlos...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>LEGO Disney - Kamera Hommage an Walt Disney ...</td>\n",
" <td>119.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/191541/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>18.0</td>\n",
" <td>99.0</td>\n",
" <td>811.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Disney</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>LEGO Technic - Audi RS Q e-tron (42160)</td>\n",
" <td>159.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/190646/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>10.0</td>\n",
" <td>99.0</td>\n",
" <td>914.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Technic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>LEGO Technic - Lamborghini Huracán Tecnica (42...</td>\n",
" <td>59.95</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/190611/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>9.0</td>\n",
" <td>99.0</td>\n",
" <td>806.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Technic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>LEGO City - Personen-Schnellzug (60337)</td>\n",
" <td>169.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/138905/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>7.0</td>\n",
" <td>99.0</td>\n",
" <td>764.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO City</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 47 columns</p>\n",
"</div>"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Size of the Dataset"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:24:51.165046097Z",
"start_time": "2026-06-04T12:24:51.100237600Z"
}
},
"source": [
"df.shape"
],
"outputs": [
{
"data": {
"text/plain": [
"(890, 47)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With 890 observations and 47 variables (see the output of `df.shape` above), the dataset is of a workable size for an approach with machine learning models. There is no fixed rule that says when a dataset is too small."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Number of Missing Values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The method `df.isnull()` returns a DataFrame of the same shape with Boolean values that indicate whether each entry holds a value (`False`, i.e. 0) or is NaN (`True`, i.e. 1). Because Python treats `True` as 1 and `False` as 0 in arithmetic, the arithmetic mean `.mean()` of each column gives the share of missing entries for each variable."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:24:53.388897605Z",
"start_time": "2026-06-04T12:24:53.280296245Z"
}
},
"source": [
"# Compute the share of missing entries\n",
"missing_values = df.isnull().mean() * 100  # percentage of missing values per column (variable)\n",
"missing_values = missing_values[missing_values > 0]  # keep only columns that have missing entries\n",
"\n",
"# Sort the columns by share of missing values, highest first\n",
"missing_values = missing_values.sort_values(ascending=False)\n",
"\n",
"# Print the resulting Series\n",
"print(missing_values)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Freigabedatum (TT/MM/JJ) 87.528090\n",
"Warnung vor Erstickungsgefahr 87.303371\n",
"Fernbedienung erforderlich 87.078652\n",
"Fernbedienung enthalten 87.078652\n",
"Anzahl der Versandkartons pro Vorlagenkarton 86.067416\n",
"Anzahl der Pakete 85.505618\n",
"EU TSD Sprache 85.168539\n",
"null 84.606742\n",
"Nettogewicht Hauptkarton 84.044944\n",
"Figur enthalten 83.483146\n",
"Menge pro Versandkarton 83.258427\n",
"Produkte pro Palettenlage 82.696629\n",
"Nachhaltigkeitszertifikate 82.471910\n",
"Breite des Versandkartons 82.247191\n",
"Lagenanzahl pro Palette 82.134831\n",
"Länge des Versandkartons 82.134831\n",
"Gewicht Versandkarton 82.134831\n",
"Versandkarton pro Palettenlage 82.134831\n",
"Anzahl Produkte pro Versandkarton 82.134831\n",
"Hauptkarton GTIN (EAN/UPC) 82.134831\n",
"Höhe des Versandkartons 82.134831\n",
"Menge pro Packung 78.876404\n",
"Montagezeit 78.876404\n",
"Akkus/Batterien enthalten 51.348315\n",
"ASIN 27.865169\n",
"Batterien erforderlich 23.483146\n",
"Sicherheitswarnung 22.584270\n",
"Montage erforderlich 21.685393\n",
"Material 21.573034\n",
"Ursprungsland 21.573034\n",
"Vorgeschlagenes Geschlecht 16.853933\n",
"Paketgewicht 16.292135\n",
"Verpackungshöhe 16.292135\n",
"Verpackungstiefe 16.179775\n",
"Verpackungsbreite 16.179775\n",
"Empfohlenes Alter in Jahren (max.) 15.730337\n",
"Sound-Effekte 15.617978\n",
"EU TSD Warnung 15.280899\n",
"Verpackungsart 14.382022\n",
"Anzahl Teile 14.044944\n",
"Empfohlenes Alter in Jahren (mind.) 13.820225\n",
"Produktfarbe 13.595506\n",
"Produkttyp 13.595506\n",
"dtype: float64\n"
]
}
],
"execution_count": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns with a comparatively low share of missing values, such as `Produkttyp` and `Produktfarbe` (about 14 % each), can be handled by dropping the affected rows or by imputing the values. By contrast, variables such as `Fernbedienung erforderlich` and `Fernbedienung enthalten` show a very high share of missing values (about 87 %). Their missing entries should therefore not be imputed. In such cases the absence of a value presumably carries real meaning: a missing entry in `Fernbedienung erforderlich` or `Fernbedienung enthalten` most likely means that the set neither requires nor includes a remote control."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If I take the variable `Freigabedatum (TT/MM/JJ)` as a reference, whose value is missing in 87.5 % of the rows, at most 12.5 % of the observations can be complete. Objectively, this suggests that the dataset is of poor quality.\n",
"\n",
"An alternative objective measure, the share of complete variables, points in the same direction: 43 of the 47 variables contain missing values, so fewer than 10 % of the variables are complete.\n",
"The share of complete variables is computed as follows:  \n",
"1 - (number of variables with missing values / total number of variables)"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:17:35.272150011Z",
"start_time": "2026-06-04T12:17:35.207528254Z"
}
},
"source": "print(f\"Share of complete variables: {1 - len(missing_values)/df.shape[1]}\")",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Share of complete variables: 0.06521739130434778\n"
]
}
],
"execution_count": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subjective Data Assessment\n",
"Looking at the dataset and the given literature, one notices that the dataset has some weaknesses. At the same time, it represents a realistic problem that reflects the conditions found in practice. For this reason, the dataset is a good starting point for the question to be analysed. I therefore rate its subjective quality, with the given variables, as good."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"- Nuno Antonio, Ana Almeida, and Luis Nunes, Predicting Hotel Bookings Cancellation With a Machine Learning Classification Model, 16th IEEE International Conference on Machine Learning and Applications, 2017a\n",
"- Nuno Antonio, Ana Almeida, and Luis Nunes, Predicting hotel booking cancellations to decrease uncertainty and increase revenue, Tourism and Management Studies, Volume 13, Issue 2, 2017b\n",
"- Nuno Antonio, Ana Almeida, and Luis Nunes, Big Data in Hotel Revenue Management: Exploring Cancellation Drivers to Gain Insights Into Booking Cancellation Behavior, Cornell Hospitality Quarterly, Volume 60 (4), 2019a\n",
"- Nuno Antonio, Ana Almeida, and Luis Nunes, Hotel booking demand datasets, Data in Brief, Volume 22, 2019b"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
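
The missing-value measures used in the notebook above can be tried out on a small, made-up DataFrame. This is only a minimal sketch: the column names and values are hypothetical and are not taken from `lego_daten_tidy.csv`; the only assumption is that pandas is installed.

```python
import pandas as pd

# Hypothetical toy data (not from lego_daten_tidy.csv); None marks a missing entry.
toy = pd.DataFrame(
    {
        "name": ["Set A", "Set B", "Set C", "Set D"],
        "price": [19.9, 49.0, 99.0, 149.0],
        "pieces": [120.0, None, 800.0, 1500.0],
        "release_date": [None, None, None, "01/03/24"],
    }
)

# isnull() yields True/False per cell; mean() counts True as 1 and False as 0,
# so the column mean is the share of missing entries in that column.
missing_share = toy.isnull().mean() * 100
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

# Share of variables (columns) that have no missing value at all.
complete_share = 1 - len(missing_share) / toy.shape[1]

print(missing_share)
print(f"Share of complete variables: {complete_share:.1%}")
```

In this toy example two of the four columns contain gaps, so half of the variables are complete. Formatting the ratio with `:.1%` prints it as a percentage, which avoids confusing a fraction with a percent value.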


@ -0,0 +1,567 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7cff0901",
"metadata": {},
"source": [
"# Data Quality: Sample Solution (Lego)\n",
"\n",
"**Module M04: Data Strategy, Quality & Tidy Data · `lego_daten_tidy.csv`**  \n",
"Prof. Dr. Yves Staudt\n",
"\n",
"We assess the **objective** (measurable) and **subjective** (purpose-dependent)\n",
"data quality and check whether the dataset is **tidy**."
]
},
{
"cell_type": "markdown",
"id": "ef13c590",
"metadata": {},
"source": [
"## Packages & Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ca37b23b",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-03T21:52:04.296931Z",
"iopub.status.busy": "2026-06-03T21:52:04.296673Z",
"iopub.status.idle": "2026-06-03T21:52:04.789460Z",
"shell.execute_reply": "2026-06-03T21:52:04.789262Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(890, 47)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>price</th>\n",
" <th>link_picture</th>\n",
" <th>Produkttyp</th>\n",
" <th>Vorgeschlagenes Geschlecht</th>\n",
" <th>Empfohlenes Alter in Jahren (mind.)</th>\n",
" <th>Empfohlenes Alter in Jahren (max.)</th>\n",
" <th>Anzahl Teile</th>\n",
" <th>Sound-Effekte</th>\n",
" <th>Produktfarbe</th>\n",
" <th>...</th>\n",
" <th>Anzahl Produkte pro Versandkarton</th>\n",
" <th>Hauptkarton GTIN (EAN/UPC)</th>\n",
" <th>Lagenanzahl pro Palette</th>\n",
" <th>Produkte pro Palettenlage</th>\n",
" <th>Fernbedienung erforderlich</th>\n",
" <th>Warnung vor Erstickungsgefahr</th>\n",
" <th>Fernbedienung enthalten</th>\n",
" <th>Akkus/Batterien enthalten</th>\n",
" <th>null</th>\n",
" <th>LegoCategory</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>LEGO Harry Potter Schloss Hogwarts™ mit Schlos...</td>\n",
" <td>179.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/191538/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>18.0</td>\n",
" <td>99.0</td>\n",
" <td>2660.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Harry Potter Schloss Hogwarts™ mit Schlos...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>LEGO Disney - Kamera Hommage an Walt Disney ...</td>\n",
" <td>119.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/191541/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>18.0</td>\n",
" <td>99.0</td>\n",
" <td>811.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Disney</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>LEGO Technic - Audi RS Q e-tron (42160)</td>\n",
" <td>159.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/190646/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>10.0</td>\n",
" <td>99.0</td>\n",
" <td>914.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Technic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>LEGO Technic - Lamborghini Huracán Tecnica (42...</td>\n",
" <td>59.95</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/190611/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>9.0</td>\n",
" <td>99.0</td>\n",
" <td>806.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Technic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>LEGO City - Personen-Schnellzug (60337)</td>\n",
" <td>169.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/138905/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>7.0</td>\n",
" <td>99.0</td>\n",
" <td>764.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO City</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 47 columns</p>\n",
"</div>"
],
"text/plain": [
" name price \\\n",
"0 LEGO Harry Potter Schloss Hogwarts™ mit Schlos... 179.00 \n",
"1 LEGO Disney - Kamera Hommage an Walt Disney ... 119.00 \n",
"2 LEGO Technic - Audi RS Q e-tron (42160) 159.00 \n",
"3 LEGO Technic - Lamborghini Huracán Tecnica (42... 59.95 \n",
"4 LEGO City - Personen-Schnellzug (60337) 169.00 \n",
"\n",
" link_picture Produkttyp \\\n",
"0 https://www.fcw.ch/api/ProcessRequest/191538/B... Bausatz \n",
"1 https://www.fcw.ch/api/ProcessRequest/191541/B... Bausatz \n",
"2 https://www.fcw.ch/api/ProcessRequest/190646/B... Bausatz \n",
"3 https://www.fcw.ch/api/ProcessRequest/190611/B... Bausatz \n",
"4 https://www.fcw.ch/api/ProcessRequest/138905/B... Bausatz \n",
"\n",
" Vorgeschlagenes Geschlecht Empfohlenes Alter in Jahren (mind.) \\\n",
"0 Junge/Mädchen 18.0 \n",
"1 Junge/Mädchen 18.0 \n",
"2 Junge/Mädchen 10.0 \n",
"3 Junge/Mädchen 9.0 \n",
"4 Junge/Mädchen 7.0 \n",
"\n",
" Empfohlenes Alter in Jahren (max.) Anzahl Teile Sound-Effekte \\\n",
"0 99.0 2660.0 \n",
"1 99.0 811.0 \n",
"2 99.0 914.0 \n",
"3 99.0 806.0 \n",
"4 99.0 764.0 \n",
"\n",
" Produktfarbe ... Anzahl Produkte pro Versandkarton \\\n",
"0 Mehrfarbig ... NaN \n",
"1 Mehrfarbig ... NaN \n",
"2 Mehrfarbig ... NaN \n",
"3 Mehrfarbig ... NaN \n",
"4 Mehrfarbig ... NaN \n",
"\n",
" Hauptkarton GTIN (EAN/UPC) Lagenanzahl pro Palette \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Produkte pro Palettenlage Fernbedienung erforderlich \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Warnung vor Erstickungsgefahr Fernbedienung enthalten \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Akkus/Batterien enthalten null \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" LegoCategory \n",
"0 LEGO Harry Potter Schloss Hogwarts™ mit Schlos... \n",
"1 LEGO Disney \n",
"2 LEGO Technic \n",
"3 LEGO Technic \n",
"4 LEGO City \n",
"\n",
"[5 rows x 47 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"df = pd.read_csv(\"../../../Data/lego/lego_daten_tidy.csv\")\n",
"print(df.shape)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "58a491c1",
"metadata": {},
"source": [
"## 1 Objective Data Quality\n",
"\n",
"Measurable and independent of the user: **completeness**, **consistency**, **accuracy**."
]
},
{
"cell_type": "markdown",
"id": "bdd7106b",
"metadata": {},
"source": [
"### Completeness (missing values)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dbd27974",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-03T21:52:04.790694Z",
"iopub.status.busy": "2026-06-03T21:52:04.790633Z",
"iopub.status.idle": "2026-06-03T21:52:04.793783Z",
"shell.execute_reply": "2026-06-03T21:52:04.793592Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overall share of missing values: 49.1 %\n",
"\n",
"Columns with the most gaps:\n",
"Freigabedatum (TT/MM/JJ) 87.5\n",
"Warnung vor Erstickungsgefahr 87.3\n",
"Fernbedienung erforderlich 87.1\n",
"Fernbedienung enthalten 87.1\n",
"Anzahl der Versandkartons pro Vorlagenkarton 86.1\n",
"Anzahl der Pakete 85.5\n",
"EU TSD Sprache 85.2\n",
"null 84.6\n",
"Nettogewicht Hauptkarton 84.0\n",
"Figur enthalten 83.5\n",
"dtype: float64\n"
]
}
],
"source": [
"na = (df.isna().mean()*100).round(1)\n",
"print('Overall share of missing values: %.1f %%' % (df.isna().mean().mean()*100))\n",
"print('\\nColumns with the most gaps:')\n",
"print(na.sort_values(ascending=False).head(10))"
]
},
{
"cell_type": "markdown",
"id": "a750b3c3",
"metadata": {},
"source": [
"**Evaluation.** Many columns have high missing rates (shipping and packaging details are often more than 80 % empty). For a price analysis, the most relevant columns are `price`, `Anzahl Teile` and `Empfohlenes Alter`; these are considerably more complete."
]
},
{
"cell_type": "markdown",
"id": "290f7886",
"metadata": {},
"source": [
"### Consistency (duplicates, constant columns, types)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "049719ac",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-03T21:52:04.794890Z",
"iopub.status.busy": "2026-06-03T21:52:04.794813Z",
"iopub.status.idle": "2026-06-03T21:52:04.800139Z",
"shell.execute_reply": "2026-06-03T21:52:04.799954Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Duplicate rows: 0\n",
"Constant columns (no information content): 12\n",
" ['Sound-Effekte', 'Material', 'Batterien erforderlich', 'Menge pro Packung', 'Montage erforderlich', 'Anzahl der Pakete', 'Figur enthalten', 'Fernbedienung erforderlich']\n",
"Data types:\n",
"object 36\n",
"float64 11\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"dups = df.duplicated().sum()\n",
"# Columns with at most one distinct non-missing value\n",
"const = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]\n",
"print('Duplicate rows:', int(dups))\n",
"print('Constant columns (no information content):', len(const))\n",
"print(' ', const[:8])\n",
"print('Data types:')\n",
"print(df.dtypes.value_counts())"
]
},
{
"cell_type": "markdown",
"id": "658c1011",
"metadata": {},
"source": [
"**Evaluation.** There are **12 constant** columns (including a column named `null`); they are redundant or were collected inconsistently. Several quantities that are actually numeric are stored as text (`object`), which is a consistency problem for later calculations."
]
},
{
"cell_type": "markdown",
"id": "5aa5a284",
"metadata": {},
"source": [
"### Accuracy (plausibility)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "735e79b3",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-03T21:52:04.801270Z",
"iopub.status.busy": "2026-06-03T21:52:04.801194Z",
"iopub.status.idle": "2026-06-03T21:52:04.803055Z",
"shell.execute_reply": "2026-06-03T21:52:04.802889Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"price: min=3.00 max=949.00 median=54.95\n",
"Negative or zero prices: 0\n"
]
}
],
"source": [
"print('price: min=%.2f max=%.2f median=%.2f' % (df['price'].min(), df['price'].max(), df['price'].median()))\n",
"print('Negative or zero prices:', int((df['price'] <= 0).sum()))"
]
},
{
"cell_type": "markdown",
"id": "8f1109f3",
"metadata": {},
"source": [
"**Evaluation.** Without a reference, accuracy can only be checked via plausibility: the prices lie in a sensible range (no zero or negative values), and individual very high values (up to ~949) are plausible outliers (large sets)."
]
},
{
"cell_type": "markdown",
"id": "dd6a6a4c",
"metadata": {},
"source": [
"## 2 Subjective Data Quality\n",
"\n",
"Depends on the **intended use** (suitability, relevance, trustworthiness).\n",
"\n",
"**Assessment for the purpose *price prediction of Lego sets*:**\n",
"- *Relevant* are `price` (target), `Anzahl Teile`, `Empfohlenes Alter`, `Produktfarbe` and\n",
"  `Ursprungsland`; these are usable.\n",
"- *Of little use* are the many sparsely filled logistics and packaging columns and the\n",
"  constant columns; they add no value for this purpose.\n",
"- *Trustworthiness:* medium. The core variables look solid, but the many\n",
"  gaps and text-instead-of-number fields require preparation (→ M08/M11).\n",
"\n",
"For a *different* purpose (e.g. logistics planning), exactly the packaging columns\n",
"discarded here would be central, so data quality is **relative to the purpose**."
]
},
{
"cell_type": "markdown",
"id": "cb2d47fb",
"metadata": {},
"source": [
"## 3 Tidy Data\n",
"\n",
"**Tidy** (Wickham): (1) each *variable* forms a column, (2) each *observation* forms a row,\n",
"(3) each *type of observational unit* forms a table."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e29cedbb",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-03T21:52:04.804148Z",
"iopub.status.busy": "2026-06-03T21:52:04.804093Z",
"iopub.status.idle": "2026-06-03T21:52:04.805688Z",
"shell.execute_reply": "2026-06-03T21:52:04.805511Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rows = observations (Lego sets): 890\n",
"Unique product names: 890\n",
"Columns = variables: 47\n",
"Empty/pseudo column present: True\n"
]
}
],
"source": [
"print('Rows = observations (Lego sets):', len(df))\n",
"print('Unique product names:', df['name'].nunique())\n",
"print('Columns = variables:', df.shape[1])\n",
"print('Empty/pseudo column present:', 'null' in df.columns)"
]
},
{
"cell_type": "markdown",
"id": "792c7f1b",
"metadata": {},
"source": [
"**Evaluation.** The dataset is **largely tidy**: one row per Lego set, one column per property, one observational unit (products). **Violations:** the column `null` is not a real variable, some numeric quantities are stored as text, and constant columns carry no variable information. After cleaning (removing the constant columns and `null`, correcting the types) it is properly tidy."
]
},
{
"cell_type": "markdown",
"id": "25ca3d6b",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"Objectively: high completeness only for the core variables, plus some consistency issues (constant columns, text instead of numbers). Subjectively: well suited for price prediction, less so for logistics. Structure: largely tidy with small, fixable violations."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
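
The consistency checks discussed in the notebook above (constant columns, numbers stored as text) can be sketched on made-up data. The toy DataFrame and its column names are hypothetical, and the text-to-number check is an illustrative addition that does not appear in the notebook itself.

```python
import pandas as pd

# Hypothetical toy data (not from lego_daten_tidy.csv).
toy = pd.DataFrame(
    {
        "name": ["Set A", "Set B", "Set C"],
        "material": ["Plastic", "Plastic", None],  # constant apart from a gap
        "weight": ["1.2", "0.8", None],            # numbers stored as text
        "price": [19.9, 49.0, 99.0],
    }
)

# A column counts as constant if it has at most one distinct non-missing value.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

# Non-numeric columns whose non-missing values all parse as numbers.
numeric_as_text = []
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        continue
    values = toy[col].dropna()
    parsed = pd.to_numeric(values, errors="coerce")
    if len(values) > 0 and parsed.notna().all():
        numeric_as_text.append(col)

print("Constant columns:", constant_cols)
print("Numeric values stored as text:", numeric_as_text)
```

Columns flagged by the second check are candidates for a type conversion with `pd.to_numeric`. Real data may need extra cleaning first, for example stripping units or replacing decimal commas.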


@ -0,0 +1,625 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ad61b27f",
"metadata": {},
"source": [
"# Data Formats & Variable Types: Sample Solution (Lego)\n",
"\n",
"**Author:** Prof. Dr. Yves Staudt, M05 Data Formats & Modalities\n",
"\n",
"Sample solution for **Exercise 4 (Coding)** on the Lego dataset (`lego_daten_tidy.csv`):\n",
"\n",
"1. Load the data and get an overview (shape, dtypes, first rows).\n",
"2. Determine the variable type of each column: numeric (discrete/continuous) vs. categorical\n",
"   (nominal/ordinal), with an example and a justification.\n",
"3. Compute the cardinality of the categorical variables.\n",
"\n",
"The structure follows the code template `3_Codevorlagen/Datenformate.py`: one function per step,\n",
"which prints its result and justifies it with a `-> …` line."
]
},
{
"cell_type": "markdown",
"id": "5e6ddf02",
"metadata": {},
"source": [
"## Packages & Data"
]
},
{
"cell_type": "code",
"id": "6c786495",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-05T12:46:13.867297447Z",
"start_time": "2026-06-05T12:46:12.834132658Z"
}
},
"source": [
"import pandas as pd\n",
"\n",
"# Path relative to the notebook directory (.../M05_*/5_Loesungen/)\n",
"DATA_PATH = \"../data/lego/lego_daten_tidy.csv\"\n",
"\n",
"df = pd.read_csv(DATA_PATH)\n",
"print(df.shape)\n",
"df.head()"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(890, 47)\n"
]
},
{
"data": {
"text/plain": [
" name price \\\n",
"0 LEGO Harry Potter Schloss Hogwarts™ mit Schlos... 179.00 \n",
"1 LEGO Disney - Kamera Hommage an Walt Disney ... 119.00 \n",
"2 LEGO Technic - Audi RS Q e-tron (42160) 159.00 \n",
"3 LEGO Technic - Lamborghini Huracán Tecnica (42... 59.95 \n",
"4 LEGO City - Personen-Schnellzug (60337) 169.00 \n",
"\n",
" link_picture Produkttyp \\\n",
"0 https://www.fcw.ch/api/ProcessRequest/191538/B... Bausatz \n",
"1 https://www.fcw.ch/api/ProcessRequest/191541/B... Bausatz \n",
"2 https://www.fcw.ch/api/ProcessRequest/190646/B... Bausatz \n",
"3 https://www.fcw.ch/api/ProcessRequest/190611/B... Bausatz \n",
"4 https://www.fcw.ch/api/ProcessRequest/138905/B... Bausatz \n",
"\n",
" Vorgeschlagenes Geschlecht Empfohlenes Alter in Jahren (mind.) \\\n",
"0 Junge/Mädchen 18.0 \n",
"1 Junge/Mädchen 18.0 \n",
"2 Junge/Mädchen 10.0 \n",
"3 Junge/Mädchen 9.0 \n",
"4 Junge/Mädchen 7.0 \n",
"\n",
" Empfohlenes Alter in Jahren (max.) Anzahl Teile Sound-Effekte \\\n",
"0 99.0 2660.0 \n",
"1 99.0 811.0 \n",
"2 99.0 914.0 \n",
"3 99.0 806.0 \n",
"4 99.0 764.0 \n",
"\n",
" Produktfarbe ... Anzahl Produkte pro Versandkarton \\\n",
"0 Mehrfarbig ... NaN \n",
"1 Mehrfarbig ... NaN \n",
"2 Mehrfarbig ... NaN \n",
"3 Mehrfarbig ... NaN \n",
"4 Mehrfarbig ... NaN \n",
"\n",
" Hauptkarton GTIN (EAN/UPC) Lagenanzahl pro Palette \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Produkte pro Palettenlage Fernbedienung erforderlich \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Warnung vor Erstickungsgefahr Fernbedienung enthalten \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Akkus/Batterien enthalten null \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" LegoCategory \n",
"0 LEGO Harry Potter Schloss Hogwarts™ mit Schlos... \n",
"1 LEGO Disney \n",
"2 LEGO Technic \n",
"3 LEGO Technic \n",
"4 LEGO City \n",
"\n",
"[5 rows x 47 columns]"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>price</th>\n",
" <th>link_picture</th>\n",
" <th>Produkttyp</th>\n",
" <th>Vorgeschlagenes Geschlecht</th>\n",
" <th>Empfohlenes Alter in Jahren (mind.)</th>\n",
" <th>Empfohlenes Alter in Jahren (max.)</th>\n",
" <th>Anzahl Teile</th>\n",
" <th>Sound-Effekte</th>\n",
" <th>Produktfarbe</th>\n",
" <th>...</th>\n",
" <th>Anzahl Produkte pro Versandkarton</th>\n",
" <th>Hauptkarton GTIN (EAN/UPC)</th>\n",
" <th>Lagenanzahl pro Palette</th>\n",
" <th>Produkte pro Palettenlage</th>\n",
" <th>Fernbedienung erforderlich</th>\n",
" <th>Warnung vor Erstickungsgefahr</th>\n",
" <th>Fernbedienung enthalten</th>\n",
" <th>Akkus/Batterien enthalten</th>\n",
" <th>null</th>\n",
" <th>LegoCategory</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>LEGO Harry Potter Schloss Hogwarts™ mit Schlos...</td>\n",
" <td>179.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/191538/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>18.0</td>\n",
" <td>99.0</td>\n",
" <td>2660.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Harry Potter Schloss Hogwarts™ mit Schlos...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>LEGO Disney - Kamera Hommage an Walt Disney ...</td>\n",
" <td>119.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/191541/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>18.0</td>\n",
" <td>99.0</td>\n",
" <td>811.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Disney</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>LEGO Technic - Audi RS Q e-tron (42160)</td>\n",
" <td>159.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/190646/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>10.0</td>\n",
" <td>99.0</td>\n",
" <td>914.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Technic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>LEGO Technic - Lamborghini Huracán Tecnica (42...</td>\n",
" <td>59.95</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/190611/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>9.0</td>\n",
" <td>99.0</td>\n",
" <td>806.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO Technic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>LEGO City - Personen-Schnellzug (60337)</td>\n",
" <td>169.00</td>\n",
" <td>https://www.fcw.ch/api/ProcessRequest/138905/B...</td>\n",
" <td>Bausatz</td>\n",
" <td>Junge/Mädchen</td>\n",
" <td>7.0</td>\n",
" <td>99.0</td>\n",
" <td>764.0</td>\n",
" <td></td>\n",
" <td>Mehrfarbig</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LEGO City</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 47 columns</p>\n",
"</div>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 1
},
{
"cell_type": "markdown",
"id": "3f0c8723",
"metadata": {},
"source": [
"## Step 1: Shape, data types and first rows"
]
},
{
"cell_type": "code",
"id": "fa16539f",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-05T12:46:14.177428764Z",
"start_time": "2026-06-05T12:46:13.981596271Z"
}
},
"source": [
"def report_overview(df: pd.DataFrame) -> None:\n",
"    \"\"\"Step 1: shape, data types and first rows.\"\"\"\n",
"    print(\"## 1 Overview\")\n",
"    print(f\"Shape: {df.shape[0]} observations, {df.shape[1]} variables\")\n",
"    print(\"\\nDistribution of data types:\")\n",
"    print(df.dtypes.value_counts().to_string())\n",
"    print(\"\\nFirst rows (excerpt):\")\n",
"    print(df[[\"name\", \"price\", \"Anzahl Teile\", \"Produktfarbe\"]].head(3).to_string(index=False))\n",
"    print(\n",
"        \"\\n-> read_csv parsed the cleaned numeric columns as float; text columns\\n\"\n",
"        \"   remain object. The dtype is a first hint at the conceptual variable\\n\"\n",
"        \"   type, but not a sufficient one.\\n\"\n",
"    )\n",
"\n",
"report_overview(df)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## 1 Overview\n",
"Shape: 890 observations, 47 variables\n",
"\n",
"Distribution of data types:\n",
"object 36\n",
"float64 11\n",
"\n",
"First rows (excerpt):\n",
" name price Anzahl Teile Produktfarbe\n",
"LEGO Harry Potter Schloss Hogwarts™ mit Schlossgelände (76419) 179.0 2660.0 Mehrfarbig \n",
" LEGO Disney - Kamera Hommage an Walt Disney (43230) 119.0 811.0 Mehrfarbig \n",
" LEGO Technic - Audi RS Q e-tron (42160) 159.0 914.0 Mehrfarbig \n",
"\n",
"-> read_csv parsed the cleaned numeric columns as float; text columns\n",
"   remain object. The dtype is a first hint at the conceptual variable\n",
"   type, but not a sufficient one.\n",
"\n"
]
}
],
"execution_count": 2
},
{
"cell_type": "markdown",
"id": "3e9aa7b3",
"metadata": {},
"source": [
"## Step 2: Numerical vs. categorical variables, with examples and justification"
]
},
{
"cell_type": "code",
"id": "b48f3faf",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-05T12:46:14.584567071Z",
"start_time": "2026-06-05T12:46:14.295527029Z"
}
},
"source": [
"def report_variable_types(df: pd.DataFrame) -> tuple[list[str], list[str]]:\n",
"    \"\"\"Step 2: numerical vs. categorical variables, with examples and justification.\"\"\"\n",
"    numeric = df.select_dtypes(\"number\").columns.tolist()\n",
"    categorical = df.select_dtypes(\"object\").columns.tolist()\n",
"    print(\"## 2 Variable types\")\n",
"    print(f\"numerical (float/int): {len(numeric)} categorical (object): {len(categorical)}\")\n",
"    print(\n",
"        \"\\nExamples and classification:\\n\"\n",
"        \"  numerical, continuous : price, Paketgewicht (measurable to arbitrary precision)\\n\"\n",
"        \"  numerical, discrete   : 'Anzahl Teile', 'Empfohlenes Alter ...' (countable)\\n\"\n",
"        \"  categorical, nominal  : Produktfarbe, Material, Ursprungsland, LegoCategory\\n\"\n",
"        \"                          (no natural order)\\n\"\n",
"        \"  categorical, ordinal  : not clearly present in the data set\"\n",
"    )\n",
"    print(\n",
"        \"\\n-> The type follows from the meaning, not just from the dtype: 'Anzahl Teile'\\n\"\n",
"        \"   is stored as float but is conceptually a discrete count. The type\\n\"\n",
"        \"   determines the encoding (categorical) or the scaling (numerical).\\n\"\n",
"    )\n",
"    return numeric, categorical\n",
"\n",
"report_variable_types(df)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## 2 Variable types\n",
"numerical (float/int): 11 categorical (object): 36\n",
"\n",
"Examples and classification:\n",
"  numerical, continuous : price, Paketgewicht (measurable to arbitrary precision)\n",
"  numerical, discrete   : 'Anzahl Teile', 'Empfohlenes Alter ...' (countable)\n",
"  categorical, nominal  : Produktfarbe, Material, Ursprungsland, LegoCategory\n",
"                          (no natural order)\n",
"  categorical, ordinal  : not clearly present in the data set\n",
"\n",
"-> The type follows from the meaning, not just from the dtype: 'Anzahl Teile'\n",
"   is stored as float but is conceptually a discrete count. The type\n",
"   determines the encoding (categorical) or the scaling (numerical).\n",
"\n"
]
},
{
"data": {
"text/plain": [
"(['price',\n",
" 'Empfohlenes Alter in Jahren (mind.)',\n",
" 'Empfohlenes Alter in Jahren (max.)',\n",
" 'Anzahl Teile',\n",
" 'Verpackungsbreite',\n",
" 'Verpackungstiefe',\n",
" 'Verpackungshöhe',\n",
" 'Paketgewicht',\n",
" 'Menge pro Packung',\n",
" 'Anzahl der Versandkartons pro Vorlagenkarton',\n",
" 'Anzahl der Pakete'],\n",
" ['name',\n",
" 'link_picture',\n",
" 'Produkttyp',\n",
" 'Vorgeschlagenes Geschlecht',\n",
" 'Sound-Effekte',\n",
" 'Produktfarbe',\n",
" 'Material',\n",
" 'Ursprungsland',\n",
" 'EU TSD Warnung',\n",
" 'EU TSD Sprache',\n",
" 'Sicherheitswarnung',\n",
" 'Batterien erforderlich',\n",
" 'Verpackungsart',\n",
" 'Montage erforderlich',\n",
" 'ASIN',\n",
" 'Montagezeit',\n",
" 'Freigabedatum (TT/MM/JJ)',\n",
" 'Figur enthalten',\n",
" 'Nachhaltigkeitszertifikate',\n",
" 'Menge pro Versandkarton',\n",
" 'Länge des Versandkartons',\n",
" 'Breite des Versandkartons',\n",
" 'Höhe des Versandkartons',\n",
" 'Gewicht Versandkarton',\n",
" 'Nettogewicht Hauptkarton',\n",
" 'Versandkarton pro Palettenlage',\n",
" 'Anzahl Produkte pro Versandkarton',\n",
" 'Hauptkarton GTIN (EAN/UPC)',\n",
" 'Lagenanzahl pro Palette',\n",
" 'Produkte pro Palettenlage',\n",
" 'Fernbedienung erforderlich',\n",
" 'Warnung vor Erstickungsgefahr',\n",
" 'Fernbedienung enthalten',\n",
" 'Akkus/Batterien enthalten',\n",
" 'null',\n",
" 'LegoCategory'])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 3
},
{
"cell_type": "markdown",
"id": "eb5be0c6",
"metadata": {},
"source": [
"## Step 3: Cardinality (number of distinct values) per categorical variable"
]
},
{
"cell_type": "code",
"id": "ca50d3b0",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-05T12:46:14.845459602Z",
"start_time": "2026-06-05T12:46:14.704469052Z"
}
},
"source": [
"def report_cardinality(df: pd.DataFrame) -> pd.Series:\n",
"    \"\"\"Step 3: cardinality (number of distinct values) per categorical variable.\"\"\"\n",
"    card = df.select_dtypes(\"object\").nunique().sort_values(ascending=False)\n",
"    print(\"## 3 Cardinality of the categorical variables\")\n",
"    print(\"Highest cardinality:\")\n",
"    print(card.head(6).to_string())\n",
"    print(\"\\nLowest cardinality:\")\n",
"    print(card.tail(6).to_string())\n",
"    print(\n",
"        \"\\n-> Very high cardinality (e.g. name and link_picture with 890 = one value per row,\\n\"\n",
"        \"   or ASIN) marks ID-like variables that are not directly suitable for one-hot\\n\"\n",
"        \"   encoding. Low cardinality (2-5) is easy to encode. Cardinality 1 marks\\n\"\n",
"        \"   constant columns, which carry no information.\\n\"\n",
"    )\n",
"    return card\n",
"\n",
"report_cardinality(df)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## 3 Cardinality of the categorical variables\n",
"Highest cardinality:\n",
"name 890\n",
"link_picture 890\n",
"ASIN 640\n",
"Hauptkarton GTIN (EAN/UPC) 159\n",
"LegoCategory 149\n",
"Gewicht Versandkarton 143\n",
"\n",
"Lowest cardinality:\n",
"Material 1\n",
"Fernbedienung erforderlich 1\n",
"Warnung vor Erstickungsgefahr 1\n",
"Fernbedienung enthalten 1\n",
"Akkus/Batterien enthalten 1\n",
"null 1\n",
"\n",
"-> Very high cardinality (e.g. name and link_picture with 890 = one value per row,\n",
"   or ASIN) marks ID-like variables that are not directly suitable for one-hot\n",
"   encoding. Low cardinality (2-5) is easy to encode. Cardinality 1 marks\n",
"   constant columns, which carry no information.\n",
"\n"
]
},
{
"data": {
"text/plain": [
"name 890\n",
"link_picture 890\n",
"ASIN 640\n",
"Hauptkarton GTIN (EAN/UPC) 159\n",
"LegoCategory 149\n",
"Gewicht Versandkarton 143\n",
"Nettogewicht Hauptkarton 121\n",
"Breite des Versandkartons 32\n",
"Produkte pro Palettenlage 31\n",
"Länge des Versandkartons 30\n",
"Freigabedatum (TT/MM/JJ) 27\n",
"Höhe des Versandkartons 23\n",
"Versandkarton pro Palettenlage 18\n",
"Lagenanzahl pro Palette 8\n",
"Anzahl Produkte pro Versandkarton 8\n",
"Produktfarbe 8\n",
"Menge pro Versandkarton 8\n",
"Ursprungsland 7\n",
"EU TSD Sprache 6\n",
"EU TSD Warnung 6\n",
"Verpackungsart 4\n",
"Vorgeschlagenes Geschlecht 3\n",
"Sicherheitswarnung 3\n",
"Produkttyp 2\n",
"Nachhaltigkeitszertifikate 2\n",
"Montagezeit 2\n",
"Sound-Effekte 1\n",
"Batterien erforderlich 1\n",
"Figur enthalten 1\n",
"Montage erforderlich 1\n",
"Material 1\n",
"Fernbedienung erforderlich 1\n",
"Warnung vor Erstickungsgefahr 1\n",
"Fernbedienung enthalten 1\n",
"Akkus/Batterien enthalten 1\n",
"null 1\n",
"dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 4
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,489 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Formats\n",
"Author: Prof. Dr. Yves Staudt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook discusses data formats."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Packages"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:28:49.505256930Z",
"start_time": "2026-06-04T12:28:48.009495077Z"
}
},
"source": [
"import pandas as pd\n",
"\n",
"# Feature Engine\n",
"from feature_engine.imputation import (\n",
"    AddMissingIndicator,\n",
"    MeanMedianImputer,\n",
"    CategoricalImputer,\n",
")\n",
"\n",
"from feature_engine.encoding import RareLabelEncoder, OneHotEncoder, CountFrequencyEncoder\n",
"from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures\n",
"from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection\n",
"from feature_engine import transformation as vt\n",
"from feature_engine.wrappers import SklearnTransformerWrapper\n",
"\n",
"# Scikit-learn: data splitting and scaling\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Visualisation\n",
"import matplotlib.pyplot as plt\n",
"import plotly.express as px"
],
"outputs": [],
"execution_count": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Data"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:25.282827384Z",
"start_time": "2026-06-04T12:29:24.966894168Z"
}
},
"source": [
"df = pd.read_csv(\"../data/hotel_bookings/hotel_bookings.csv\")\n",
"print(f\"Size of the data set: {df.shape}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Size of the data set: (119390, 32)\n"
]
}
],
"execution_count": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manually Dropping Variables\n",
"Some variables, such as the exact year, specific days or calendar weeks, add only limited value to an analysis. We therefore drop the variables `arrival_date_year`, `arrival_date_week_number` and `arrival_date_day_of_month`."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:26.892202838Z",
"start_time": "2026-06-04T12:29:26.816170364Z"
}
},
"source": [
"df = df.drop(columns=['arrival_date_year','arrival_date_week_number','arrival_date_day_of_month'])\n",
"print(f\"Shape of the data frame after dropping the variables: {df.shape}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of the data frame after dropping the variables: (119390, 29)\n"
]
}
],
"execution_count": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing Values\n",
"First, we look at which variables contain missing values and how large their share is."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:29.419975590Z",
"start_time": "2026-06-04T12:29:29.344252165Z"
}
},
"source": [
"missing_values = df.isnull().mean() * 100\n",
"missing_values = missing_values[missing_values > 0]\n",
"\n",
"missing_values = missing_values.sort_values(ascending=False)\n",
"\n",
"print(f\"Share of missing values per variable (in %): {missing_values}\")\n",
"print(f\"Number of variables with missing values: {len(missing_values)}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Share of missing values per variable (in %): company 94.306893\n",
"agent 13.686238\n",
"country 0.408744\n",
"children 0.003350\n",
"dtype: float64\n",
"Number of variables with missing values: 4\n"
]
}
],
"execution_count": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We identified four variables with missing values. `company` has a very high share of missing values (94%) and `agent` a moderate one (about 14%), whereas `country` and `children` are missing in less than 0.5% of the rows. For `country` and `children` it could be reasonable to simply drop the affected rows. In this report, however, we impute the missing values so that the data set is complete."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train Test Split\n",
"Before preparing the data any further, we split it into a training and a test set. This step must happen before imputation: if we imputed the data set as a whole, information from the test set would unintentionally flow into the training set (data leakage), and the test set would no longer be \"unseen\".\n",
"We therefore fit the data preparation methods, such as the imputation of missing values, on the training set only and then apply them to the test set. This also mirrors the real use case, since new data must be preprocessed with the same, already fitted method.\n",
"\n",
"Here we prepare the data set with general steps, using the price (`adr`) as the target variable."
]
},
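{
"cell_type": "markdown",
"metadata": {},
"source": [
"The leakage argument above can be illustrated with a minimal sketch. It is not part of the original preparation steps: the toy data frame is made up, and `MeanMedianImputer` from `feature_engine` (imported at the top) merely stands in for whichever imputation method is used later. The imputer learns the median on the training split only and reuses that value for the test split."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sketch with made-up toy data, not the hotel data set\n",
"import numpy as np\n",
"import pandas as pd\n",
"from feature_engine.imputation import MeanMedianImputer\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"toy = pd.DataFrame({\"children\": [0.0, 1.0, np.nan, 2.0, 0.0, np.nan, 1.0, 0.0]})\n",
"toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)\n",
"\n",
"# fit() only sees the training split, so no test information leaks in\n",
"imputer = MeanMedianImputer(imputation_method=\"median\", variables=[\"children\"])\n",
"imputer.fit(toy_train)\n",
"\n",
"# The median learned on the training split is applied to both splits\n",
"toy_train_filled = imputer.transform(toy_train)\n",
"toy_test_filled = imputer.transform(toy_test)\n",
"\n",
"print(f\"Learned imputation values: {imputer.imputer_dict_}\")\n",
"print(f\"Missing values left in the test split: {toy_test_filled['children'].isnull().sum()}\")"
],
"outputs": [],
"execution_count": null
},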
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:33.765770122Z",
"start_time": "2026-06-04T12:29:33.709786292Z"
}
},
"source": [
"x = df.drop(columns = ['adr'])\n",
"y = df['adr']\n",
"print(f\"The shape of the data set with the explanatory variables is: {x.shape}\")\n",
"print(f\"The shape of the target variable is: {y.shape}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The shape of the data set with the explanatory variables is: (119390, 28)\n",
"The shape of the target variable is: (119390,)\n"
]
}
],
"execution_count": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the explanatory variables have been separated from the target variable, we can perform the train/test split."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:37.138431542Z",
"start_time": "2026-06-04T12:29:37.021226449Z"
}
},
"source": [
"X_train, X_test, y_train, y_test = train_test_split(\n",
" x,\n",
" y,\n",
" test_size=0.3,\n",
" random_state=0)\n",
"\n",
"print(f\"The shape of the training sample is: {X_train.shape}\")\n",
"print(f\"The shape of the test sample is: {X_test.shape}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The shape of the training sample is: (83573, 28)\n",
"The shape of the test sample is: (35817, 28)\n"
]
}
],
"execution_count": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Formats\n",
"We have learned that data formats play an important role, so we look at them first."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Categorical Variables"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:40.698467726Z",
"start_time": "2026-06-04T12:29:40.645339731Z"
}
},
"source": [
"X_train.dtypes"
],
"outputs": [
{
"data": {
"text/plain": [
"hotel object\n",
"is_canceled int64\n",
"lead_time int64\n",
"arrival_date_month object\n",
"stays_in_weekend_nights int64\n",
"stays_in_week_nights int64\n",
"adults int64\n",
"children float64\n",
"babies int64\n",
"meal object\n",
"country object\n",
"market_segment object\n",
"distribution_channel object\n",
"is_repeated_guest int64\n",
"previous_cancellations int64\n",
"previous_bookings_not_canceled int64\n",
"reserved_room_type object\n",
"assigned_room_type object\n",
"booking_changes int64\n",
"deposit_type object\n",
"agent float64\n",
"company float64\n",
"days_in_waiting_list int64\n",
"customer_type object\n",
"required_car_parking_spaces int64\n",
"total_of_special_requests int64\n",
"reservation_status object\n",
"reservation_status_date object\n",
"dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": 10
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:42.873734903Z",
"start_time": "2026-06-04T12:29:42.816563590Z"
}
},
"source": [
"# Extract column names with data type 'object'.\n",
"# This covers all columns that pandas could not read as purely numeric (typically text).\n",
"object_columns = X_train.select_dtypes(include=['object']).columns\n",
"\n",
"# Print the column names with 'object' data type\n",
"print(f\"List of categorical variables: {object_columns}\")\n",
"\n",
"# Length of the list\n",
"print(f\"Length of the list of categorical variables: {len(object_columns)}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"List of categorical variables: Index(['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',\n",
" 'distribution_channel', 'reserved_room_type', 'assigned_room_type',\n",
" 'deposit_type', 'customer_type', 'reservation_status',\n",
" 'reservation_status_date'],\n",
" dtype='object')\n",
"Length of the list of categorical variables: 12\n"
]
}
],
"execution_count": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we look at the cardinality of the categorical variables."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:45.145710821Z",
"start_time": "2026-06-04T12:29:45.039519892Z"
}
},
"source": [
"for col in X_train.select_dtypes(include=['object']):\n",
"    # nunique() counts the distinct non-missing values, i.e. the cardinality\n",
"    cardinality = X_train[col].nunique()\n",
"    print(f\"{col}: {cardinality}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"hotel: 2\n",
"arrival_date_month: 12\n",
"meal: 5\n",
"country: 162\n",
"market_segment: 8\n",
"distribution_channel: 5\n",
"reserved_room_type: 10\n",
"assigned_room_type: 11\n",
"deposit_type: 3\n",
"customer_type: 4\n",
"reservation_status: 3\n",
"reservation_status_date: 916\n"
]
}
],
"execution_count": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variable `hotel` is binary and should always be converted using k-1 one-hot encoding. Compared with the other variables, `country` and `reservation_status_date` have a high cardinality. `reservation_status_date` is a date variable that cannot be used directly in its raw form. Since transformations of this variable have already been carried out, it is dropped in this step."
]
},
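{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of k-1 one-hot encoding: a binary variable such as `hotel` needs only a single 0/1 column, because the second category is implied when that column is 0. The sketch below uses `pd.get_dummies` with `drop_first=True` on a made-up mini data frame; the category labels are assumptions for illustration, and the course may use a different encoder later, for example `OneHotEncoder` from `feature_engine`."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sketch with made-up toy data; the category labels are assumed\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\"hotel\": [\"City Hotel\", \"Resort Hotel\", \"City Hotel\"]})\n",
"\n",
"# drop_first=True keeps k-1 dummy columns: one column for a binary variable\n",
"encoded = pd.get_dummies(toy, columns=[\"hotel\"], drop_first=True, dtype=int)\n",
"\n",
"print(encoded.columns.tolist())\n",
"print(encoded)"
],
"outputs": [],
"execution_count": null
},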
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:47.019115286Z",
"start_time": "2026-06-04T12:29:46.938571309Z"
}
},
"source": [
"X_train = X_train.drop(columns=['reservation_status_date'])\n",
"print(f\"\\nShape of the training set after dropping unnecessary variables: {X_train.shape}\")\n",
"X_test = X_test.drop(columns=['reservation_status_date'])\n",
"print(f\"\\nShape of the test set after dropping unnecessary variables: {X_test.shape}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Shape of the training set after dropping unnecessary variables: (83573, 27)\n",
"\n",
"Shape of the test set after dropping unnecessary variables: (35817, 27)\n"
]
}
],
"execution_count": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Numerical Variables"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2026-06-04T12:29:48.162866530Z",
"start_time": "2026-06-04T12:29:48.085829937Z"
}
},
"source": [
"# Extract column names with data type 'float'\n",
"numeric_columns = X_train.select_dtypes(include=['float']).columns\n",
"\n",
"# Print the column names with 'float' data type\n",
"print(f\"List of float variables: {numeric_columns}\")\n",
"\n",
"# Length of the list\n",
"print(f\"Length of the list of float variables: {len(numeric_columns)}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"List of float variables: Index(['children', 'agent', 'company'], dtype='object')\n",
"Length of the list of float variables: 3\n"
]
}
],
"execution_count": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only three variables in our data set have the dtype float. The many `int64` columns, such as `lead_time` or `adults`, are numerical as well but are not captured by this float filter. Two of the three float variables, `agent` and `company`, were originally categorical and were converted into numerical IDs by the anonymisation."
]
}
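,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The float-only filter above returns three columns. The following sketch, on a made-up mini data frame, shows two related points: `select_dtypes(include=\"number\")` also captures integer columns, and an ID-like column such as `agent` can be cast to `category` so that it is no longer treated as a number. How these IDs are handled later in the course is not specified here; the cast is only one possible option."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative sketch with made-up toy data, not the hotel data set\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    \"lead_time\": [10, 45, 3],          # integer count\n",
"    \"children\": [0.0, 2.0, np.nan],    # float because of the missing value\n",
"    \"agent\": [9.0, np.nan, 240.0],     # anonymised ID stored as float\n",
"})\n",
"\n",
"# 'float' misses the integer column, 'number' includes it\n",
"print(toy.select_dtypes(include=\"float\").columns.tolist())\n",
"print(toy.select_dtypes(include=\"number\").columns.tolist())\n",
"\n",
"# Treat the ID as a categorical variable instead of a quantity\n",
"toy[\"agent\"] = toy[\"agent\"].astype(\"category\")\n",
"print(toy.select_dtypes(include=\"number\").columns.tolist())"
],
"outputs": [],
"execution_count": null
}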
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
