Miscellaneous

2026-06-05 09:18:57 +02:00 · 2026-06-05 09:18:57 +02:00 · 89a8543159
commit 89a8543159
parent 9f14e6c888
1 changed files with 434 additions and 0 deletions
--- a/notebooks/Datenformate.ipynb
+++ b/notebooks/Datenformate.ipynb
@@ -0,0 +1,434 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data Formats\n",
+    "Author: Prof. Dr. Yves Staudt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this notebook we discuss data formats."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading Packages"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Feature Engine\n",
+    "from feature_engine.imputation import (\n",
+    "    AddMissingIndicator,\n",
+    "    MeanMedianImputer,\n",
+    "    CategoricalImputer,\n",
+    ")\n",
+    "\n",
+    "from feature_engine.encoding import RareLabelEncoder, OneHotEncoder, CountFrequencyEncoder\n",
+    "from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures\n",
+    "from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection\n",
+    "from feature_engine import transformation as vt\n",
+    "from feature_engine.wrappers import SklearnTransformerWrapper\n",
+    "\n",
+    "# Scikit-Learn - Splitting and Scaling\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "\n",
+    "# Visualisation\n",
+    "import matplotlib.pyplot as plt\n",
+    "import plotly.express as px"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Size of the data set: (119390, 32)\n"
+     ]
+    }
+   ],
+   "source": [
+    "df = pd.read_csv(\"../Data/hotel_bookings.csv\")\n",
+    "print(f\"Size of the data set: {df.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Manually Removing Variables\n",
+    "Certain variables, such as the exact year, specific days, or calendar weeks, add only limited value to an analysis. We therefore remove the variables *`arrival_date_year`*, *`arrival_date_week_number`*, and *`arrival_date_day_of_month`*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Shape of the data frame after removing variables: (119390, 29)\n"
+     ]
+    }
+   ],
+   "source": [
+    "df = df.drop(columns=['arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month'])\n",
+    "print(f\"Shape of the data frame after removing variables: {df.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Missing Values\n",
+    "First we look at the variables and their missing values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Share of missing values per variable (%): company     94.306893\n",
+      "agent       13.686238\n",
+      "country      0.408744\n",
+      "children     0.003350\n",
+      "dtype: float64\n",
+      "Number of variables with missing values: 4\n"
+     ]
+    }
+   ],
+   "source": [
+    "missing_values = df.isnull().mean() * 100  # share of missing values in percent\n",
+    "missing_values = missing_values[missing_values > 0]  # keep only affected variables\n",
+    "\n",
+    "missing_values = missing_values.sort_values(ascending=False)\n",
+    "\n",
+    "print(f\"Share of missing values per variable (%): {missing_values}\")\n",
+    "print(f\"Number of variables with missing values: {len(missing_values)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We identified four variables with missing values. `company` has a very high share of missing values (about 94%) and `agent` a moderate one (about 14%), while `country` and `children` are missing in well under 1% of the rows. For `country` and `children` it could be reasonable to simply drop the affected rows. In this notebook, however, we impute the missing values so that the data set is complete."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Train Test Split\n",
+    "Before preparing the data any further, we split it into a training and a test set. This step must happen before imputation. If we imputed the data set as a whole, information from the test set would unintentionally flow into the training set (data leakage), and the test set would no longer be \"unseen\".\n",
+    "For this reason, data preparation steps such as the imputation of missing values are learned on the training set only and then applied to the test set. This also mirrors the real use case, since new data must be preprocessed with the same, already fitted procedure.\n",
+    "\n",
+    "Here we prepare the data set with general steps, with the price (`adr`) as the target variable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The shape of the data set with the explanatory variables is: (119390, 28)\n",
+      "The shape of the target variable is: (119390,)\n"
+     ]
+    }
+   ],
+   "source": [
+    "x = df.drop(columns=['adr'])\n",
+    "y = df['adr']\n",
+    "print(f\"The shape of the data set with the explanatory variables is: {x.shape}\")\n",
+    "print(f\"The shape of the target variable is: {y.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that the explanatory variables have been separated from the target variable, we can perform the train-test split."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The shape of the training sample is: (83573, 28)\n",
+      "The shape of the test sample is: (35817, 28)\n"
+     ]
+    }
+   ],
+   "source": [
+    "X_train, X_test, y_train, y_test = train_test_split(\n",
+    "    x,\n",
+    "    y,\n",
+    "    test_size=0.3,\n",
+    "    random_state=0)\n",
+    "\n",
+    "print(f\"The shape of the training sample is: {X_train.shape}\")\n",
+    "print(f\"The shape of the test sample is: {X_test.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Formats\n",
+    "We have learned that data formats play an important role, so we look at them first."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Categorical Variables"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "hotel                              object\n",
+       "is_canceled                         int64\n",
+       "lead_time                           int64\n",
+       "arrival_date_month                 object\n",
+       "stays_in_weekend_nights             int64\n",
+       "stays_in_week_nights                int64\n",
+       "adults                              int64\n",
+       "children                          float64\n",
+       "babies                              int64\n",
+       "meal                               object\n",
+       "country                            object\n",
+       "market_segment                     object\n",
+       "distribution_channel               object\n",
+       "is_repeated_guest                   int64\n",
+       "previous_cancellations              int64\n",
+       "previous_bookings_not_canceled      int64\n",
+       "reserved_room_type                 object\n",
+       "assigned_room_type                 object\n",
+       "booking_changes                     int64\n",
+       "deposit_type                       object\n",
+       "agent                             float64\n",
+       "company                           float64\n",
+       "days_in_waiting_list                int64\n",
+       "customer_type                      object\n",
+       "required_car_parking_spaces         int64\n",
+       "total_of_special_requests           int64\n",
+       "reservation_status                 object\n",
+       "reservation_status_date            object\n",
+       "dtype: object"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "X_train.dtypes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "List of categorical variables: Index(['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',\n",
+      "       'distribution_channel', 'reserved_room_type', 'assigned_room_type',\n",
+      "       'deposit_type', 'customer_type', 'reservation_status',\n",
+      "       'reservation_status_date'],\n",
+      "      dtype='object')\n",
+      "Length of the list of categorical variables: 12\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Extract column names with data type 'object'\n",
+    "# This covers all columns that pandas does not store as numeric values (typically strings).\n",
+    "object_columns = X_train.select_dtypes(include=['object']).columns\n",
+    "\n",
+    "# Print the column names with 'object' data type\n",
+    "print(f\"List of categorical variables: {object_columns}\")\n",
+    "\n",
+    "# Length of the list\n",
+    "print(f\"Length of the list of categorical variables: {len(object_columns)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In a further step we look at the cardinality of the categorical variables."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "hotel: 2\n",
+      "arrival_date_month: 12\n",
+      "meal: 5\n",
+      "country: 162\n",
+      "market_segment: 8\n",
+      "distribution_channel: 5\n",
+      "reserved_room_type: 10\n",
+      "assigned_room_type: 11\n",
+      "deposit_type: 3\n",
+      "customer_type: 4\n",
+      "reservation_status: 3\n",
+      "reservation_status_date: 916\n"
+     ]
+    }
+   ],
+   "source": [
+    "for col in X_train.select_dtypes(include=['object']):\n",
+    "    cardinality = X_train[col].nunique()  # number of distinct non-missing values\n",
+    "    print(f\"{col}: {cardinality}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The variable `hotel` is binary and should be encoded with k-1 one-hot encoding, that is, as a single dummy column. The variables `country` (162 categories) and `reservation_status_date` (916 distinct values) have a high cardinality compared with the other variables. `reservation_status_date` is a date variable that cannot be used directly in its original form. Since transformations of this variable have already been implemented, it is removed in this step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Shape of the training set after removing unnecessary variables: (83573, 27)\n",
+      "\n",
+      "Shape of the test set after removing unnecessary variables: (35817, 27)\n"
+     ]
+    }
+   ],
+   "source": [
+    "X_train = X_train.drop(columns=['reservation_status_date'])\n",
+    "print(f\"\\nShape of the training set after removing unnecessary variables: {X_train.shape}\")\n",
+    "X_test = X_test.drop(columns=['reservation_status_date'])\n",
+    "print(f\"\\nShape of the test set after removing unnecessary variables: {X_test.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Numerical Variables"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "List of numerical (float) variables: Index(['children', 'agent', 'company'], dtype='object')\n",
+      "Length of the list of numerical (float) variables: 3\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Extract column names with data type 'float'\n",
+    "numeric_columns = X_train.select_dtypes(include=['float']).columns\n",
+    "\n",
+    "# Print the column names with 'float' data type\n",
+    "print(f\"List of numerical (float) variables: {numeric_columns}\")\n",
+    "\n",
+    "# Length of the list\n",
+    "print(f\"Length of the list of numerical (float) variables: {len(numeric_columns)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Only three variables in our data set are stored as floating-point numbers; the integer columns (`int64`) are not captured by this selection. Two of the three, `agent` and `company`, were originally categorical and were converted into numeric values by the anonymisation. `children` is a count that is presumably stored as a float only because it contains missing values."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv (3.12.11)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
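
The "Train Test Split" cell in the notebook above argues that imputation has to be learned on the training set only and then carried over to the test set. A minimal sketch of that idea in plain pandas follows. The column name `children` is borrowed from the notebook, but the toy values and the helper names `fit_median_imputer` and `apply_imputer` are hypothetical and only serve as an illustration; they are not part of the commit.

```python
import pandas as pd


def fit_median_imputer(train: pd.DataFrame, columns: list[str]) -> dict[str, float]:
    """Learn one median per column from the training data only."""
    return {col: float(train[col].median()) for col in columns}


def apply_imputer(data: pd.DataFrame, medians: dict[str, float]) -> pd.DataFrame:
    """Fill missing values with the medians learned on the training data."""
    return data.fillna(value=medians)


# Toy data, made up for illustration (not taken from hotel_bookings.csv)
train = pd.DataFrame({"children": [0.0, 1.0, None, 2.0, 0.0]})
test = pd.DataFrame({"children": [None, 3.0, 3.0]})

# "Fit" on the training set, then apply the same values to both sets
medians = fit_median_imputer(train, ["children"])
train_filled = apply_imputer(train, medians)
test_filled = apply_imputer(test, medians)

print(medians)
print(test_filled)
```

The missing test value is filled with the training median, not with a statistic computed from the test rows, so nothing from the test set influences the preprocessing. The `MeanMedianImputer` from `feature_engine` that the notebook imports follows the same fit-then-transform pattern and would likely be the more convenient choice in practice.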