cds-104/notebooks/Datenformate.ipynb
2026-06-05 09:18:57 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Datenformate\n",
"Author: Prof. Dr. Yves Staudt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In diesem Code besprechen wir die Datenformate. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Feature Engine\n",
"from feature_engine.imputation import (\n",
" AddMissingIndicator,\n",
" MeanMedianImputer,\n",
" CategoricalImputer,\n",
")\n",
"\n",
"from feature_engine.encoding import RareLabelEncoder, OneHotEncoder, CountFrequencyEncoder\n",
"from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures\n",
"from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection\n",
"from feature_engine import transformation as vt\n",
"from feature_engine.wrappers import SklearnTransformerWrapper\n",
"\n",
"# Scikit-Learn - Visualisation\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Visualisation\n",
"import matplotlib.pyplot as plt\n",
"import plotly.express as px"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Grösse des Datensatzes: (119390, 32)\n"
]
}
],
"source": [
"df = pd.read_csv(\"../Data/hotel_bookings.csv\")\n",
"print(f\"Grösse des Datensatzes: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Manuelle Daten löschen \n",
"Bestimmte Variablen wie das genaue Jahr, spezifische Tage oder Kalenderwochen bieten in einer Analyse nur begrenzten Mehrwert. Daher entfernen wir die Variablen *`arrival_date_year`*, *`arrival_date_week_number`* und *`arrival_date_day_of_month`*."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of data frame after remove of variables: (119390, 29)\n"
]
}
],
"source": [
"df = df.drop(columns=['arrival_date_year','arrival_date_week_number','arrival_date_day_of_month'])\n",
"print(f\"Shape of data frame after remove of variables: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing Variables\n",
"Zuerst schauen wir die Variablen und deren fehlende Werte an. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Liste der fehlenden Variablen: company 94.306893\n",
"agent 13.686238\n",
"country 0.408744\n",
"children 0.003350\n",
"dtype: float64\n",
"Anzahl Variablen mit fehldenden Werten: 4\n"
]
}
],
"source": [
"missing_values = df.isnull().mean() * 100 \n",
"missing_values = missing_values[missing_values > 0] \n",
"\n",
"missing_values = missing_values.sort_values(ascending=False)\n",
"\n",
"print(f\"Liste der fehlenden Variablen: {missing_values}\")\n",
"print(f\"Anzahl Variablen mit fehldenden Werten: {len(missing_values)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wir haben vier Variablen mit fehlenden Werten identifiziert. Eine dieser Variablen weist einen hohen Anteil fehlender Werte auf (94%), während die Variablen `country` und `children` nur einen geringen Anteil an fehlenden Werten haben. Für `country` und `children` könnte es sinnvoll sein, die betroffenen Zeilen einfach zu löschen. In diesem Bericht füllen wir jedoch die fehlenden Werte auf, um die Daten vollständig zu machen."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train Test Split \n",
"Bevor wir nun weiter die Daten aufbereiten, sollen die Daten in ein Train- und Test-Split aufgeteilt werden. Dieser Schritt muss zwingend vor dem Auffüllen gemacht werden. Wenn wir den gesamten Datensatz als Ganzes auffüllen, dann fliessen Informationen aus dem Test-Split unbeabsichtigt in den Train-Split ein (Data Leakage). Der Testdatensatz wäre dann nicht mehr \"ungesehen\". \n",
"Daher werden wir die Datenaufbereitungsmethoden, wie das Auffüllen von fehlenden Werten, ausschliesslich auf dem Trainingsdatensatz erlernt und später auf den Test-Split überführt. Dies spiegelt auch den realen Anwendungsfall wieder, da neue Daten mit der gleichen und bereits bekannten Methodik vorverarbeitet werden müssen. \n",
"\n",
"Wir bereiten hier den Datensatz mit allgmeinen Schritten auf, wobei der Preis die Zielvariable ist."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The shape of the data set with training varialbes is: (119390, 28)\n",
"The shape of the target variable is: (119390,)\n"
]
}
],
"source": [
"x = df.drop(columns = ['adr'])\n",
"y = df['adr']\n",
"print(f\"The shape of the data set with training varialbes is: {x.shape}\")\n",
"print(f\"The shape of the target variable is: {y.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nachdem die erklärenden Variablen von der Zielvariable getrennt wurden, können wir nun die Train und Test Trennung durchführen."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The shape of the training sample is: (83573, 28)\n",
"The shape of the test sample is: (35817, 28)\n"
]
}
],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(\n",
" x,\n",
" y,\n",
" test_size=0.3,\n",
" random_state=0)\n",
"\n",
"print(f\"The shape of the training sample is: {X_train.shape}\")\n",
"print(f\"The shape of the test sample is: {X_test.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Daten Formate\n",
"Wir haben gelernt, dass die Datenformate eine wichtige Rolle spielen, daher schauen wir uns die zuerst an. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Kategorielle Variablen"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"hotel object\n",
"is_canceled int64\n",
"lead_time int64\n",
"arrival_date_month object\n",
"stays_in_weekend_nights int64\n",
"stays_in_week_nights int64\n",
"adults int64\n",
"children float64\n",
"babies int64\n",
"meal object\n",
"country object\n",
"market_segment object\n",
"distribution_channel object\n",
"is_repeated_guest int64\n",
"previous_cancellations int64\n",
"previous_bookings_not_canceled int64\n",
"reserved_room_type object\n",
"assigned_room_type object\n",
"booking_changes int64\n",
"deposit_type object\n",
"agent float64\n",
"company float64\n",
"days_in_waiting_list int64\n",
"customer_type object\n",
"required_car_parking_spaces int64\n",
"total_of_special_requests int64\n",
"reservation_status object\n",
"reservation_status_date object\n",
"dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Liste von kategorischen Variablen: Index(['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',\n",
" 'distribution_channel', 'reserved_room_type', 'assigned_room_type',\n",
" 'deposit_type', 'customer_type', 'reservation_status',\n",
" 'reservation_status_date'],\n",
" dtype='object')\n",
"Lange der Liste von kategorischen Variablen: 12\n"
]
}
],
"source": [
"# Extract column names with data type 'object'\n",
"# Dies inkludiert alle Spalten, welche mehrheitlich nicht durch numerische Werte besetzt werden. \n",
"object_columns = X_train.select_dtypes(include=['object']).columns\n",
"\n",
"# Print the column names with 'object' data type\n",
"print(f\"Liste von kategorischen Variablen: {object_columns}\")\n",
"\n",
"# Länge der Liste\n",
"print(f\"Lange der Liste von kategorischen Variablen: {len(object_columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In einem weiteren Schritt schauen wir die Kardinalität der kategorischen Variablen an. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"hotel: 2\n",
"arrival_date_month: 12\n",
"meal: 5\n",
"country: 162\n",
"market_segment: 8\n",
"distribution_channel: 5\n",
"reserved_room_type: 10\n",
"assigned_room_type: 11\n",
"deposit_type: 3\n",
"customer_type: 4\n",
"reservation_status: 3\n",
"reservation_status_date: 916\n"
]
}
],
"source": [
"for col in X_train.select_dtypes(include=['object']):\n",
" cardinality = len(pd.Index(X_train[col].value_counts()))\n",
" print(X_train[col].name + \": \" + str(cardinality))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Die Variable `hotel` ist binär und sollte immer mithilfe des k1-One-Hot-Encoding umgewandelt werden. Die Variablen `country` und `reservation_status_date` weisen im Vergleich zu den anderen Variablen eine hohe Kardinalität auf. `reservation_status_date` ist eine Datumsvariable, die in ihrer ursprünglichen Form nicht direkt genutzt werden kann. Da Transformationen dieser Variable bereits umgesetzt wurden, wird sie in diesem Schritt entfernt."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Shape of the data set after remove of unnecessary variables: (83573, 27)\n",
"\n",
"Shape of the data set after remove of unnecessary variables: (35817, 27)\n"
]
}
],
"source": [
"X_train = X_train.drop(columns=['reservation_status_date']) \n",
"print(f\"\\nShape of the data set after remove of unnecessary variables: {X_train.shape}\")\n",
"X_test = X_test.drop(columns=['reservation_status_date']) \n",
"print(f\"\\nShape of the data set after remove of unnecessary variables: {X_test.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Numerische Variablen"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Liste von numerischen Variablen: Index(['children', 'agent', 'company'], dtype='object')\n",
"Lange der Liste von numerischen Variablen: 3\n"
]
}
],
"source": [
"# Extract column names with data type 'float'\n",
"numeric_columns = X_train.select_dtypes(include= ['float']).columns\n",
"\n",
"# Print the column names with 'float' data type\n",
"print(f\"Liste von numerischen Variablen: {numeric_columns}\")\n",
"\n",
"# Länge der Liste\n",
"print(f\"Lange der Liste von numerischen Variablen: {len(numeric_columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Es gibt lediglich drei numerische Variablen in unserem Datensatz. Dabei handelt es sich unter anderem um die Variablen `agent` und `company`, die ursprünglich kategorisch waren, aber durch die Anonymisierung in numerische Werte umgewandelt wurden."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv (3.12.11)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}