diverses
parent
9f14e6c888
commit
89a8543159
434
notebooks/Datenformate.ipynb
Normal file
@@ -0,0 +1,434 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Formats\n",
    "Author: Prof. Dr. Yves Staudt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we discuss data formats."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Feature Engine\n",
    "from feature_engine.imputation import (\n",
    "    AddMissingIndicator,\n",
    "    MeanMedianImputer,\n",
    "    CategoricalImputer,\n",
    ")\n",
    "\n",
    "from feature_engine.encoding import RareLabelEncoder, OneHotEncoder, CountFrequencyEncoder\n",
    "from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures\n",
    "from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection\n",
    "from feature_engine import transformation as vt\n",
    "from feature_engine.wrappers import SklearnTransformerWrapper\n",
    "\n",
    "# Scikit-Learn\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Visualisation\n",
    "import matplotlib.pyplot as plt\n",
    "import plotly.express as px"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Size of the dataset: (119390, 32)\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv(\"../Data/hotel_bookings.csv\")\n",
    "print(f\"Size of the dataset: {df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dropping Variables Manually\n",
    "Certain variables, such as the exact year, specific days, or calendar weeks, add only limited value to an analysis. We therefore remove the variables *`arrival_date_year`*, *`arrival_date_week_number`* and *`arrival_date_day_of_month`*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape of the data frame after removing the variables: (119390, 29)\n"
     ]
    }
   ],
   "source": [
    "df = df.drop(columns=['arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month'])\n",
    "print(f\"Shape of the data frame after removing the variables: {df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing Values\n",
    "First, we look at the variables and their missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Share of missing values per variable (%): company 94.306893\n",
      "agent 13.686238\n",
      "country 0.408744\n",
      "children 0.003350\n",
      "dtype: float64\n",
      "Number of variables with missing values: 4\n"
     ]
    }
   ],
   "source": [
    "# Share of missing values per column, in percent\n",
    "missing_values = df.isnull().mean() * 100\n",
    "# Keep only the columns that actually have missing values\n",
    "missing_values = missing_values[missing_values > 0]\n",
    "\n",
    "missing_values = missing_values.sort_values(ascending=False)\n",
    "\n",
    "print(f\"Share of missing values per variable (%): {missing_values}\")\n",
    "print(f\"Number of variables with missing values: {len(missing_values)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We identified four variables with missing values. `company` has a very high share of missing values (about 94%) and `agent` a moderate one (about 14%), whereas `country` and `children` are missing in well under 1% of the rows. For `country` and `children` it could make sense to simply delete the affected rows. In this report, however, we fill in the missing values so that the data set is complete."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train Test Split\n",
    "Before we prepare the data any further, we split it into a training and a test set. This step must happen before imputation. If we impute the data set as a whole, information from the test split unintentionally flows into the training split (data leakage), and the test set is no longer \"unseen\".\n",
    "We therefore learn the preprocessing steps, such as the fill values for missing data, exclusively on the training set and then apply them to the test split. This also mirrors the real use case, because new data has to be preprocessed with the same, already fitted procedure.\n",
    "\n",
    "Here we prepare the data set with general steps; the price (`adr`, the average daily rate) is the target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of the data set with the explanatory variables is: (119390, 28)\n",
      "The shape of the target variable is: (119390,)\n"
     ]
    }
   ],
   "source": [
    "x = df.drop(columns=['adr'])\n",
    "y = df['adr']\n",
    "print(f\"The shape of the data set with the explanatory variables is: {x.shape}\")\n",
    "print(f\"The shape of the target variable is: {y.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the explanatory variables have been separated from the target variable, we can perform the train/test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of the training sample is: (83573, 28)\n",
      "The shape of the test sample is: (35817, 28)\n"
     ]
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    x,\n",
    "    y,\n",
    "    test_size=0.3,\n",
    "    random_state=0)\n",
    "\n",
    "print(f\"The shape of the training sample is: {X_train.shape}\")\n",
    "print(f\"The shape of the test sample is: {X_test.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Formats\n",
    "We have learned that data formats play an important role, so we look at them first."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Categorical Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "hotel object\n",
       "is_canceled int64\n",
       "lead_time int64\n",
       "arrival_date_month object\n",
       "stays_in_weekend_nights int64\n",
       "stays_in_week_nights int64\n",
       "adults int64\n",
       "children float64\n",
       "babies int64\n",
       "meal object\n",
       "country object\n",
       "market_segment object\n",
       "distribution_channel object\n",
       "is_repeated_guest int64\n",
       "previous_cancellations int64\n",
       "previous_bookings_not_canceled int64\n",
       "reserved_room_type object\n",
       "assigned_room_type object\n",
       "booking_changes int64\n",
       "deposit_type object\n",
       "agent float64\n",
       "company float64\n",
       "days_in_waiting_list int64\n",
       "customer_type object\n",
       "required_car_parking_spaces int64\n",
       "total_of_special_requests int64\n",
       "reservation_status object\n",
       "reservation_status_date object\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "List of categorical variables: Index(['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',\n",
      " 'distribution_channel', 'reserved_room_type', 'assigned_room_type',\n",
      " 'deposit_type', 'customer_type', 'reservation_status',\n",
      " 'reservation_status_date'],\n",
      " dtype='object')\n",
      "Number of categorical variables: 12\n"
     ]
    }
   ],
   "source": [
    "# Extract column names with data type 'object'\n",
    "# pandas assigns 'object' to columns that hold non-numeric values, typically strings.\n",
    "object_columns = X_train.select_dtypes(include=['object']).columns\n",
    "\n",
    "# Print the column names with 'object' data type\n",
    "print(f\"List of categorical variables: {object_columns}\")\n",
    "\n",
    "# Length of the list\n",
    "print(f\"Number of categorical variables: {len(object_columns)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a further step, we look at the cardinality of the categorical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hotel: 2\n",
      "arrival_date_month: 12\n",
      "meal: 5\n",
      "country: 162\n",
      "market_segment: 8\n",
      "distribution_channel: 5\n",
      "reserved_room_type: 10\n",
      "assigned_room_type: 11\n",
      "deposit_type: 3\n",
      "customer_type: 4\n",
      "reservation_status: 3\n",
      "reservation_status_date: 916\n"
     ]
    }
   ],
   "source": [
    "# nunique() counts the distinct non-missing values of each column\n",
    "for col in X_train.select_dtypes(include=['object']):\n",
    "    cardinality = X_train[col].nunique()\n",
    "    print(f\"{col}: {cardinality}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variable `hotel` is binary and should always be encoded with k−1 one-hot encoding, which for two categories yields a single 0/1 column. The variables `country` (162 distinct values) and `reservation_status_date` (916 distinct values) have a high cardinality compared with the other variables. `reservation_status_date` is a date variable that cannot be used directly in its original form. Because transformations of this variable have already been implemented, it is removed in this step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Shape of the training set after removing unnecessary variables: (83573, 27)\n",
      "\n",
      "Shape of the test set after removing unnecessary variables: (35817, 27)\n"
     ]
    }
   ],
   "source": [
    "X_train = X_train.drop(columns=['reservation_status_date'])\n",
    "print(f\"\\nShape of the training set after removing unnecessary variables: {X_train.shape}\")\n",
    "X_test = X_test.drop(columns=['reservation_status_date'])\n",
    "print(f\"\\nShape of the test set after removing unnecessary variables: {X_test.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Numerical Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "List of float variables: Index(['children', 'agent', 'company'], dtype='object')\n",
      "Number of float variables: 3\n"
     ]
    }
   ],
   "source": [
    "# Extract column names with data type 'float'\n",
    "numeric_columns = X_train.select_dtypes(include=['float']).columns\n",
    "\n",
    "# Print the column names with 'float' data type\n",
    "print(f\"List of float variables: {numeric_columns}\")\n",
    "\n",
    "# Length of the list\n",
    "print(f\"Number of float variables: {len(numeric_columns)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only three variables in our data set are stored as floats. The integer (`int64`) columns listed in the dtype overview above are numerical as well, but this filter does not select them. Two of the float variables, `agent` and `company`, were originally categorical and were converted to numerical values through anonymization. `children` is presumably stored as a float only because it contains missing values."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv (3.12.11)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
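
The notebook's "Train Test Split" cell argues that fill values must be learned on the training split only and then reused on the test split. A minimal pandas sketch of that idea follows. The toy table, the fixed row split and the `fill_values` dictionary are assumptions made for illustration and are not part of this commit; the notebook imports `feature_engine` imputers, which would presumably take over this role in later cells.

```python
import pandas as pd

# Toy data standing in for the hotel bookings table (values are made up).
toy = pd.DataFrame({
    "children": [0.0, 1.0, None, 2.0, 0.0, 1.0, None, 3.0],
    "country": ["PRT", "GBR", None, "PRT", "FRA", "PRT", None, "ESP"],
})

# Hypothetical fixed split: first six rows for training, last two for testing.
train = toy.iloc[:6].copy()
test = toy.iloc[6:].copy()

# Learn the fill values from the training rows only:
# the median for the numeric column, the most frequent value for the categorical one.
fill_values = {
    "children": train["children"].median(),
    "country": train["country"].mode().iloc[0],
}

# Apply the same, already learned values to both splits.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Because `fill_values` never sees the test rows, the test split stays "unseen" in the sense the notebook describes; new data would be filled with the same dictionary.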