Remove one level of headers

main
Aurélien Geron 2016-03-03 18:40:31 +01:00
parent e8d45964b8
commit 8370cafbb7
2 changed files with 314 additions and 283 deletions

File diff suppressed because it is too large

@ -4,10 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tools - pandas\n",
"**Tools - pandas**\n",
"\n",
"*The `pandas` library provides high-performance, easy-to-use data structures and data analysis tools. The main data structure is the `DataFrame`, which you can think of as an in-memory 2D table (like a spreadsheet, with column names and row labels). Many features available in Excel are available programmatically, such as creating pivot tables, computing columns based on other columns, plotting graphs, etc. You can also group rows by column value, or join tables much like in SQL. Pandas is also great at handling time series.*\n",
"\n",
"**Prerequisites:**\n",
"Prerequisites:\n",
"* NumPy: if you are not familiar with NumPy, we recommend that you go through the [NumPy tutorial](tools_numpy.ipynb) now."
]
},
@ -15,7 +16,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"# Setup\n",
"First, let's make sure this notebook works well in both python 2 and 3:"
]
},
@ -54,7 +55,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## `Series` objects\n",
"# `Series` objects\n",
"The `pandas` library contains these useful data structures:\n",
"* `Series` objects, which we will discuss now. A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).\n",
"* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).\n",
@ -65,7 +66,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a `Series`\n",
"## Creating a `Series`\n",
"Let's start by creating our first `Series` object!"
]
},
@ -85,7 +86,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Similar to a 1D `ndarray`\n",
"## Similar to a 1D `ndarray`\n",
"`Series` objects behave much like one-dimensional NumPy `ndarray`s, and you can often pass them as parameters to NumPy functions:"
]
},
@ -159,7 +160,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Index labels\n",
"## Index labels\n",
"Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the rank of the item in the `Series` (starting at `0`) but you can also set the index labels manually:"
]
},
@ -332,7 +333,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Init from `dict`\n",
"## Init from `dict`\n",
"You can create a `Series` object from a `dict`. The keys will be used as index labels:"
]
},
@ -372,7 +373,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatic alignment\n",
"## Automatic alignment\n",
"When an operation involves multiple `Series` objects, `pandas` automatically aligns items by matching index labels."
]
},
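Note on the alignment cell in the hunk above: a minimal sketch of what "aligns items by matching index labels" means in practice. The labels and values are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Two Series whose index labels only partly overlap (illustrative data).
s1 = pd.Series([1, 2, 3], index=["alice", "bob", "charles"])
s2 = pd.Series([10, 20, 30], index=["bob", "charles", "darwin"])

# Addition matches items by label, not by position.
# Labels present on only one side produce NaN in the result.
total = s1 + s2
print(total)
```

Only `bob` and `charles` exist in both operands, so only those two labels get a numeric sum.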
@ -425,7 +426,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Init with a scalar\n",
"## Init with a scalar\n",
"You can also initialize a `Series` object using a scalar and a list of index labels: all items will be set to the scalar."
]
},
@ -445,7 +446,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### `Series` name\n",
"## `Series` name\n",
"A `Series` can have a `name`:"
]
},
@ -465,7 +466,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting a `Series`\n",
"## Plotting a `Series`\n",
"Pandas makes it easy to plot `Series` data using matplotlib (for more details on matplotlib, check out the [matplotlib tutorial](tools_matplotlib.ipynb)). Just import matplotlib and call the `plot` method:"
]
},
@ -497,14 +498,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time series\n",
"# Handling time\n",
"Many datasets have timestamps, and pandas is awesome at manipulating such data:\n",
"* it can represent periods (such as 2016Q3) and frequencies (such as \"monthly\"),\n",
"* it can convert periods to actual timestamps, and *vice versa*,\n",
"* it can resample data and aggregate values any way you like,\n",
"* it can handle timezones.\n",
"\n",
"### Time range\n",
"## Time range\n",
"Let's start by creating a time series using `pd.date_range`. This returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm."
]
},
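Note on the time-range cell in the hunk above: the description (one datetime per hour, 12 hours, starting October 29th 2016 at 5:30pm) matches what `pd.date_range` produces. The sketch below assumes that is the function intended. The frequency is passed as a `pd.Timedelta` rather than a string alias, since the spelling of the hourly alias has changed between pandas versions.

```python
import pandas as pd

# 12 hourly timestamps starting on 2016-10-29 at 17:30.
dates = pd.date_range(
    "2016-10-29 17:30",
    periods=12,
    freq=pd.Timedelta(hours=1),  # one-hour step, independent of alias spelling
)
print(type(dates).__name__)
print(dates[0], dates[-1])
```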
@ -564,7 +565,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Resampling\n",
"## Resampling\n",
"Pandas lets us resample a time series very simply. Just call the `resample` method and specify a new frequency:"
]
},
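Note on the resampling cell in the hunk above: a small sketch of downsampling an hourly series to two-hour bins. The series values are placeholders, and the aggregation (`mean`) is an assumption; the notebook may aggregate differently.

```python
import pandas as pd

# Hourly series with placeholder values 0..11 (illustrative only).
index = pd.date_range("2016-10-29 17:30", periods=12, freq=pd.Timedelta(hours=1))
temps = pd.Series(range(12), index=index, dtype="float64")

# resample() groups the rows into two-hour bins; an aggregation such as
# mean() then reduces each bin to a single value.
two_hourly = temps.resample(pd.Timedelta(hours=2)).mean()
print(two_hourly)
```

The bins are aligned on even hours, so the first bin (16:00 to 18:00) holds only the 17:30 sample.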
@ -622,7 +623,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upsampling and interpolation\n",
"## Upsampling and interpolation\n",
"This was an example of downsampling. We can also upsample (i.e., increase the frequency), but this creates holes in our data:"
]
},
@ -676,7 +677,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Timezones\n",
"## Timezones\n",
"By default datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize` method:"
]
},
@ -776,7 +777,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Periods\n",
"## Periods\n",
"The `period_range` function returns a `PeriodIndex` instead of a `DatetimeIndex`. For example, let's get all quarters in 2016 and 2017:"
]
},
@ -957,10 +958,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## `DataFrame` objects\n",
"# `DataFrame` objects\n",
"A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. You can define expressions to compute columns based on other columns, create pivot-tables, group rows, draw graphs, etc. You can see `DataFrame`s as dictionaries of `Series`.\n",
"\n",
"### Creating a `DataFrame`\n",
"## Creating a `DataFrame`\n",
"You can create a DataFrame by passing a dictionary of `Series` objects:"
]
},
@ -1156,7 +1157,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multi-indexing\n",
"## Multi-indexing\n",
"If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:"
]
},
@ -1216,7 +1217,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dropping a level\n",
"## Dropping a level\n",
"Let's look at `d5` again:"
]
},
@ -1254,7 +1255,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transposing\n",
"## Transposing\n",
"You can swap columns and indices using the `T` attribute:"
]
},
@ -1274,7 +1275,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stacking and unstacking levels\n",
"## Stacking and unstacking levels\n",
"Calling the `stack` method will push the lowest column level after the lowest index:"
]
},
@ -1354,7 +1355,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Most methods return modified copies\n",
"## Most methods return modified copies\n",
"As you may have noticed, the `stack` and `unstack` methods do not modify the object they apply to. Instead, they work on a copy and return that copy. This is true of most methods in pandas."
]
},
@ -1362,7 +1363,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Accessing rows\n",
"## Accessing rows\n",
"Let's go back to the `people` `DataFrame`:"
]
},
@ -1471,7 +1472,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding and removing columns\n",
"## Adding and removing columns\n",
"You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following work fine:"
]
},
@ -1555,7 +1556,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assigning new columns\n",
"## Assigning new columns\n",
"You can also create new columns by calling the `assign` method. Note that this returns a new `DataFrame` object, the original is not modified:"
]
},
@ -1672,7 +1673,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluating an expression\n",
"## Evaluating an expression\n",
"A great feature supported by pandas is expression evaluation. This relies on the `numexpr` library, which must be installed."
]
},
@ -1730,7 +1731,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying a `DataFrame`\n",
"## Querying a `DataFrame`\n",
"The `query` method lets you filter a `DataFrame` based on a query expression:"
]
},
@ -1749,7 +1750,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sorting a `DataFrame`\n",
"## Sorting a `DataFrame`\n",
"You can sort a `DataFrame` by calling its `sort_index` method. By default it sorts the rows by their index label, in ascending order, but let's reverse the order:"
]
},
@ -1806,7 +1807,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting a `DataFrame`\n",
"## Plotting a `DataFrame`\n",
"Just like for `Series`, pandas makes it easy to draw nice graphs based on a `DataFrame`.\n",
"\n",
"For example, it is trivial to create a line plot from a `DataFrame`'s data by calling its `plot` method:"
@ -1855,7 +1856,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Operations on `DataFrame`s\n",
"## Operations on `DataFrame`s\n",
"Although `DataFrame`s do not try to mimic NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:"
]
},
@ -2058,7 +2059,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatic alignment\n",
"## Automatic alignment\n",
"Similar to `Series`, when operating on multiple `DataFrame`s, pandas automatically aligns them by row index label, but also by column names. Let's create a `DataFrame` with bonus points for each person from October to December:"
]
},
@ -2093,7 +2094,7 @@
"source": [
"Looks like the addition worked in some cases but way too many elements are now empty. That's because when aligning the `DataFrame`s, some columns and rows were only present on one side, and thus they were considered missing on the other side (`NaN`). Then adding `NaN` to a number results in `NaN`, hence the result.\n",
"\n",
"### Handling missing data\n",
"## Handling missing data\n",
"Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.\n",
" \n",
"Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values with any value using the `fillna` method:"
@ -2274,7 +2275,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aggregating with `groupby`\n",
"## Aggregating with `groupby`\n",
"Similar to the SQL language, pandas lets you split your data into groups and run calculations over each group.\n",
"\n",
"First, let's add some extra data about each person so we can group them, and let's go back to the `final_grades` `DataFrame` so we can see how `NaN` values are handled:"
@ -2551,7 +2552,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving & loading\n",
"# Saving & loading\n",
"Pandas can save `DataFrame`s to various backends, including file formats such as CSV, Excel, JSON, HTML and HDF5, or to a SQL database. Let's create a `DataFrame` to demonstrate this:"
]
},
@ -2575,7 +2576,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving\n",
"## Saving\n",
"Let's save it to CSV, HTML and JSON:"
]
},
@ -2641,7 +2642,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading\n",
"## Loading\n",
"Now let's load our CSV file back into a `DataFrame`:"
]
},
@ -2693,9 +2694,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combining `DataFrame`s\n",
"# Combining `DataFrame`s\n",
"\n",
"### SQL-like joins\n",
"## SQL-like joins\n",
"One powerful feature of pandas is its ability to perform SQL-like joins on `DataFrame`s. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple of simple `DataFrame`s:"
]
},
@ -2817,7 +2818,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Concatenation\n",
"## Concatenation\n",
"Rather than joining `DataFrame`s, we may just want to concatenate them. That's what `concat` is for:"
]
},
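Note on the concatenation cell in the hunk above: a minimal sketch of `pd.concat` on two small frames. The frames and column names are hypothetical and stand in for the notebook's own data.

```python
import pandas as pd

# Two frames that share only column "b" (illustrative data).
top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Default: rows are stacked, columns are the union of both frames,
# and cells with no source value become NaN. Row labels are kept as-is.
stacked = pd.concat([top, bottom])

# join="inner" keeps only the shared columns;
# ignore_index=True renumbers the rows from 0.
shared = pd.concat([top, bottom], join="inner", ignore_index=True)

print(stacked)
print(shared)
```

Unlike a join, concatenation does not match rows on key values; it only lines up the column names.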
@ -2961,7 +2962,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categories\n",
"# Categories\n",
"It is quite frequent to have values that represent categories, for example `1` for female and `2` for male, or `\"A\"` for Good, `\"B\"` for Average, `\"C\"` for Bad. These categorical values can be hard to read and cumbersome to handle, but fortunately pandas makes it easy. To illustrate this, let's take the `city_pop` `DataFrame` we created earlier, and add a column that represents a category:"
]
},
@ -3062,6 +3063,13 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
},
"toc": {
"toc_cell": false,
"toc_number_sections": true,
"toc_section_display": "none",
"toc_threshold": 6,
"toc_window_display": true
}
},
"nbformat": 4,