"*The `pandas` library provides high-performance, easy-to-use data structures and data analysis tools. The main data structure is the `DataFrame`, which you can think of as a spreadsheet (including column names and row labels).*\n",
"* NumPy – if you are not familiar with NumPy, we recommend that you go through the [NumPy tutorial](tools_numpy.ipynb) now.\n",
"\n",
"## Setup\n",
"First, let's make sure this notebook works well in both python 2 and 3:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from __future__ import division\n",
"from __future__ import print_function\n",
"from __future__ import unicode_literals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's import `pandas`. People usually import it as `pd`:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `Series` objects\n",
"The `pandas` library contains these useful data structures:\n",
"* `Series` objects, that we will discuss now. A `Series` object is similar to a column in a spreadsheet (with a column name and row labels).\n",
"* `DataFrame` objects. You can see this as a full spreadsheet (with column names and row labels).\n",
"* `Panel` objects. You can see a `Panel` a a dictionary of `DataFrame`s (less used). These are less used, so we will not discuss them here.\n",
"\n",
"### Creating a `Series`\n",
"Let's start by creating our first `Series` object!"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series([2,-1,3,5])\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Similar to a 1D `ndarray`\n",
"`Series` objects behave much like one-dimensional NumPy `ndarray`s, and you can often pass them as parameters to NumPy functions:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"np.exp(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arithmetic operations on `Series` are also possible, and they apply *elementwise*, just like for `ndarray`s:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s + pd.Series([1000,2000,3000,4000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to NumPy, if you add a single number to a `Series`, that number is added to all items in the `Series`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s + 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same is true for all binary operations such as `*` or `/`, and even conditional operations:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s < 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Index labels\n",
"Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the index of the item in the `Series` but you can also set the index labels manually:"
"You can control which elements you want to include in the `Series` and in what order by passing a second argument to the constructor with the list of desired index labels:"
"When an operation involves multiple `Series` objects, `pandas` automatically aligns items by matching index labels."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(s2.keys())\n",
"print(s3.keys())\n",
"s2 + s3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value (ie. Not-a-Number means *missing*).\n",
"\n",
"Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items. But if you forget to set the right index labels, you can have surprising results:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s5 = pd.Series([1000,1000,1000,1000])\n",
"print(\"s2 =\", s2.values)\n",
"print(\"s5 =\", s5.values)\n",
"print(\"s2 + s5 =\")\n",
"s2 + s5"
]
},
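{
"cell_type": "markdown",
"metadata": {},
"source": [
"To recap index labels and alignment, here is a minimal sketch. The names `left` and `right` and all the values are illustrative assumptions, not data from this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data: explicit index labels passed through the index argument\n",
"left = pd.Series([68, 83, 112], index=[\"alice\", \"bob\", \"charles\"])\n",
"\n",
"# Built from a dict, keeping only the listed labels, in the listed order\n",
"right = pd.Series({\"alice\": 1, \"bob\": 2, \"colin\": 3}, index=[\"colin\", \"alice\"])\n",
"\n",
"# Items are aligned by label; a label present on one side only gives NaN\n",
"total = left + right\n",
"total"
]
},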
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Init with a scalar\n",
"You can also initialize a `Series` object using a scalar and a list of index labels: all items will be set to the scalar."
"Pandas makes it easy to plot `Series` data using matplotlib (for more details on matplotlib, check out the [matplotlib tutorial](tools_matplotlib.ipynb)). Just import matplotlib and call the `plot` method:"
"There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) section of pandas' documentation, and look at the example code."
]
},
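{
"cell_type": "markdown",
"metadata": {},
"source": [
"Going back to scalar initialization, a minimal sketch could look like this (the variable name and the labels are made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical labels: every item is set to the scalar 42\n",
"constant = pd.Series(42, index=[\"life\", \"universe\", \"everything\"])\n",
"constant"
]
},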
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `DataFrame` objects\n",
"A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. You can think of them as dictionaries of `Series` objects.\n",
"\n",
"### Creating a `DataFrame`\n",
"You can create a DataFrame by passing a dictionary of `Series` objects:"
"Note that DataFrames are displayed nicely in Jupyter notebooks! Also, note that `Series` names are ignored (`\"year\"` was dropped)."
]
},
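{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of this construction (all names and values below are assumptions made for illustration), each `Series` becomes a column, rows are aligned by index label, and the `Series` name is not used as a column name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data: the dict keys become the column names\n",
"demo_dict = {\n",
"    \"weight\": pd.Series([68, 83, 112], index=[\"alice\", \"bob\", \"charles\"]),\n",
"    # The index order differs here, and the Series has its own name\n",
"    \"birthyear\": pd.Series([1984, 1985, 1992], index=[\"bob\", \"alice\", \"charles\"], name=\"year\"),\n",
"}\n",
"demo = pd.DataFrame(demo_dict)\n",
"demo"
]
},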
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access columns pretty much as you would expect. They are returned as `Series` objects:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people[\"birthyear\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you pass a list of columns and/or index row labels to the `DataFrame` constructor, it will guarantee that these columns and/or rows will exist, in that order, and no other column/row will exist. For example:"
"Another convenient way to create a `DataFrame` is to pass all the values to the constructor as an `ndarray`, and specify the column names and row index labels separately:"
"You can now get a `DataFrame` containing all the `\"public\"` columns very simply:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"d5[\"public\"]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"d5[\"public\", \"hobby\"] # Same result as d4[\"public\"][\"hobby\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Accessing rows\n",
"Let's go back to the `people` `DataFrame`:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `loc` attribute lets you access rows instead of columns. The result is `Series` object in which the `DataFrame`'s column names are mapped to row index labels:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.loc[\"charles\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also access rows by location using the `iloc` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.iloc[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get a slice of rows, and this returns a `DataFrame` object:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.iloc[1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, you can pass a boolean array to get the matching rows:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people[np.array([True, False, True])]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is most useful when combined with boolean expressions:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people[people[\"birthyear\"] < 1990]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding and removing columns\n",
"You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following work fine:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people[\"age\"] = 2016 - people[\"birthyear\"] # adds a new column \"age\"\n",
"Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assigment calls, but it does not work because the `people` object is not actually modified by the first assignment:"
"But fear not, there is a simple solution. You can pass a function to the `assign` method (typically a `lambda` function), and this function will be called with the `DataFrame` as a parameter:"
"Assignment expressions are also supported, and contrary to the `assign` method, this does not create a copy of the `DataFrame`, instead it directly modifies it:"
"The `query` method lets you filter a `DataFrame` based on a query expression:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.query(\"age > 30 and pets == 0\")"
]
},
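{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `assign` and `query` steps described above can be sketched as follows. The `demo` `DataFrame` and its values are hypothetical. Each `lambda` receives the `DataFrame` produced by the previous step, so the second call can use the `age` column created by the first one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data, for illustration only\n",
"demo = pd.DataFrame(\n",
"    {\"birthyear\": [1985, 1984, 1992], \"pets\": [0, 5, 0]},\n",
"    index=[\"alice\", \"bob\", \"charles\"],\n",
")\n",
"\n",
"# assign returns a new DataFrame each time; demo itself is left unchanged\n",
"result = (\n",
"    demo.assign(age=lambda df: 2016 - df[\"birthyear\"])\n",
"        .assign(over_30=lambda df: df[\"age\"] > 30)\n",
")\n",
"\n",
"# query filters the rows with a string expression\n",
"filtered = result.query(\"age > 30 and pets == 0\")\n",
"filtered"
]
},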
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sorting a `DataFrame`\n",
"You can sort a `DataFrame` by calling its `sort_index` method. By default it sorts the rows by their index label, in ascending order, but let's reverse the order:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.sort_index(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `sort_index` returned a sorted *copy* of the `DataFrame`. To modify `people` directly, we can set the `inplace` argument to `True`. Also, we can sort the columns instead of the rows by setting `axis=1`:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.sort_index(axis=1, inplace=True)\n",
"people"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To sort the `DataFrame` by the values instead of the labels, we can use `sort_values` and specify the column to sort by:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.sort_values(by=\"age\", inplace=True)\n",
"people"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting a `DataFrame`\n",
"Just like for `Series`, pandas makes it easy to draw nice graphs based on a `DataFrame`.\n",
"\n",
"For example, it is trivial to create a line plot from a `DataFrame`'s data by calling its `plot` method:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"people.plot(kind = \"line\", x = \"body_mass_index\", y = [\"height\", \"weight\"])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass extra arguments supported by matplotlib's functions. For example, we can create scatterplot and pass it a list of sizes using the `s` argument of matplotlib's `scatter` function:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"people.plot(kind = \"scatter\", x = \"height\", y = \"weight\", s=[40, 120, 200])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, there are way too many options to list here: the best option is to scroll through the [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Operations on `DataFrame`s\n",
"Although `DataFrame`s do not try to mimick NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:"
"You can apply NumPy mathematical functions on a `DataFrame`: the function is applied to all values:"
]
},
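{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells below rely on a `grades` `DataFrame`. As a sketch, the values in the following cell are an assumption, chosen so that the column means match the ones quoted further down (`7.75`, `8.75` and `7.50`, with a global mean of `8.00`). The row labels are assumed as well:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Assumed values: one row per student, one column per month\n",
"grades = pd.DataFrame(\n",
"    np.array([[8, 8, 9], [10, 9, 9], [4, 8, 2], [9, 10, 10]]),\n",
"    columns=[\"sep\", \"oct\", \"nov\"],\n",
"    index=[\"alice\", \"bob\", \"charles\", \"darwin\"],\n",
")\n",
"grades"
]
},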
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.sqrt(grades)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, adding a single value to a `DataFrame` will add that value to all elements in the `DataFrame`. This is called *broadcasting*:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"grades + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, the same is true for all other binary operations, including arithmetic (`*`,`/`,`**`...) and conditional (`>`, `==`...) operations:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"grades >= 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aggregation operations, such as computing the `max`, the `sum` or the `mean` of a `DataFrame`, apply to each column, and you get back a `Series` object:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"grades.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `all` method is also an aggregation operation: it checks whether all values are `True` or not. Let's see during which months all students got a grade greater than `5`:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"(grades > 5).all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of these functions take an optional `axis` parameter which lets you specify along which axis of the `DataFrame` you want the operation executed. The default is `axis=0`, meaning that the operation is executed vertically (on each column). You can set `axis=1` to execute the operation horizontally (on each row). For example, let's find out which students had all grades greater than `5`:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"(grades > 5).all(axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `any` method returns `True` if any value is True. Let's see who got at least one grade 10:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"(grades == 10).any(axis = 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you add a `Series` object to a `DataFrame` (or execute any other binary operation), pandas attempts to broadcast the operation to all *rows* in the `DataFrame`. This only works if the `Series` has the same size as the `DataFrame`s rows. For example, let's substract the `mean` of the `DataFrame` (a `Series` object) from the `DataFrame`:"
"We substracted `7.75` from all September grades, `8.75` from October grades and `7.50` from November grades. It is equivalent to substracting this `DataFrame`:"
"If you want to substract the global mean from every grade, here is one way to do it:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"grades - grades.values.mean() # substracts the global mean (8.00) from all grades"
]
},
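{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the difference between the two subtractions concrete, here is a small sketch on a hypothetical `demo` `DataFrame` (its values are made up). Subtracting the `Series` of column means removes a different value from each column, while subtracting the scalar global mean removes the same value everywhere:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data, for illustration only\n",
"demo = pd.DataFrame({\"sep\": [8, 10], \"oct\": [8, 9]}, index=[\"alice\", \"bob\"])\n",
"\n",
"# A Series indexed by column name: sep -> 9.0, oct -> 8.5\n",
"column_means = demo.mean()\n",
"\n",
"# The Series is matched to the columns and applied to every row\n",
"centered = demo - column_means\n",
"\n",
"# A scalar is simply broadcast to every element\n",
"global_centered = demo - demo.values.mean()\n",
"centered"
]
},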
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automatic alignment\n",
"Similar to `Series`, when operating on multiple `DataFrame`s, pandas automatically aligns them by row index label, but also by column names. Let's create a `DataFrame` with bonus points for each person from October to December:"
"Looks like the addition worked in some cases but way too many elements are now empty. That's because when aligning the `DataFrame`s, some columns and rows were only present on one side, and thus they were considered missing on the other side (`NaN`). Then adding `NaN` to a number results in `NaN`, hence the result.\n",
"\n",
"### Handling missing data\n",
"Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.\n",
" \n",
"Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values by a any value using the `fillna` method:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"(grades + bonus_points).fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's a bit unfair that we're setting grades to zero in September, though. Perhaps we should decide that missing grades are missing grades, but missing bonus points should be replaced by zeros:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fixed_bonus_points = bonus_points.fillna(0)\n",
"fixed_bonus_points.insert(0, \"sep\", 0)\n",
"fixed_bonus_points.loc[\"alice\"] = 0\n",
"grades + fixed_bonus_points"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's much better: although we made up some data, we have not been too unfair.\n",
"\n",
"Another way to handle missing data is to interpolate. Let's look at the `bonus_points` `DataFrame` again:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"bonus_points"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's call the `interpolate` method. By default, it interpolates vertically (`axis=0`), so let's tell it to interpolate horizontally (`axis=1`)."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"bonus_points.interpolate(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bob had 0 bonus points in October, and 2 in December. When we interpolate for November, we get the mean: 1 bonus point. Colin had 1 bonus point in November, but we do not know how many bonus points he had in September, so we cannot interpolate, this is why there is still a missing value in October after interpolation. To fix this, we can set the September bonus points to 0 before interpolation."
"Great, now we have reasonable bonus points everywhere. Let's find out the final grades:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"grades + better_bonus_points"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well I guess some teachers probably do).\n",
"\n",
"It is slightly annoying that the September column ends up on the right. This is because the `DataFrame`s we are adding do not have the exact same columns (the `grades` `DataFrame` is missing the `\"dec\"` column), so to make things predictable, pandas orders the final columns alphabetically. To fix this, we can simply add the missing column before adding:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"grades[\"dec\"] = np.nan\n",
"final_grades = grades + better_bonus_points\n",
"final_grades"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aggregating with `groupby`\n",
"Similar to the SQL language, pandas allows grouping your data into groups to run calculations over each group.\n",
"\n",
"First, let's add some extra data about each person so we can group them:"
"That was easy! Note that the `NaN` values have simply been skipped."
]
},
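{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of grouping and of how `NaN` values are skipped, here is a minimal sketch. The `hobby` column and all the values are assumptions made for this example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: one grade is missing\n",
"demo = pd.DataFrame(\n",
"    {\"hobby\": [\"Biking\", \"Dancing\", \"Biking\", \"Dancing\"],\n",
"     \"sep\": [8.0, 10.0, np.nan, 9.0]},\n",
"    index=[\"alice\", \"bob\", \"charles\", \"darwin\"],\n",
")\n",
"\n",
"# Group the rows by hobby, then average each group; NaN values are ignored\n",
"group_means = demo.groupby(\"hobby\").mean()\n",
"group_means"
]
},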
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview functions\n",
"When dealing with large `DataFrames`, it is useful to get a quick overview of its content. Pandas offers a few functions for this. First, let's create a large `DataFrame` with a mix of numeric values, missing values and text values. Notice how Jupyter displays only the corners of the `DataFrame`:"