Update pandas tutorial

main
Victor Khaustov 2022-05-17 16:01:51 +09:00
parent cf964f0303
commit 10a2b8fc1a
2 changed files with 60 additions and 206 deletions

@ -906,7 +906,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Just for fun, if you want a [xkcd](https://xkcd.com)-style plot, just draw within a `with plt.xkcd()` section:"
"Just for fun, if you want an [xkcd](https://xkcd.com)-style plot, just draw within a `with plt.xkcd()` section:"
]
},
{

@ -54,7 +54,7 @@
"metadata": {},
"source": [
"# `Series` objects\n",
"The `pandas` library contains these useful data structures:\n",
"The `pandas` library contains the following useful data structures:\n",
"* `Series` objects, that we will discuss now. A `Series` object is 1D array, similar to a column in a spreadsheet (with a column name and row labels).\n",
"* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).\n",
"* `Panel` objects. You can see a `Panel` as a dictionary of `DataFrame`s. These are less used, so we will not discuss them here."
@ -224,7 +224,7 @@
"metadata": {},
"source": [
"## Index labels\n",
"Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the rank of the item in the `Series` (starting at `0`) but you can also set the index labels manually:"
"Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the rank of the item in the `Series` (starting from `0`) but you can also set the index labels manually:"
]
},
{
@ -441,7 +441,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh look! The first element has index label `2`. The element with index label `0` is absent from the slice:"
"Oh, look! The first element has index label `2`. The element with index label `0` is absent from the slice:"
]
},
{
@ -603,7 +603,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value. (ie. Not-a-Number means *missing*).\n",
"The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value (i.e. Not-a-Number means *missing*).\n",
"\n",
"Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items. But if you forget to set the right index labels, you can have surprising results:"
]
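
The alignment behaviour described in this cell can be sketched as follows. The values and the `alice`/`bob` labels are placeholders chosen for illustration; only the names `s2` and `s3` and the labels `"colin"` and `"charles"` come from the text.

```python
import pandas as pd

# Placeholder data: "colin" is missing from s2, "charles" is missing from s3.
s2 = pd.Series([68, 83, 112], index=["alice", "bob", "charles"])
s3 = pd.Series([1, 2, 3], index=["bob", "alice", "colin"])

# Arithmetic aligns on index labels, not on position.
total = s2 + s3

# Labels present in only one operand produce NaN in the result.
missing = total[total.isna()].index.tolist()
print(total)
print(missing)
```

Note that `"alice"` and `"bob"` are matched by label even though they appear in a different order in the two `Series`.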
@ -745,7 +745,6 @@
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]\n",
"s7 = pd.Series(temperatures, name=\"Temperature\")\n",
@ -757,7 +756,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) section of pandas' documentation, and look at the example code."
"There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) section of pandas' documentation, and look at the example code."
]
},
{
@ -772,7 +771,7 @@
"* it can handle timezones.\n",
"\n",
"## Time range\n",
"Let's start by creating a time series using `pd.date_range()`. This returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm."
"Let's start by creating a time series using `pd.date_range()`. It returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm."
]
},
{
@ -905,7 +904,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The resampling operation is actually a deferred operation, which is why we did not get a `Series` object, but a `DatetimeIndexResampler` object instead. To actually perform the resampling operation, we can simply call the `mean()` method: Pandas will compute the mean of every pair of consecutive hours:"
"The resampling operation is actually a deferred operation, which is why we did not get a `Series` object, but a `DatetimeIndexResampler` object instead. To actually perform the resampling operation, we can simply call the `mean()` method. Pandas will compute the mean of every pair of consecutive hours:"
]
},
{
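
A minimal sketch of the deferred resampling described in this cell, reusing the `temperatures` list shown earlier in the diff. The variable names `temp_series`, `resampler` and `two_hourly` are assumptions, and the frequencies are written as `"60min"` and `"120min"` because the spelling of the hourly alias differs between pandas versions.

```python
import pandas as pd

temperatures = [4.4, 5.1, 6.1, 6.2, 6.1, 6.1, 5.7, 5.2, 4.7, 4.1, 3.9, 3.5]

# One timestamp per hour for 12 hours, starting 2016-10-29 at 5:30pm.
dates = pd.date_range("2016-10-29 17:30", periods=12, freq="60min")
temp_series = pd.Series(temperatures, index=dates)

# resample() only sets up the grouping; nothing is computed yet.
resampler = temp_series.resample("120min")
print(type(resampler).__name__)

# Calling an aggregation such as mean() performs the actual computation.
two_hourly = resampler.mean()
print(two_hourly)
```

The two-hour bins are aligned to even hours (16:00, 18:00, ...), so with a series starting at 17:30 the first and last bins each contain a single reading.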
@ -1020,7 +1019,7 @@
"metadata": {},
"source": [
"## Upsampling and interpolation\n",
"This was an example of downsampling. We can also upsample (ie. increase the frequency), but this creates holes in our data:"
"It was an example of downsampling. We can also upsample (i.e. increase the frequency), but it will create holes in our data:"
]
},
{
@ -1122,7 +1121,7 @@
"metadata": {},
"source": [
"## Timezones\n",
"By default datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize()` method:"
"By default, datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize()` method:"
]
},
{
@ -1162,7 +1161,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `-04:00` is now appended to all the datetimes. This means that these datetimes refer to [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) - 4 hours.\n",
"Note that `-04:00` is now appended to all the datetimes. It means that these datetimes refer to [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) - 4 hours.\n",
"\n",
"We can convert these datetimes to Paris time like this:"
]
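
A sketch of localizing and then converting timestamps, assuming the 12 hourly timestamps starting on 2016-10-29 at 5:30pm and New York as the source timezone (its offset on that date is `-04:00`). The series values and the variable names are placeholders.

```python
import pandas as pd

dates = pd.date_range("2016-10-29 17:30", periods=12, freq="60min")
temp_series = pd.Series(range(12), index=dates)  # placeholder values

# Attach a timezone to naive timestamps: the wall-clock time is unchanged.
ny = temp_series.tz_localize("America/New_York")

# Convert to another timezone: same instants, different wall-clock time.
paris = ny.tz_convert("Europe/Paris")

print(ny.index[0])
print(paris.index[0])
```

`tz_localize()` interprets naive timestamps as local time in the given zone, whereas `tz_convert()` only changes how already-aware timestamps are displayed.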
@ -1273,7 +1272,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fortunately using the `ambiguous` argument we can tell pandas to infer the right DST (Daylight Saving Time) based on the order of the ambiguous timestamps:"
"Fortunately, by using the `ambiguous` argument we can tell pandas to infer the right DST (Daylight Saving Time) based on the order of the ambiguous timestamps:"
]
},
{
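
A sketch of the `ambiguous` argument. It assumes one way of obtaining ambiguous timestamps: converting to Paris time and then dropping the timezone. Paris left DST on 2016-10-30 at 3am, so 02:30 occurs twice in the naive index. Variable names and values are placeholders.

```python
import pandas as pd

dates = pd.date_range("2016-10-29 17:30", periods=12, freq="60min")
ny = pd.Series(range(12), index=dates).tz_localize("America/New_York")
paris = ny.tz_convert("Europe/Paris")

# Dropping the timezone leaves 2016-10-30 02:30 in the index twice.
naive = paris.tz_localize(None)
print(naive.index[naive.index.duplicated(keep=False)])

# A plain naive.tz_localize("Europe/Paris") raises an ambiguous-time error here.
# With ambiguous="infer", pandas uses the order of the repeated timestamps
# to decide which one is still in DST and which one is not.
relocalized = naive.tz_localize("Europe/Paris", ambiguous="infer")
print(relocalized.index[3:5])
```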
@ -1457,7 +1456,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course we can create a `Series` with a `PeriodIndex`:"
"Of course, we can create a `Series` with a `PeriodIndex`:"
]
},
{
@ -1514,7 +1513,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert periods to timestamps by calling `to_timestamp`. By default this will give us the first day of each period, but by setting `how` and `freq`, we can get the last hour of each period:"
"We can convert periods to timestamps by calling `to_timestamp`. By default, it will give us the first day of each period, but by setting `how` and `freq`, we can get the last hour of each period:"
]
},
{
@ -1585,7 +1584,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas also provides many other time-related functions that we recommend you check out in the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html). To whet your appetite, here is one way to get the last business day of each month in 2016, at 9am:"
"Pandas also provides many other time-related functions that we recommend you check out in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). To whet your appetite, here is one way to get the last business day of each month in 2016, at 9am:"
]
},
{
@ -1998,7 +1997,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To specify missing values, you can either use `np.nan` or NumPy's masked arrays:"
"To specify missing values, you can use either `np.nan` or NumPy's masked arrays:"
]
},
{
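
A minimal sketch of the masked-array route, with placeholder values, column names and index labels. The dtype is spelled as the builtin `object`: `np.object` was only an alias for it, deprecated in NumPy 1.20 and removed in NumPy 1.24.

```python
import numpy as np
import pandas as pd

# Placeholder data mixing numbers and strings, hence the object dtype.
values = [
    [1985, 0, "Biking", 68],
    [1984, 3, "Dancing", 83],
    [1992, 0, "Reading", 112],
]

masked_array = np.ma.asarray(values, dtype=object)

# Mask the elements at positions (0, 1) and (2, 2).
masked_array[(0, 2), (1, 2)] = np.ma.masked

d3 = pd.DataFrame(
    masked_array,
    columns=["birthyear", "children", "hobby", "weight"],
    index=["alice", "bob", "charles"],
)
print(d3)
```

Masked entries end up as missing values in the resulting `DataFrame`.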
@ -2072,7 +2071,7 @@
}
],
"source": [
"masked_array = np.ma.asarray(values, dtype=np.object)\n",
"masked_array = np.ma.asarray(values, dtype=object)\n",
"masked_array[(0, 2), (1, 2)] = np.ma.masked\n",
"d3 = pd.DataFrame(\n",
" masked_array,\n",
@ -2158,7 +2157,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to create a `DataFrame` with a dictionary (or list) of dictionaries (or list):"
"It is also possible to create a `DataFrame` with a dictionary (or list) of dictionaries (or lists):"
]
},
{
@ -2839,7 +2838,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (eg. there was no `bob` in `London`).\n",
"Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (e.g. there was no `bob` in `London`).\n",
"\n",
"Calling `unstack()` will do the reverse, once again creating many `NaN` values."
]
@ -3108,7 +3107,7 @@
"metadata": {},
"source": [
"## Most methods return modified copies\n",
"As you may have noticed, the `stack()` and `unstack()` methods do not modify the object they apply to. Instead, they work on a copy and return that copy. This is true of most methods in pandas."
"As you may have noticed, the `stack()` and `unstack()` methods do not modify the object they are called on. Instead, they work on a copy and return that copy. This is true of most methods in pandas."
]
},
{
@ -3479,7 +3478,7 @@
"metadata": {},
"source": [
"## Adding and removing columns\n",
"You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following work fine:"
"You can generally treat `DataFrame` objects like dictionaries of `Series`, so the following works fine:"
]
},
{
@ -3662,7 +3661,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When you add a new colum, it must have the same number of rows. Missing rows are filled with NaN, and extra rows are ignored:"
"When you add a new column, it must have the same number of rows. Missing rows are filled with NaN, and extra rows are ignored:"
]
},
{
@ -4077,7 +4076,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assigment calls, but it does not work because the `people` object is not actually modified by the first assignment:"
"Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assignment calls, but it does not work because the `people` object is not actually modified by the first assignment:"
]
},
{
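
The chaining problem described in this cell can be sketched as follows, together with one common workaround: passing a function to `assign()` so that it receives the intermediate `DataFrame`. The data and the `overweight` column are placeholders, and this is not necessarily the approach the notebook takes next.

```python
import pandas as pd

# Placeholder data.
people = pd.DataFrame(
    {"weight": [68, 83, 112], "height": [172, 181, 185]},
    index=["alice", "bob", "charles"],
)

# Naive chaining fails: `people` itself never gets a "body_mass_index" column,
# so the second argument raises a KeyError before assign() is even called.
try:
    people.assign(
        body_mass_index=people["weight"] / (people["height"] / 100) ** 2
    ).assign(overweight=people["body_mass_index"] > 25)
except KeyError as err:
    print("KeyError:", err)

# Passing callables works: each one receives the DataFrame being built.
result = people.assign(
    body_mass_index=lambda df: df["weight"] / (df["height"] / 100) ** 2
).assign(overweight=lambda df: df["body_mass_index"] > 25)
print(result)
```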
@ -4220,7 +4219,7 @@
"metadata": {},
"source": [
"## Evaluating an expression\n",
"A great feature supported by pandas is expression evaluation. This relies on the `numexpr` library which must be installed."
"A great feature supported by pandas is expression evaluation. It relies on the `numexpr` library which must be installed."
]
},
{
@ -4523,7 +4522,7 @@
"metadata": {},
"source": [
"## Sorting a `DataFrame`\n",
"You can sort a `DataFrame` by calling its `sort_index` method. By default it sorts the rows by their index label, in ascending order, but let's reverse the order:"
"You can sort a `DataFrame` by calling its `sort_index` method. By default, it sorts the rows by their index label, in ascending order, but let's reverse the order:"
]
},
{
@ -4854,6 +4853,7 @@
}
],
"source": [
"people.sort_values(by=\"body_mass_index\", inplace=True)\n",
"people.plot(kind=\"line\", x=\"body_mass_index\", y=[\"height\", \"weight\"])\n",
"plt.show()"
]
@ -4892,7 +4892,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, there are way too many options to list here: the best option is to scroll through the [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code."
"Again, there are way too many options to list here: the best option is to scroll through the [Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) page in pandas' documentation, find the plot you are interested in and look at the example code."
]
},
{
@ -4900,7 +4900,7 @@
"metadata": {},
"source": [
"## Operations on `DataFrame`s\n",
"Although `DataFrame`s do not try to mimick NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:"
"Although `DataFrame`s do not try to mimic NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:"
]
},
{
@ -5798,7 +5798,7 @@
"## Handling missing data\n",
"Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.\n",
" \n",
"Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values by a any value using the `fillna()` method:"
"Let's try to fix the problem above. For example, we can decide that missing data should result in a zero, instead of `NaN`. We can replace all `NaN` values by any value using the `fillna()` method:"
]
},
{
@ -6466,7 +6466,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well I guess some teachers probably do). So let's call the `dropna()` method to get rid of rows that are full of `NaN`s:"
"There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well, I guess some teachers probably do). So let's call the `dropna()` method to get rid of rows that are full of `NaN`s:"
]
},
{
@ -9463,7 +9463,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course there's also a `tail()` function to view the bottom 5 rows. You can pass the number of rows you want:"
"Of course, there's also a `tail()` function to view the bottom 5 rows. You can pass the number of rows you want:"
]
},
{
@ -9594,7 +9594,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `info()` method prints out a summary of each columns contents:"
"The `info()` method prints out a summary of each column's contents:"
]
},
{
@ -10239,7 +10239,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As you might guess, there are similar `read_json`, `read_html`, `read_excel` functions as well. We can also read data straight from the Internet. For example, let's load the top 1,000 U.S. cities from github:"
"As you might guess, there are similar `read_json`, `read_html`, `read_excel` functions as well. We can also read data straight from the Internet. For example, let's load the top 1,000 U.S. cities from GitHub:"
]
},
{
@ -10351,7 +10351,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are more options available, in particular regarding datetime format. Check out the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for more details."
"There are more options available, in particular regarding datetime format. Check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for more details."
]
},
{
@ -10361,7 +10361,7 @@
"# Combining `DataFrame`s\n",
"\n",
"## SQL-like joins\n",
"One powerful feature of pandas is it's ability to perform SQL-like joins on `DataFrame`s. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple simple `DataFrame`s:"
"One powerful feature of pandas is its ability to perform SQL-like joins on `DataFrame`s. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple of simple `DataFrame`s:"
]
},
{
@ -10761,7 +10761,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course `LEFT OUTER JOIN` is also available by setting `how=\"left\"`: only the cities present in the left `DataFrame` end up in the result. Similarly, with `how=\"right\"` only cities in the right `DataFrame` appear in the result. For example:"
"Of course, `LEFT OUTER JOIN` is also available by setting `how=\"left\"`: only the cities present in the left `DataFrame` end up in the result. Similarly, with `how=\"right\"` only cities in the right `DataFrame` appear in the result. For example:"
]
},
{
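
A sketch of left and right joins with `pd.merge()`. The two frames are abbreviated reconstructions based on the `city_loc` and `city_pop` output visible elsewhere in this diff, and the variable names `left` and `right` are assumptions.

```python
import pandas as pd

city_loc = pd.DataFrame({
    "state": ["CA", "NY", "FL", "OH", "UT"],
    "city": ["San Francisco", "New York", "Miami", "Cleveland", "Salt Lake City"],
    "lat": [37.781334, 40.705649, 25.791100, 41.473508, 40.755851],
})
city_pop = pd.DataFrame({
    "population": [808976, 8363710, 413201, 2242193],
    "city": ["San Francisco", "New York", "Miami", "Houston"],
    "state": ["California", "New-York", "Florida", "Texas"],
})

# LEFT OUTER JOIN: keep every city of city_loc; unmatched ones get NaN population.
left = pd.merge(city_loc, city_pop, on="city", how="left")

# RIGHT OUTER JOIN: keep every city of city_pop; Houston gets a NaN latitude.
right = pd.merge(city_loc, city_pop, on="city", how="right")

print(left)
print(right)
```

Because both frames have a `state` column that is not part of the join key, the result contains `state_x` and `state_y`.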
@ -11101,7 +11101,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows having the same index (eg. 3). Pandas handles this rather gracefully:"
"Note that this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows having the same index (e.g. 3). Pandas handles this rather gracefully:"
]
},
{
@ -11573,7 +11573,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case it really does not make much sense because the indices do not align well (eg. Cleveland and San Francisco end up on the same row, because they shared the index label `3`). So let's reindex the `DataFrame`s by city name before concatenating:"
"In this case it really does not make much sense because the indices do not align well (e.g. Cleveland and San Francisco end up on the same row, because they shared the index label `3`). So let's reindex the `DataFrame`s by city name before concatenating:"
]
},
{
@ -11690,152 +11690,6 @@
"This looks a lot like a `FULL OUTER JOIN`, except that the `state` columns were not renamed to `state_x` and `state_y`, and the `city` column is now the index."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `append()` method is a useful shorthand for concatenating `DataFrame`s vertically:"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>city</th>\n",
" <th>lat</th>\n",
" <th>lng</th>\n",
" <th>population</th>\n",
" <th>state</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>San Francisco</td>\n",
" <td>37.781334</td>\n",
" <td>-122.416728</td>\n",
" <td>NaN</td>\n",
" <td>CA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>New York</td>\n",
" <td>40.705649</td>\n",
" <td>-74.008344</td>\n",
" <td>NaN</td>\n",
" <td>NY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Miami</td>\n",
" <td>25.791100</td>\n",
" <td>-80.320733</td>\n",
" <td>NaN</td>\n",
" <td>FL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Cleveland</td>\n",
" <td>41.473508</td>\n",
" <td>-81.739791</td>\n",
" <td>NaN</td>\n",
" <td>OH</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Salt Lake City</td>\n",
" <td>40.755851</td>\n",
" <td>-111.896657</td>\n",
" <td>NaN</td>\n",
" <td>UT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>San Francisco</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>808976.0</td>\n",
" <td>California</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>New York</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>8363710.0</td>\n",
" <td>New-York</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Miami</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>413201.0</td>\n",
" <td>Florida</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Houston</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2242193.0</td>\n",
" <td>Texas</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" city lat lng population state\n",
"0 San Francisco 37.781334 -122.416728 NaN CA\n",
"1 New York 40.705649 -74.008344 NaN NY\n",
"2 Miami 25.791100 -80.320733 NaN FL\n",
"3 Cleveland 41.473508 -81.739791 NaN OH\n",
"4 Salt Lake City 40.755851 -111.896657 NaN UT\n",
"3 San Francisco NaN NaN 808976.0 California\n",
"4 New York NaN NaN 8363710.0 New-York\n",
"5 Miami NaN NaN 413201.0 Florida\n",
"6 Houston NaN NaN 2242193.0 Texas"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"city_loc.append(city_pop)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As always in pandas, the `append()` method does *not* actually modify `city_loc`: it works on a copy and returns the modified copy."
]
},
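
The removed `append()` cells have a direct equivalent in `pd.concat()`. `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, which is presumably why the cells were dropped (the commit itself does not state the reason). The frames below are abbreviated reconstructions of `city_loc` and `city_pop` based on the removed output.

```python
import pandas as pd

city_loc = pd.DataFrame({
    "state": ["CA", "NY", "FL", "OH", "UT"],
    "city": ["San Francisco", "New York", "Miami", "Cleveland", "Salt Lake City"],
    "lat": [37.781334, 40.705649, 25.791100, 41.473508, 40.755851],
    "lng": [-122.416728, -74.008344, -80.320733, -81.739791, -111.896657],
})
city_pop = pd.DataFrame({
    "population": [808976, 8363710, 413201, 2242193],
    "city": ["San Francisco", "New York", "Miami", "Houston"],
    "state": ["California", "New-York", "Florida", "Texas"],
}, index=[3, 4, 5, 6])

# Vertical concatenation: rows of city_pop are stacked below those of city_loc.
# Columns missing from one frame are filled with NaN.
combined = pd.concat([city_loc, city_pop])
print(combined)
```

Like most pandas operations, `pd.concat()` returns a new object and leaves `city_loc` and `city_pop` unchanged.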
{
"cell_type": "markdown",
"metadata": {},
@ -12149,8 +12003,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What next?\n",
"As you probably noticed by now, pandas is quite a large library with *many* features. Although we went through the most important features, there is still a lot to discover. Probably the best way to learn more is to get your hands dirty with some real-life data. It is also a good idea to go through pandas' excellent [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html), in particular the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html)."
"# What's next?\n",
"As you probably noticed by now, pandas is quite a large library with *many* features. Although we went through the most important features, there is still a lot to discover. Probably the best way to learn more is to get your hands dirty with some real-life data. It is also a good idea to go through pandas' excellent [documentation](https://pandas.pydata.org/pandas-docs/stable/index.html), in particular the [Cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html)."
]
},
{