Fun with Pandas Groupby, Aggregate, and Unstack

This post is titled “Fun with Pandas Groupby, Aggregate, and Unstack”, but it addresses some of the pain points I face when doing mundane data-munging activities. Every time I do this, I start from scratch and solve the problems in a different way. The purpose of this post is to record at least a couple of solutions so I don’t have to go through the pain again.

The high-level problem is pretty simple. You have a dataframe and want to group by more than one variable, compute some summary statistics using the remaining variables, and use them for some analysis, typically a quick plot. You can easily imagine a number of variants of this problem. One of the pain points for me is a lack of full understanding of the multi-indexing operations that Pandas enables. So far I have skipped dealing with multi-indexes and do not see myself confronting them anytime soon :-). Along the way I have discovered the use of Pandas’ unstack() function multiple times. It is useful for pivot-like operations.

Let us work through an example of this with the gapminder dataset.

We will load the gapminder dataset directly from its GitHub page.
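A minimal sketch of the loading step: in practice you would pass the raw GitHub URL of the gapminder CSV to pd.read_csv(). To keep things runnable offline, the block below instead builds a tiny stand-in dataframe with gapminder's column names; the countries and values are made up for illustration and are not real gapminder data.

```python
import pandas as pd

# With the real data you would do something like:
#   gapminder = pd.read_csv(data_url)   # data_url: raw GitHub URL of the CSV
# Stand-in for gapminder: same column names, made-up values (assumption).
gapminder = pd.DataFrame({
    "country": list("ABABCDCD"),
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

print(gapminder.head())
```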

Pandas groupby() on multiple variables

Let us group by two variables and compute mean values for the rest of the numerical variables.

One way to compute mean values for the remaining variables is to call the mean() function directly on the grouped object.

When we perform a groupby() operation with multiple variables, we get a dataframe with a multi-level index. The result has two index levels followed by three columns of average values, which still carry the original column names.

We can use the columns attribute to get the column names. Note that it gives the three column names, not the two index names.
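Here is a sketch of that call on a small stand-in dataframe (gapminder-style column names, made-up values). Passing numeric_only=True is my addition: it makes mean() skip the text column country, which recent Pandas versions would otherwise refuse to average.

```python
import pandas as pd

# Stand-in for gapminder: same column names, made-up values (assumption).
gapminder = pd.DataFrame({
    "country": list("ABABCDCD"),
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

# Group by two variables, then average the remaining numeric columns.
grouped = gapminder.groupby(["continent", "year"]).mean(numeric_only=True)
print(grouped)
```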
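To make the difference between columns and index levels concrete, this sketch prints both for a grouped result. The dataframe is a small made-up stand-in with gapminder-style names, not the real data.

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

grouped = gapminder.groupby(["continent", "year"]).mean()

# The grouping variables live in the index, not in the columns.
print(grouped.columns)      # the three summarized columns
print(grouped.index.names)  # the two grouping variables
```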

Pandas reset_index() to convert Multi-Index to Columns

We can simplify the multi-index dataframe using reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.
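A short sketch of reset_index() on a grouped result, again using a made-up stand-in dataframe with gapminder-style column names:

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

grouped = gapminder.groupby(["continent", "year"]).mean()

# Move both index levels back into ordinary columns.
flat = grouped.reset_index()
print(flat)
```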

Pandas agg() function to summarize grouped data

Now the simple dataframe is ready for further downstream analysis. One nagging issue is that the result of mean() on a grouped dataframe keeps the original column names, even though the columns now hold mean values. One can change the column names manually. Another option is to use the Pandas agg() function instead of mean().

With the agg() function, we need to specify the variables we want to summarize and the operation for each. In this example, we have three variables and we want to compute the mean of each. We can specify that as a dictionary passed to agg().

Now we get the mean population, life expectancy, and gdpPercap for each continent and year. We again get a multi-indexed dataframe with continent and year as index levels and three columns.
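A sketch of the dictionary form, on a made-up stand-in dataframe. One assumption worth flagging: I give each operation as a one-element list (["mean"]), because that is what makes Pandas return two-level column names; with a plain string ("mean") the columns stay flat.

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

# Map each column to the list of summary operations to apply to it.
summary = gapminder.groupby(["continent", "year"]).agg(
    {"pop": ["mean"], "lifeExp": ["mean"], "gdpPercap": ["mean"]}
)
print(summary)
```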

Accessing Column Names and Index names from Multi-Index Dataframe

Let us check the column names of the resulting dataframe. The columns are now a MultiIndex, which shows up as a list of tuples. Each tuple gives us the original column name and the name of the aggregation operation we applied. In this example, we used mean, but it can be any other summary operation as well.

The column names/information are in two levels. We can access the values in each level using Pandas’ get_level_values() function.

With columns.get_level_values(0), we get the column names.

With get_level_values(1), we get the second level of column names, which is the aggregation function we used.
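The sketch below shows the tuples and both column levels for an agg() result. It uses a made-up stand-in dataframe and assumes the operations were passed as lists, which is what yields two-level columns.

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

summary = gapminder.groupby(["continent", "year"]).agg(
    {"pop": ["mean"], "lifeExp": ["mean"], "gdpPercap": ["mean"]}
)

col_tuples = list(summary.columns)               # (variable, operation) pairs
variables = summary.columns.get_level_values(0)  # original column names
operations = summary.columns.get_level_values(1) # aggregation names
print(col_tuples)
print(variables)
print(operations)
```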

Similarly, we can also get the index values using the index.get_level_values() function. Here we get the values of the first index level.

Similarly, we can get the values of the second index level using index.get_level_values(1).
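The same accessor works on the row index. A sketch on a made-up stand-in dataframe (only the columns needed here):

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
})

summary = gapminder.groupby(["continent", "year"]).agg({"lifeExp": ["mean"]})

continents = summary.index.get_level_values(0)  # first index level
years = summary.index.get_level_values(1)       # second index level
print(continents)
print(years)
```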

Fixing Column names after Pandas agg() function to summarize grouped data

Since the Multi-Index columns hold both the variable name and the operation performed, in two levels, we can combine them to name our new columns correctly.

Here we combine them to create new column names using the map() function of the column index.

We can then assign the new names to the dataframe's columns.

And now we have a summarized dataframe with correct names. Compared to Pandas’ mean() function, summarizing with agg() takes a few more lines, but it gives the right column names.

The resulting dataframe still has a multi-level row index, and we can use the reset_index() function to convert the index levels to columns as before.

And we get a simple dataframe with the right column names.
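One way to do this, sketched on a made-up stand-in dataframe, is to join each (variable, operation) tuple with an underscore. The "_" separator is my choice; any separator works.

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

summary = gapminder.groupby(["continent", "year"]).agg(
    {"pop": ["mean"], "lifeExp": ["mean"], "gdpPercap": ["mean"]}
)

# Each column label is a tuple such as ("pop", "mean"); join it into "pop_mean".
new_names = summary.columns.map("_".join)
summary.columns = new_names
print(summary.columns)
```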
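Putting the renaming and reset_index() together, a sketch on a made-up stand-in dataframe:

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "pop": [10.0, 30.0, 20.0, 60.0, 5.0, 15.0, 6.0, 18.0],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
    "gdpPercap": [500.0, 700.0, 2000.0, 4000.0,
                  6000.0, 8000.0, 30000.0, 34000.0],
})

summary = gapminder.groupby(["continent", "year"]).agg(
    {"pop": ["mean"], "lifeExp": ["mean"], "gdpPercap": ["mean"]}
)
summary.columns = summary.columns.map("_".join)  # flatten column names

# Turn the continent/year index levels into regular columns.
flat = summary.reset_index()
print(flat)
```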

Grouped Line Plots with Seaborn’s lineplot

In the above example, we computed summarized values for multiple columns. Typically, one might be interested in the summary value of a single column, and in making some visualization using the index variables. Let us take an approach similar to the above example, using the agg() function.

In this example, we use a single variable for computing summarized/aggregated values. Here we compute median life expectancy for each year and continent. We also create an appropriate new column name as above.

Note that our resulting data is in tidy form, and we can use Seaborn’s lineplot to make grouped line plots of median life expectancy over time for the 5 continents.
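A sketch of the single-variable version on a made-up stand-in dataframe. The resulting column name lifeExp_median comes from joining the two column levels with "_", which is my naming choice.

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
})

# Median life expectancy per continent and year.
tidy = gapminder.groupby(["continent", "year"]).agg({"lifeExp": ["median"]})
tidy.columns = tidy.columns.map("_".join)  # ("lifeExp", "median") -> "lifeExp_median"
tidy = tidy.reset_index()
print(tidy)
```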

We get nice multiple lineplots with Seaborn.

Pandas unstack function to get data in wide form

If for some reason you don’t want the resulting data to be in tidy form, you can use the unstack() function after computing the summarized values.

Here we use Pandas’ unstack() function after computing median lifeExp for each group, and we get our data in wide form. When you group by multiple variables, unstack() by default moves the last (innermost) index level to the columns, leaving the remaining level on the rows.
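A sketch of the default behavior on a made-up stand-in dataframe. I assume the grouping order is year, then continent, so continent is the last level and ends up on the columns.

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
})

# unstack() with no argument pivots the last index level (continent) to columns.
wide = gapminder.groupby(["year", "continent"])["lifeExp"].median().unstack()
print(wide)
```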

If we want wide form data, but with a different variable on the columns, we can pass the level or variable name to the unstack() function. For example, to get year on the columns, we would use unstack("year").

One advantage of using unstack() is that it replaces the multi-level index with a simple index, so we can quickly make exploratory data visualizations with different variables. In this example, we again make a line plot of median lifeExp against year for each continent. However, this time we simply use Pandas’ plot function, chaining plot() onto the result of unstack().
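A sketch of unstacking by name, on the same kind of made-up stand-in dataframe (grouping order year, then continent, is my assumption):

```python
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
})

medians = gapminder.groupby(["year", "continent"])["lifeExp"].median()

# Naming the level puts year on the columns and leaves continent on the rows.
wide_by_year = medians.unstack("year")
print(wide_by_year)
```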
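A sketch of the chained call, assuming Matplotlib is installed (Pandas' plot function needs it). The data is a made-up stand-in, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

# Stand-in for gapminder: made-up values (assumption).
gapminder = pd.DataFrame({
    "continent": ["Asia"] * 4 + ["Europe"] * 4,
    "year": [1952, 1952, 2007, 2007] * 2,
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 65.0, 67.0, 78.0, 80.0],
})

# Rows (year) become the x-axis; each column (continent) becomes one line.
ax = (
    gapminder.groupby(["year", "continent"])["lifeExp"]
    .median()
    .unstack()
    .plot()
)
```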

And we get a plot almost identical to the earlier one, since Pandas’ plot function calls Matplotlib under the hood, just as Seaborn does.
