Lesson 1: Intro to Pandas

Lesson Overview

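In this lesson, we'll be covering the following topics:

  • Reading data from a CSV file and saving it back to CSV
  • Summarizing a DataFrame at a glance
  • Selecting columns and rows
  • Creating new columns
  • Grouping and aggregation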

What is Pandas?

[Image: dataframe]

[Image: readwrite]

Why Pandas?

Getting Started

First, we need to import the pandas library:
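A minimal sketch of the import, assuming the conventional alias pd (the lesson's own import cell is not reproduced here):

```python
# pd is the alias most pandas code uses by convention
import pandas as pd

# Printing the version is a quick check that the import worked
print(pd.__version__)
```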

We'll be working with data about countries around the world, from the Gapminder Foundation. You can view the data table online here.

Column                 Description
country                Country name
population             Population of the country
region                 Continent the country belongs to
sub_region             Sub regions as defined by
income_group           Income group as specified by the World Bank
life_expectancy        The average number of years a newborn child would live if mortality patterns were to stay the same
gdp_per_capita         GDP per capita (in USD) adjusted for differences in purchasing power
children_per_woman     Number of children born to each woman
child_mortality        Deaths of children under 5 years of age per 1000 live births
pop_density            Average number of people per km²
years_in_school_men    Average number of years attending primary, secondary, and tertiary school for men aged 25-36
years_in_school_women  Average number of years attending primary, secondary, and tertiary school for women aged 25-36

Reading a CSV file

We'll use the function read_csv() to load the data into our notebook and store it in a variable called world.
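The lesson's data URL is not reproduced here, so the sketch below reads a tiny stand-in CSV from a string instead. The rows are invented placeholders, not Gapminder figures; with the real file you would pass its path or URL to read_csv() in the same way.

```python
import io
import pandas as pd

# Stand-in for the lesson's CSV file: made-up rows in a similar layout
csv_text = """country,year,population,region
A,2015,1000000,Asia
B,2015,2000000,Americas
C,2015,3000000,Africa
"""

# read_csv() accepts a local path, a URL, or a file-like object such as StringIO
world = pd.read_csv(io.StringIO(csv_text))
print(world)
```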

What type of object is world?

For large DataFrames, it's often useful to display just the first few or last few rows:

The head() method returns a new DataFrame consisting of the first n rows (default 5)

Pro Tips!

  • To display the documentation for this method within a Jupyter notebook, you can run the command world.head? or press Shift-Tab within the parentheses of world.head()
  • To see other methods available for the DataFrame, type world. followed by Tab for auto-complete options

First two rows:
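As a rough sketch on a toy DataFrame (invented values standing in for world), passing a number to head() controls how many rows come back:

```python
import pandas as pd

# Toy stand-in for world; the values are invented for illustration
world = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "year": [2015] * 6,
    "life_expectancy": [70.1, 65.3, 81.2, 74.8, 59.9, 78.4],
})

first_two = world.head(2)   # rows at positions 0 and 1
first_five = world.head()   # no argument: the default of 5 rows
print(first_two)
```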

What do you think the tail method does?

Data at a Glance

pandas provides many ways to quickly and easily summarize your data:

Number of rows and columns:
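One way to get this is the shape attribute, sketched here on a toy DataFrame with invented values:

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2015, 2015, 2015],
})

# shape is an attribute (no parentheses): a (rows, columns) tuple
print(world.shape)  # (3, 2)
```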

General information about the DataFrame can be obtained with the info() method:

If we just want a list of the column names, we can use the columns attribute:
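Both info() and columns can be tried on a toy DataFrame. The values below are invented, with one missing entry so that the non-null counts differ between columns:

```python
import pandas as pd

# Toy stand-in for world, with one missing life expectancy
world = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2015, 2015, 2015],
    "life_expectancy": [71.5, None, 80.2],
})

# info() prints each column's non-null count and data type, plus memory usage
world.info()

# columns holds the column labels; list() turns them into a plain Python list
print(list(world.columns))
```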

Simple Summary Statistics

The describe() method computes simple summary statistics for a DataFrame:

The describe() method is a convenient way to quickly summarize the averages, extremes, and variability of each numerical data column.

You can look at each statistic individually with methods such as mean(), median(), min(), max(), std(), and count().
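Here is a small sketch of describe() and two of the individual statistics, using invented numbers chosen so the results are easy to check by hand. The numeric_only=True argument is a choice made for this example so that the text column is skipped.

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "life_expectancy": [60.0, 70.0, 80.0, 90.0],
    "population": [1_000_000, 2_000_000, 3_000_000, 6_000_000],
})

# describe() summarizes the numeric columns by default
summary = world.describe()
print(summary)

# Individual statistics, restricted to the numeric columns
means = world.mean(numeric_only=True)
maxima = world.max(numeric_only=True)
print(means)
print(maxima)
```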

Exercise 1.1

a) Initial setup (you can skip to part b if you've already done this):

b) Based on the output of world.info(), what data type is the pop_density column?

c) Based on the output of world.describe(), what are the minimum and maximum years in this data?

For b) and c) you can create a Markdown cell in your notebook and write your answer there.

Break Time!

Saving to CSV

We can save our data locally to a CSV file using the to_csv() method:

Let's check out our new file in the JupyterLab CSV viewer!

Now that the data is saved locally, it can be loaded from the local path instead of downloading from the URL:
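The save-and-reload round trip could look roughly like this. The file name world_copy.csv and the temporary folder are hypothetical choices for this sketch, and index=False is an assumption made here so the row labels are not written as an extra column.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2015, 2015],
    "population": [1_000_000, 2_000_000],
})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "world_copy.csv")

    # Save the DataFrame without its automatically generated index
    world.to_csv(path, index=False)

    # Load it again from the local path
    reloaded = pd.read_csv(path)

print(reloaded)
```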

Selecting Columns

Similar to a dictionary, we can index a specific column of a DataFrame using the column name inside square brackets:
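A minimal sketch of single-column selection on a toy DataFrame (invented values):

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2015, 2015, 2015],
})

# Square brackets with one column name, just like a dictionary key
countries = world["country"]
print(countries)
```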

Pro Tip: In Jupyter notebooks, auto-complete works for DataFrame column names!

Note: The numbers on the left are the DataFrame's index, which was automatically added by pandas when the data was loaded

What type of object is this?

It's a Series, another data structure from the pandas library.

Many of the methods we use on a DataFrame can also be used on a Series, and vice versa

Select multiple columns:
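For example, on a toy DataFrame with invented values:

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2015, 2015, 2015],
    "population": [1_000_000, 2_000_000, 3_000_000],
})

# The outer brackets index the DataFrame; the inner brackets are a list of names
subset = world[["country", "population"]]
print(subset)
```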

Note the double square brackets!

When you select more than one column, the output is a DataFrame:

If you'll be frequently using a particular subset, it's often helpful to assign it to a separate variable

When selecting a larger number of columns, you may want to define the list of column names as a separate variable

Bonus: Unique Values in a Column

A helpful way to summarize categorical data (such as names of countries and regions) is to use the value_counts() method to count the unique values in that column:

In our data, for example, value_counts() shows that 644 of the observations are for Sub-Saharan Africa (recall that each row is an observation corresponding to a single country in a single year).
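To see the mechanics on something small, here is value_counts() on a toy DataFrame. The rows are invented, so the counts are unrelated to the figure quoted above.

```python
import pandas as pd

# Toy stand-in for world: one row per country per year
world = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [2010, 2015, 2010, 2015, 2015],
    "region": ["Asia", "Asia", "Asia", "Asia", "Europe"],
})

# Number of rows for each distinct region, most frequent first
counts = world["region"].value_counts()
print(counts)
```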

If we just want a list of unique values, we can use the unique() method:

If we just want the number of unique values, we can use the nunique() method:
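Both methods can be sketched on the same kind of toy data (invented rows):

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "region": ["Asia", "Asia", "Asia", "Asia", "Europe"],
})

regions = world["region"].unique()     # distinct values, in order of first appearance
n_regions = world["region"].nunique()  # how many distinct values there are
print(regions)
print(n_regions)  # 2
```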

Selecting Rows

We can extract rows from a DataFrame or Series based on a condition.

For example, suppose we want to select the rows where the life expectancy is greater than 82 years

First we use the comparison operator > on the life_expectancy column:

The result is a Series with one Boolean value per row of the DataFrame: True if that row's life_expectancy is above 82, and False otherwise.

We can find out how many rows match this condition using the sum() method

We can use a Boolean Series as a filter to extract the rows of world which have life expectancy above 82
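The three steps above (compare, count, filter) can be sketched on a toy DataFrame. The values are invented, and the names above_82, n_above and long_lived are hypothetical.

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [2015] * 4,
    "life_expectancy": [83.1, 76.4, 82.5, 69.0],
})

# Step 1: the comparison gives a Boolean Series, one value per row
above_82 = world["life_expectancy"] > 82

# Step 2: True counts as 1 and False as 0, so sum() counts the matches
n_above = above_82.sum()

# Step 3: indexing with the Boolean Series keeps only the True rows
long_lived = world[above_82]
print(n_above)
print(long_lived)
```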

We can use any of the comparison operators (>, >=, <, <=, ==, !=) on a DataFrame column to create Boolean Series for filtering our data

Select data for East Asian countries:

We can filter the East Asian data further to select the year 2015:
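A sketch of this two-step filter on toy data. The label "Eastern Asia" is an assumption about how the sub_region column is coded, so check the actual values (for example with unique()) before relying on it.

```python
import pandas as pd

# Toy stand-in for world; sub_region labels are assumed
world = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "year": [2010, 2015, 2015, 2015],
    "sub_region": ["Eastern Asia", "Eastern Asia", "Eastern Asia", "Northern Europe"],
})

# First filter: rows for one sub-region
east_asia = world[world["sub_region"] == "Eastern Asia"]

# Second filter, applied to the result of the first: a single year
east_asia_2015 = east_asia[east_asia["year"] == 2015]
print(east_asia_2015)
```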

Selecting Rows and Columns

To select rows and columns at the same time, we use the syntax .loc[<rows>, <columns>]:
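For instance, on a toy DataFrame (invented values), a Boolean Series can be used for the rows and a list of names for the columns:

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2010, 2015, 2015],
    "life_expectancy": [70.0, 72.0, 81.0],
    "population": [1_000_000, 1_100_000, 5_000_000],
})

rows = world["year"] == 2015             # which rows to keep
cols = ["country", "life_expectancy"]    # which columns to keep

selected = world.loc[rows, cols]
print(selected)
```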

Bonus: More Options for Data Selection

Select rows where the sub-region is Northern Europe and the year is 2015:
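One way to combine two conditions is the & operator, sketched here on toy data with assumed sub_region labels. Each condition needs its own parentheses because & binds more tightly than == in Python.

```python
import pandas as pd

# Toy stand-in for world; sub_region labels are assumed
world = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "year": [2010, 2015, 2015, 2015],
    "sub_region": ["Northern Europe", "Northern Europe", "Eastern Asia", "Northern Europe"],
})

# A row is kept only if both conditions are True
is_north = world["sub_region"] == "Northern Europe"
is_2015 = world["year"] == 2015
north_2015 = world[(world["sub_region"] == "Northern Europe") & (world["year"] == 2015)]
print(north_2015)
```

The | operator works the same way for "or", and ~ negates a condition.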

Other useful ways of subsetting data include methods such as isin(), between(), isna(), notna()
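A brief sketch of these four methods on a toy DataFrame (invented values, with one missing life expectancy):

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Asia", "Europe", "Africa", "Asia"],
    "life_expectancy": [82.0, 75.5, None, 68.0],
})

# isin(): rows whose value appears in a list of options
in_regions = world[world["region"].isin(["Asia", "Africa"])]

# between(): rows within a range (both ends included by default)
mid_range = world[world["life_expectancy"].between(70, 85)]

# isna() / notna(): rows with or without a missing value
missing = world[world["life_expectancy"].isna()]
present = world[world["life_expectancy"].notna()]
print(in_regions, mid_range, missing, present, sep="\n\n")
```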

Data subsets can also be selected using row labels (index), slices (: similar to lists), and position (row and column numbers). For more details, check out this tutorial and the pandas documentation.

Exercise 1.2

a) Create a new DataFrame called americas which contains the rows of world where the region is "Americas" and only the following columns: country, year, sub_region, income_group, pop_density.

b) Use the head() and tail() methods to display the first 20 and last 20 rows.

c) Use the unique() method on the country column to display the list of unique countries in the americas DataFrame.

Creating New Columns

We can perform calculations with the columns of a DataFrame and assign the results to a new column.

Calculate total GDP by multiplying GDP per capita with population:

Compute population in millions:
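Both calculations can be sketched on a toy DataFrame. The values are invented and the column name gdp_total is a hypothetical choice; pop_millions matches the name used later in the lesson.

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B"],
    "population": [2_000_000, 50_000_000],
    "gdp_per_capita": [10_000, 2_000],
})

# Arithmetic on columns works row by row; assigning to a new name adds a column
world["gdp_total"] = world["gdp_per_capita"] * world["population"]
world["pop_millions"] = world["population"] / 1e6
print(world)
```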

Bonus: Sorting

From the summary statistics, we can see that the highest life expectancy in our data is 83.8 years, but we don't know which country and year this is.

We can find out by sorting on the life_expectancy column, from highest to lowest, and displaying the first few rows:

We can see that the highest life expectancy was in Japan in 2015, followed closely by Singapore and Switzerland in the same year.
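The sorting step can be sketched with sort_values() on toy data. The values are invented, so the ranking below is unrelated to the result just described.

```python
import pandas as pd

# Toy stand-in for world
world = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2015, 2015, 2015],
    "life_expectancy": [71.2, 83.0, 79.5],
})

# ascending=False sorts from highest to lowest; the original index labels are kept
ranked = world.sort_values("life_expectancy", ascending=False)
print(ranked.head(2))
```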

Grouping and Aggregation

Grouping and aggregation can be used to calculate statistics on groups in the data.

For simplicity, in this section we'll work with the data from year 2015 only.

Suppose we want to find the population totals in each region of our data.

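We could filter the rows for each region and sum the population one region at a time, but that quickly becomes tedious. Luckily, with pandas there is a better way: using aggregation to compute statistics for groups within our data.

Aggregation is a "split-apply-combine" technique: split the data into groups, apply a function (such as a sum) to each group, and combine the results into a new table.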

Image credit Jake VanderPlas

For simple aggregations, we can use the groupby() method chained with a summary statistic (e.g., sum(), mean(), median(), max(), etc.)

We will group by region, select the pop_millions column, and take the sum:
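A minimal sketch on a toy DataFrame (invented values; world_2015 is a hypothetical name for the 2015 subset):

```python
import pandas as pd

# Toy stand-in for the 2015 data
world_2015 = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Asia", "Asia", "Europe", "Europe"],
    "pop_millions": [10.0, 20.0, 5.0, 2.5],
})

# Split by region, select one column, then sum within each group
region_totals = world_2015.groupby("region")["pop_millions"].sum()
print(region_totals)
```

The result is a Series indexed by region, with one total per group.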

Now let's find the highest population density in each region by aggregating the pop_density column and using max() instead of sum():

We can aggregate multiple columns at once. For example, let's find the mean life expectancy and GDP per capita in each region.
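Selecting a list of columns after groupby() aggregates them together. A sketch with invented values:

```python
import pandas as pd

# Toy stand-in for the 2015 data
world_2015 = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Asia", "Asia", "Europe", "Europe"],
    "life_expectancy": [70.0, 80.0, 82.0, 78.0],
    "gdp_per_capita": [10_000, 30_000, 40_000, 20_000],
})

# Double brackets select two columns; mean() is computed per region for each
region_means = world_2015.groupby("region")[["life_expectancy", "gdp_per_capita"]].mean()
print(region_means)
```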

Note: For a more careful analysis, a population-weighted mean would be preferred in the above calculation, to account for the differences in population among countries. Computing a weighted mean within a pandas aggregation is a bit more involved and beyond the scope of this lesson, so we'll just use the mean.

Bonus: Fancier Aggregation

We can compute sub-totals by grouping on multiple columns:
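For example, grouping by region and income group, sketched with invented values and assumed income_group labels:

```python
import pandas as pd

# Toy stand-in for the 2015 data; income_group labels are assumed
world_2015 = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "region": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "income_group": ["Low", "Low", "High", "High", "High"],
    "pop_millions": [10.0, 20.0, 5.0, 4.0, 6.0],
})

# A list of column names creates one group per (region, income_group) pair
subtotals = world_2015.groupby(["region", "income_group"])["pop_millions"].sum()
print(subtotals)
```

Only the combinations that occur in the data appear in the result.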

We can use the agg method to compute multiple aggregated statistics on our data, for example minimum and maximum country populations in each region:

We can also use agg to compute different statistics for different columns:
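Both uses of agg can be sketched on toy data (invented values). A list of names applies several statistics to one column, and a dictionary maps each column to its own statistic.

```python
import pandas as pd

# Toy stand-in for the 2015 data
world_2015 = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Asia", "Asia", "Europe", "Europe"],
    "pop_millions": [10.0, 20.0, 5.0, 2.5],
    "life_expectancy": [70.0, 80.0, 82.0, 78.0],
})

# Several statistics for one column: one output column per statistic
pop_range = world_2015.groupby("region")["pop_millions"].agg(["min", "max"])

# Different statistics for different columns
mixed = world_2015.groupby("region").agg({
    "pop_millions": "sum",
    "life_expectancy": "mean",
})
print(pop_range)
print(mixed)
```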

For even more complex aggregations, there is also a pivot_table() method.

Exercise 1.3

For this exercise we're working with the original DataFrame world (containing all years).

a) Initial setup (you can skip to part b if you've already done this):

b) Group the DataFrame world by year and compute the world total population (in millions) in each year.

Break Time!
