Lesson 2: Intro to Data Visualization

Lesson Overview

For additional resources, check out the following:

Setup

Pick up where we left off in the previous lesson:

Data Visualization Libraries

viz_libraries

Plus many, many, many more!

The Broader Landscape

viz_libraries

Image credit: Jake Vanderplas

matplotlib & seaborn

matplotlib_seaborn

plotly

plotly

Simple Plots with Pandas

If you want to quickly generate a simple plot, you can use the DataFrame's plot() method to generate a matplotlib-based plot with useful defaults and labels.

Let's use this method to create a bar chart of the total population in each world region.
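As a rough sketch of this step, the snippet below builds a tiny stand-in for the lesson's world_2015 table (the values are made up for illustration, not real statistics), sums the population by region, and draws the bar chart. The grouping step is an assumption about how region_pop could be built.

```python
import pandas as pd

# Toy stand-in for the lesson's world_2015 table (made-up values)
world_2015 = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Africa", "Africa", "Asia", "Europe"],
    "pop_millions": [10.0, 20.0, 50.0, 5.0],
})

# Total population per region, kept as a DataFrame with two columns
region_pop = world_2015.groupby("region", as_index=False)["pop_millions"].sum()

# One bar per region; plot() returns a matplotlib Axes object
ax = region_pop.plot(x="region", y="pop_millions", kind="bar")
```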

The plot() method returns a matplotlib Axes object, which is displayed as cell output. To suppress this output, add a semicolon to the end of the command.

We can create different kinds of plots using the kind keyword argument, such as scatter and line plots, histograms, and others.

Let's use the world_2015 DataFrame to create a scatter plot of life expectancy vs. GDP per capita.
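A minimal sketch of that scatter plot, again using a small made-up table in place of the real world_2015 data:

```python
import pandas as pd

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "gdp_per_capita": [1000, 5000, 20000, 40000],
    "life_expectancy": [58.0, 67.0, 76.0, 81.0],
})

# kind="scatter" needs both x and y column names
ax = world_2015.plot(x="gdp_per_capita", y="life_expectancy", kind="scatter")
```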

Statistical Plots with Seaborn

Types of Plots

Most seaborn plots fall into one of three main categories:

Relational Plots

relational

Distribution Plots

distributions

Categorical Plots

categorical

Getting Started

Let's import the seaborn library and give it the commonly used nickname sns:

Switch to seaborn default aesthetics:

Let's re-create our scatter plot from earlier using seaborn's relplot() function for relational plots.
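The three steps above (import, default aesthetics, scatter plot) might look like the sketch below. The table is a made-up stand-in for world_2015, and sns.set_theme() is one way to switch on the seaborn defaults.

```python
import pandas as pd
import seaborn as sns

# Switch to seaborn's default aesthetics
sns.set_theme()

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "gdp_per_capita": [1000, 5000, 20000, 40000],
    "life_expectancy": [58.0, 67.0, 76.0, 81.0],
})

# relplot() draws a scatter plot by default and returns a FacetGrid
g = sns.relplot(data=world_2015, x="gdp_per_capita", y="life_expectancy")
```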

Semantic Mapping

We can easily enrich this plot with additional information from our data by mapping other variables to visual properties such as colour and size.

Let's colour each point by region:
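A sketch of the colour mapping, assuming a made-up stand-in for world_2015 with a region column:

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "region": ["Africa", "Africa", "Asia", "Europe"],
    "gdp_per_capita": [1000, 5000, 20000, 40000],
    "life_expectancy": [58.0, 67.0, 76.0, 81.0],
})

# hue maps a column to point colour and adds a legend
g = sns.relplot(data=world_2015, x="gdp_per_capita", y="life_expectancy",
                hue="region")
```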

Exercise 2.1

a) Initial setup (you can skip to part b if you've already done this):

b) Use relplot() to create a scatter plot of life_expectancy vs. gdp_per_capita from world_2015, in which the points are coloured by income_group.

c) Add the keyword argument aspect=1.5 to the relplot() function call. How does the plot change?

Customize Axes

Saving a Figure

We can save our visualization in a variety of formats using the savefig() method. Let's save the previous figure (stored in the g variable) to PNG format in the figures subfolder:
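A self-contained sketch of the save step. The table is made up, and the folder is created first on the assumption that figures may not exist yet.

```python
from pathlib import Path

import pandas as pd
import seaborn as sns

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "gdp_per_capita": [1000, 5000, 20000, 40000],
    "life_expectancy": [58.0, 67.0, 76.0, 81.0],
})
g = sns.relplot(data=world_2015, x="gdp_per_capita", y="life_expectancy")

# Make sure the output folder exists, then save; the format follows the extension
Path("figures").mkdir(exist_ok=True)
g.savefig("figures/life_exp_vs_gdp_percap.png")
```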

After running the above command, you'll have a PNG image file life_exp_vs_gdp_percap.png in the figures subfolder of your working directory (the folder where your Jupyter notebook is saved). You can use this PNG file to share your visualization in a document, slideshow, web page, etc.

Note: Viewing Documentation

If you try to view the documentation with g.savefig?, you'll see very little information, because this method simply calls the savefig() method of the attribute g.fig (the matplotlib figure object). To view the full savefig() documentation, you can run the following command in your Jupyter notebook:

g.fig.savefig?

Note: Saving Pandas Plots

When you use the plot() method of a pandas DataFrame, as we did in the first part of this lesson, this figure can also be saved to a file, but the syntax is a bit different. Here is an example:

# Bar chart of total population for each region
ax = region_pop.plot(x='region', y='pop_millions', kind='bar');

# Save to PNG file
#  -- The bbox_inches argument is often not needed, but for this particular
#     bar chart it's needed to prevent the labels from getting cut off.
ax.get_figure().savefig('figures/region_populations.png', bbox_inches='tight')


These other examples show some additional syntax options.

Add Another Semantic Mapping

We can customize our scatter plot to be a "bubble plot", where the size of each marker is proportional to one of the variables in the data.

Let's make the markers proportional to population size:
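A sketch of the bubble plot on a made-up stand-in for world_2015. The sizes range is an illustrative choice, not a value from the lesson.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "region": ["Africa", "Africa", "Asia", "Europe"],
    "gdp_per_capita": [1000, 5000, 20000, 40000],
    "life_expectancy": [58.0, 67.0, 76.0, 81.0],
    "pop_millions": [10.0, 20.0, 50.0, 5.0],
})

# size maps a numeric column to marker size; sizes sets the (min, max) marker area
g = sns.relplot(data=world_2015, x="gdp_per_capita", y="life_expectancy",
                hue="region", size="pop_millions", sizes=(40, 400))
```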

We've visualized four variables (gdp_per_capita, life_expectancy, region, and pop_millions) in this single two-dimensional plot!

Facets

Let's start with a simpler version of our plot:

Instead of mapping region to colours, let's now map it to facets using the col='region' keyword argument:
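A sketch of the faceted version, using made-up data. The col_wrap argument is an optional extra that wraps the panels onto several rows.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "region": ["Africa", "Africa", "Asia", "Europe"],
    "gdp_per_capita": [1000, 5000, 20000, 40000],
    "life_expectancy": [58.0, 67.0, 76.0, 81.0],
})

# col="region" draws one subplot per region; col_wrap=2 starts a new row after two
g = sns.relplot(data=world_2015, x="gdp_per_capita", y="life_expectancy",
                col="region", col_wrap=2)
```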

We can also visualize the income groups by mapping them to colours:

We can use the hue_order keyword argument to make sure the income groups are ordered properly:
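A sketch of hue_order on made-up data. The income group labels below are hypothetical placeholders; substitute the labels that actually appear in your income_group column.

```python
import pandas as pd
import seaborn as sns

# Hypothetical income group labels, listed from lowest to highest
income_levels = ["Low", "Lower middle", "Upper middle", "High"]

# Toy stand-in for world_2015 (made-up values, rows deliberately out of order)
world_2015 = pd.DataFrame({
    "region": ["Europe", "Africa", "Asia", "Africa"],
    "income_group": ["High", "Low", "Upper middle", "Lower middle"],
    "gdp_per_capita": [40000, 1000, 20000, 5000],
    "life_expectancy": [81.0, 58.0, 76.0, 67.0],
})

# hue_order fixes the order of the colours and the legend entries
g = sns.relplot(data=world_2015, x="gdp_per_capita", y="life_expectancy",
                col="region", hue="income_group", hue_order=income_levels)
```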

Instead of manually specifying hue_order each time, we could instead convert the income_group column to a Categorical data type (and similarly for the other categorical variables: country, region, and sub-region). This would ensure the categories are automatically plotted in the correct order.
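One way the conversion could look, sketched on made-up data with hypothetical income group labels:

```python
import pandas as pd

# Hypothetical income group labels, listed from lowest to highest
income_levels = ["Low", "Lower middle", "Upper middle", "High"]

# Toy stand-in for world_2015 (made-up values)
world_2015 = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "income_group": ["High", "Low", "Upper middle", "Lower middle"],
})

# Replace the string column with an ordered Categorical
world_2015["income_group"] = pd.Categorical(
    world_2015["income_group"], categories=income_levels, ordered=True
)

# Sorting now follows the category order instead of alphabetical order
sorted_groups = world_2015.sort_values("income_group")["income_group"].tolist()
```

Plotting functions such as relplot() read the category order from the column, so hue_order is no longer needed.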

Statistical Transformations

Returning to our world DataFrame, which contains data for all years, recall that we can use grouping and aggregation to compute the total world population in each year:
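As a reminder of that aggregation, here is a sketch on a tiny made-up stand-in for the world table:

```python
import pandas as pd

# Toy stand-in for the multi-year world table (made-up values)
world = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2010, 2015, 2010, 2015],
    "pop_millions": [10.0, 12.0, 40.0, 45.0],
})

# Sum the population of all countries within each year
world_pop = world.groupby("year")["pop_millions"].sum()
```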

Now let's see how the population has grown in each income group over time.

And we can use facets to see the population growth of each income group within each region:

Exercise 2.2

a) Initial setup (you can skip to part b if you've already done this):

b) Use relplot() to create a plot similar to the previous example, but plotting life_expectancy on the y-axis instead of pop_millions and aggregating with the mean instead of the sum.

Bonus: Do you spot anything strange in the subplot for the "Americas" region? How could you investigate this using the techniques we learned in the Intro to Pandas lesson?

Bonus: Figure-Level vs. Axes-Level Functions

functions

To learn more about figure-level and axes-level functions, check out this tutorial

Bonus: Long vs. Wide Data

Most seaborn plotting functions are designed for data tables that are in long-form, rather than wide-form.

long_vs_wide

In a long-form data table, each variable is a column and each observation is a row.

Our world data is in long-form. Let's take a subset with just the country, year, and population variables. This table contains fewer variables but is still in long-form.

In a wide-form data table, the columns and rows contain levels of different variables. We can reorganize pop_long in a couple of different ways to create a wide-form table, for example:
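A sketch of the long-to-wide reshaping on made-up data. Using pivot() with years as rows and countries as columns is just one of the possible layouts.

```python
import pandas as pd

# Toy stand-in for the world table (made-up values)
world = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2010, 2015, 2010, 2015],
    "pop_millions": [10.0, 12.0, 40.0, 45.0],
    "life_expectancy": [60.0, 62.0, 75.0, 76.0],
})

# Long-form subset: one row per (country, year) observation
pop_long = world[["country", "year", "pop_millions"]]

# Wide-form: one row per year, one column per country
pop_wide = pop_long.pivot(index="year", columns="country", values="pop_millions")
```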

pop_wide contains the same data as pop_long, but the variables do not correspond to the columns, and each row contains multiple observations.

To learn more about long-form vs. wide-form data, check out this tutorial.

Categorical Plot

Interactive Plots with Plotly

First we'll import Plotly Express and give it the commonly used nickname px:

Let's recreate one of our previous scatter plots:

Saving a Figure

We can save a plotly figure as an HTML file which contains the interactive visualization. First we assign our figure object to a variable, and then use the write_html() method.

We can also save the figure as a static image such as PNG with the write_image() method. This requires some additional dependencies to be installed, as per these instructions.

Figures can also be exported to the free Chart Studio hosting service using Chart Studio's Python package or incorporated into a dashboard with Plotly Dash.

Facets

We can create facet plots:

Animations

We can easily add another variable to our plot as an animation frame.

To learn more about Plotly Express, check out this tutorial.

Getting Help

Thank You!

