Data Insights with Python for Beginners

Next steps and additional resources for Ladies Learning Code workshop participants

Jennifer Walker | jenfly (at) gmail (dot) com

  • This page is under active development, with frequent updates as I add more resources
  • All the Jupyter notebook source code and data are available in the Github repo
  • Data and examples on this page are adapted from the Ladies Learning Code workshop materials

Please email me with any questions, comments, or suggestions. I'd love to hear from you!

0. Introduction

This guide is primarily geared towards folks who are already doing a lot of data analysis in Excel or Google Sheets and are considering switching over to Python for these tasks. The objectives are:

  • Give you an overview of Python's capabilities in data analysis, to help you decide if Python is the right tool for you
  • Help you set up your computer with everything you need to continue learning Python in a development environment that you can use from beginner level all the way to Python data wizard

If you're just learning Python for fun, or as an introduction to coding in general, you might want to check out the resources listed in Section 1, but for now you probably want to stick with the PyCharm setup that we used in the workshop rather than installing any new programs. And if your primary goal with Python is more traditional software development (such as web apps) rather than data analysis, the setup I suggest in Section 3 might not be the best fit for you either, and you'll want to explore other resources that are more oriented towards software development. If you're an avid data cruncher, then read on!

1. Learning Python fundamentals

There are many online Python courses which are great for beginners. Below are a few popular ones—the first two are oriented specifically towards data analysis. These courses are either free or have some free content you can sample.

The free resources below are geared towards scientists and are a bit faster paced, but still start with the basics:

For data analysis, you'll want to develop a good understanding of Python fundamentals up to about an intermediate level. This would include delving deeper into all the topics we covered in the workshop (e.g. variables, types, if statements and logic, lists, for loops, dictionaries, libraries, and reading from files) and additional topics such as: string operations, tuples, sets, namespaces, user-defined functions, file output, modules, and debugging. Depending on your Python needs, you might also want to learn about other topics such as packaging, classes, error handling, and virtual environments.
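
To give a tiny taste of a few of those topics, here's a quick sketch of a user-defined function with some string operations, plus a tuple and a set (the names and values are made up for illustration):

# A user-defined function that uses a few string operations
def format_name(first, last):
    return first.strip().title() + ' ' + last.strip().title()

point = (3, 4)                  # a tuple: an immutable sequence
provinces = {'BC', 'AB', 'BC'}  # a set: keeps only unique items

print(format_name('  ada ', 'LOVELACE'))  # Ada Lovelace
print(point[0])                           # 3
print(provinces)                          # {'BC', 'AB'} (order may vary)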


2. Python data analysis ecosystem

As we saw in the workshop, Python comes with many handy built-in functions, like print(), type(), and int(). It also comes with some handy built-in libraries that we can use—in your workshop slides, see the ones titled "Using a Library". A library (also called a "package") is a collection of Python code that is grouped together and given a name (e.g. csv). After we import a library (e.g. import csv), we can use all the pieces of code inside it, such as functions (e.g. csv.DictReader()).
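
For example, here's a minimal sketch of using the built-in csv library to read the chapters file we'll meet again in Section 4 (assuming it lives in a data folder and has the 'City' and 'Chapter Lead(s)' columns shown there):

import csv

# Read each row of the chapters file as a dictionary keyed by column name
with open('data/llc-chapters.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['City'] + ': ' + row['Chapter Lead(s)'])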

But if you're doing some serious data crunching, you're going to need additional libraries that don't automatically come with the Python that was already on your computer (if you're on a Mac) or which you downloaded from https://www.python.org/ (if you're on Windows). These are called 3rd party libraries, and they're free and available for anyone to use. There are many libraries that are useful for data analysis; three of the main ones are numpy, matplotlib, and pandas.

numpy is a numerical library that efficiently performs calculations on arrays of data. An array is a collection of variables of the same type (such as integer, float, string, etc.) that are organized into one or more dimensions. A 1-dimensional array is like a single column or row in a spreadsheet. A 2-dimensional array is like a table of rows and columns in a spreadsheet. Pretty much every data-related Python library uses numpy behind the scenes—it's the heart of number crunching in Python.
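
Here's a minimal sketch of what working with a 1-dimensional array looks like (the numbers are made up):

import numpy as np

# A 1-dimensional array, like a single column in a spreadsheet
prices = np.array([10.50, 20.00, 15.25, 30.75])

# One operation applies to every element at once (no loop required)
with_tax = prices * 1.13
print(with_tax)       # [11.865  22.6  17.2325  34.7475]
print(prices.mean())  # 19.125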

matplotlib is a data visualization library, for making graphs such as line, bar, and scatter charts, histograms, heatmaps, contour plots, and many others. It gives the user very fine control to customize all aspects of the layout and style of each graph. matplotlib is the oldest data visualization library in Python, and very widely used. Many newer visualization libraries use matplotlib behind the scenes, and build upon its features to create additional types of plots (e.g. geographic maps, advanced statistical graphs, etc.) or provide convenient sets of pre-defined styles for different aesthetic effects.
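
As a small, hypothetical example, here's how a basic bar chart might look in matplotlib (the values are borrowed from the monthly totals we'll compute in Section 4):

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
participants = [509, 649, 505, 708]

plt.bar(range(len(months)), participants)  # draw one bar per month
plt.xticks(range(len(months)), months)     # label the bars with month names
plt.ylabel('# Participants')
plt.title('Monthly Participants')
plt.show()  # display the figure (in Jupyter you'd use %matplotlib inline instead; see Section 4)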

pandas is a library for working with tabular (spreadsheet-like) data. It combines the number crunching power of numpy with the visualization power of matplotlib to reproduce pretty much all the spreadsheet functionality of Excel and Google Sheets, plus a whole lot more. Compared to working in spreadsheets, pandas makes it much easier to handle messy and missing data, merge data from many different files, filter data to work with many different subsets, quickly summarize data in graphs and pivot tables with just a few lines of code, analyze timeseries data in an intuitive way, and much more.

Python combined with 3rd party libraries like these (and many others) creates a powerful, full-featured ecosystem for data analysis. The numpy, matplotlib, and pandas libraries all require a solid understanding of Python basics, so I recommend using some of the resources from Section 1, or any others you find helpful, to keep learning and practicing until you feel comfortable with the fundamentals before diving into the larger Python data analysis ecosystem. However, even if you're just starting out with Python, you can do your future self a favour by setting up your computer with a development environment that will serve you now, as a beginner, and throughout the rest of your journey into the larger ecosystem. In the next section, I'll walk you through the steps for this setup.


3. Setting up your computer

I always say Python is like a "choose your own adventure" because there are many different ways of doing everything—from installing it on your computer, to choosing a program for editing and running code (e.g. PyCharm vs. Jupyter), to choosing which commands and functions you use to accomplish a task. This makes it very flexible and easy to do things exactly the way you want, but it can also be overwhelming when you're just starting out and don't know where to go next!

To use the Python data analysis ecosystem, you need to install 3rd party libraries such as numpy, matplotlib, and pandas in such a way that Python can find them. This can be quite tricky, and there are a few different ways of doing it. Here I will be suggesting one approach, which is widely used in the data analysis, data science, and scientific computing communities, but this certainly isn't the only way to do things. You may find as you explore and experiment with different options, that another approach works better for you, so don't feel obligated to do everything the way I've suggested here.

All the software I recommend here is available for free.


3.1 Anaconda

If data analysis is the main reason you're using Python, I highly recommend installing Anaconda. This free software comes with everything you need to get started with data analysis, so that whenever you're ready to dive in and start learning numpy, matplotlib, pandas, etc., you can do so without having to worry about the details of how to find, install, and manage all these libraries. Anaconda is a "Python distribution", which means it includes the official Python language, like we used in the workshop, plus some extra stuff:

  • all the most common 3rd party libraries for data analysis are pre-installed so you don't need to install them yourself, and
  • it includes a package manager (conda) that you can use if you need to install any additional libraries and/or manage virtual environments (see the example commands below).
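
If you're curious what this looks like in practice, here's a quick sketch of some conda commands you might type in a terminal (the library and environment names are just examples):

conda install seaborn              # install an additional library
conda create -n py27 python=2.7    # create a separate environment with another Python version
conda list                         # see which libraries are installed in the current environment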

When you download Anaconda, you'll need to select either the latest Python 3 version (currently Python 3.6) or Python 2.7—here's the first choice in the choose your own adventure! Either one is fine. I recommend selecting the latest Python 3 unless you already know that you need Python 2.7 for some reason. You can always install another Python version later if you need it, and multiple versions can co-exist peacefully in virtual environments in Anaconda without interfering with each other.

For details on installing Anaconda on your computer, please see the setup instructions from my PyLadies workshop. Once you've gone through this process to set up your computer with Anaconda, the great thing is that you'll have all the main data analysis libraries immediately available whenever you're ready to dive into them. You won't need to go searching around for the libraries you need and installing them yourself—they're already installed and ready for you!


3.2 JupyterLab

In the workshop, we used the Python console in PyCharm to type in code and run it on the fly (interactive coding). This console is okay for occasional interactive coding, but it is quite limited. Data analysis is often a very interactive process: read in data from a file, do some calculations, show the results in a table or figure, and based on what you see in the results, perform additional calculations, find new directions to explore, and so on. For this type of workflow, you might find JupyterLab an easier environment to work in compared to PyCharm.

JupyterLab is a development environment specifically designed for data analysis and scientific computing, with many enhanced features to make interactive coding much easier, and it comes pre-installed with Anaconda, so you don't need to install any additional software to start using it.

One of the most important Jupyter tools is the Jupyter notebook, where you can run Python code and also include formatted text, equations, graphs, images, and other content. It's a very handy tool in data analysis, especially when you're exploring your data and when sharing your results with others. This guide was written as a Jupyter notebook!

You can edit and run Jupyter notebooks within JupyterLab. The JupyterLab environment also includes a CSV file viewer, text editor, and other useful tools. I'm putting together a quick-start tutorial to help folks quickly get up and running with JupyterLab and Jupyter notebooks—it's currently under construction, but please check the link in early September 2018! You can also edit and run Jupyter notebooks from a standalone app, as demonstrated in this tutorial.


3.3 Using PyCharm with Anaconda

If you'd like to use PyCharm with your newly installed Anaconda version of Python, you can follow the instructions in this guide. Once you've gone through this process to set up PyCharm with Anaconda, you'll have access to all the 3rd party libraries from Anaconda when you run code within PyCharm.


4. Pandas demo

As I've mentioned, one of the main libraries for data analysis is pandas. Here I'll show you examples of the cool things you can do with pandas. This section is intended only as a demo, rather than as a lesson or tutorial, so don't worry about understanding any of the code for now. For each bit of code, you can just check out the description above and the output below, to get a sense of what the code is doing.

Hopefully this demo will help you decide whether pandas is a library that might be useful to you in your work. As a caveat, there is a significant learning curve to get from Python beginner to pandas user, but going through this learning curve is likely to be a worthwhile investment if you're in any or all of the following situations:

  • Working with data that is messy and/or split up into many different files that need to be merged together
  • Using long, complicated spreadsheet formulas that are unwieldy and annoying to edit
  • Working with large amounts of data and finding that Excel/Sheets is slowing down, freezing, or crashing
  • Finding yourself doing a lot of repetitive spreadsheet tasks and wanting to automate them
  • Wanting to create customized graphs beyond the ones available in Excel/Sheets
  • ... and probably some other situations that I haven't thought of yet!

The examples in this demo will start by reproducing the analysis we did in the workshop, to give you a sense of some of the key features of pandas and the ways that it could make your life easier. Then I will show some deeper analyses of the workshop data, getting into fairly advanced features of Python and pandas in a separate part 2, to show how powerful they are and give you an idea of what's possible with these tools (especially if you're already familiar with pivot tables and other advanced spreadsheet features). If the examples start to get too advanced or difficult to follow, don't get discouraged or overwhelmed, just skip them! You can get a lot of value out of the basic features of the pandas library, and you may not have any need for the more advanced stuff.

This demo uses a program called Jupyter notebook to combine text descriptions, code, graphs, and other output into a single "notebook". Jupyter notebook comes pre-installed with Anaconda—hurray! For more information on Jupyter notebook, check out Section 6.

So, without further ado, let's take the data that we used in our workshop, and see how pandas can help us analyze it.


Example 1: Ladies Learning Code Chapters

First we import pandas and create a variable chapters_data that stores the data from 'llc-chapters.csv'. This variable is a type that is available in pandas, called a DataFrame. When we display it here in the notebook (by putting the variable name chapters_data as the last line in the code blurb below) we can see that it's like a table from a spreadsheet or a .csv file, with data stored in rows and columns.

In [1]:
import pandas

chapters_data = pandas.read_csv('data/llc-chapters.csv')
chapters_data
Out[1]:
    City         Province  Chapter Lead(s)
0   Vancouver    BC        Meghan
1   Calgary      AB        Darcie
2   Edmonton     AB        Bree & Dana
3   Saskatoon    SK        Brittany & Marli
4   Winnipeg     MB        Michelle & Jessica
5   Toronto      ON        Lindsay
6   Barrie       ON        Christine
7   Hamilton     ON        Meg & Abena
8   Halifax      NS        Christopher & MacKenzie
9   Fredericton  NB        Lisa
10  London       ON        Kelly & Jennie
11  Montreal     QC        Erika & Cassie
12  Ottawa       ON        Jasmine & Cassie
13  St Johns     NL        Dana
14  Victoria     BC        Erin & Christina
15  Waterloo     ON        Amandah
16  Quebec       QC        Guillaume & Karine

How big is our table? We can look at its shape to see that there are 17 rows and 3 columns:

In [2]:
nrows, ncols = chapters_data.shape
print('There are ' + str(nrows) + ' chapters and ' + str(ncols) + ' columns of data.')
There are 17 chapters and 3 columns of data.

We can look at a single row of our table:

In [3]:
first_row = chapters_data.iloc[0].to_dict()
first_row
Out[3]:
{'Chapter Lead(s)': 'Meghan', 'City': 'Vancouver', 'Province': 'BC'}

And we can look at a single column. The numbers on the left correspond to the row numbers, called the "index" (remember that Python indexes start counting from 0).

In [4]:
leaders = chapters_data['Chapter Lead(s)']
leaders
Out[4]:
0                      Meghan
1                      Darcie
2                 Bree & Dana
3            Brittany & Marli
4          Michelle & Jessica
5                     Lindsay
6                   Christine
7                 Meg & Abena
8     Christopher & MacKenzie
9                        Lisa
10             Kelly & Jennie
11             Erika & Cassie
12           Jasmine & Cassie
13                       Dana
14           Erin & Christina
15                    Amandah
16         Guillaume & Karine
Name: Chapter Lead(s), dtype: object

As we did in exercise 5, we can find which chapters have co-leaders, count them, and display the city names. With pandas we can do this with a few lines of code and we don't need any for loops!

In [5]:
coleads = leaders.str.contains('&')
n_coleads = coleads.sum()
print('There are ' + str(n_coleads) + ' chapters with co-leads. These chapters are:')
chapters_data.loc[coleads, 'City']
There are 10 chapters with co-leads. These chapters are:
Out[5]:
2      Edmonton
3     Saskatoon
4      Winnipeg
7      Hamilton
8       Halifax
10       London
11     Montreal
12       Ottawa
14     Victoria
16       Quebec
Name: City, dtype: object

Not needing a for loop actually becomes really important when you're working with big data files. Python loops are fine for small data sets like the ones we explored in the workshop, but for much bigger data they are too slow. pandas, which uses numpy behind the scenes, is optimized to work much faster and more efficiently with large amounts of data than a plain Python loop.
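
To make that concrete, here's the same co-leads count from above done both ways; the loop version works fine for 17 rows, but the vectorized version stays fast as the table grows:

# Loop version: check each name, one at a time
n_coleads_loop = 0
for name in chapters_data['Chapter Lead(s)']:
    if '&' in name:
        n_coleads_loop += 1

# Vectorized version: one operation on the whole column at once
n_coleads_vec = chapters_data['Chapter Lead(s)'].str.contains('&').sum()

print(n_coleads_loop, n_coleads_vec)  # both are 10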


Example 2: Canada Learning Code Events

Here we have a slightly larger data set ('llc-workshop-data.csv') and we can really start to see the power of pandas!

First we read the data into a variable events and see that it has 6487 rows and 11 columns.

In [6]:
events = pandas.read_csv('data/llc-workshop-data.csv')
events.shape
Out[6]:
(6487, 11)

Let's look at the first few rows of our data table. Anywhere you see "NaN" in the table, that stands for "not a number" and indicates an empty cell in the table (missing data).

In [7]:
events.head()
Out[7]:
Event Name Event ID Order # Order Date Quantity Ticket Type Attendee # Date Attending Order Type Gender How did you hear about this event?
0 Introduction to HTML & CSS in Toronto 9849231316 236882194 2013-12-16 1 Yes, I'd like to attend! 300683796 2014-01-08 PayPal Completed Female NaN
1 Introduction to HTML & CSS in Toronto 9849231316 236888382 2013-12-16 1 Yes, I'd like to attend! 300691338 2014-01-08 PayPal Completed Female NaN
2 Introduction to HTML & CSS in Toronto 9849231316 236916392 2013-12-16 1 Yes, I'd like to attend! 300726210 2014-01-08 PayPal Completed Female NaN
3 Introduction to HTML & CSS in Toronto 9849231316 237225952 2013-12-18 1 Yes, I'd like to attend! 301102696 2014-01-08 PayPal Completed Female NaN
4 Introduction to HTML & CSS in Toronto 9849231316 238323753 2013-12-25 1 Yes, I'd like to attend! 302451151 2014-01-08 PayPal Completed Female NaN
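
As an aside, pandas also makes those missing values easy to work with. For example, here's a quick one-liner to count the NaN entries in each column of the events table (output not shown):

# Count the missing (NaN) entries in each column of the events table
events.isnull().sum()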

We can see that our data encompasses events from Jan 8, 2014 to Dec 13, 2014:

In [8]:
print('First event ' + events['Date Attending'].min())
print('Last event ' + events['Date Attending'].max())
First event 2014-01-08
Last event 2014-12-13

Each row of the events table corresponds to a participant in an event, so for a single event there are many rows which will all have the same value in the "Event Name" column. With a single line of code, and no for loops, we can obtain a sorted list of all the event names and the number of participants for each event name!

Let's check out the 20 event names with the most participants. Many of these events occurred multiple times during the year, so what we're looking at is the grand total over the year, for each event name.

In [9]:
events['Event Name'].value_counts().head(20)
Out[9]:
Intro to HTML & CSS  (Toronto)                                                                            424
WordPress for Beginners in Toronto                                                                        219
Intro to HTML & CSS in Toronto                                                                            171
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website in Toronto             169
Introduction to JavaScript  (Toronto)                                                                     108
CSS Fundamentals for Beginners in Toronto                                                                 107
Intro to Photoshop  in Toronto                                                                             95
Intro to HTML5 & Responsive Design in Toronto                                                              94
Intro to HTML & CSS: Building a One Page Website (Victoria Edition)                                        88
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website (Victoria Edition)      81
Intro to HTML & CSS (Edmonton Edition)                                                                     81
Girls Learning Code Day: Intro to HTML & CSS in Victoria! (ages 8-13)                                      79
Weekday Intro to HTML & CSS: Building a Multi-Page Website in Toronto                                      78
WordPress for Beginners  (Toronto)                                                                         76
Intro to HTML5 and Responsive Design in Toronto                                                            74
Creative Coding and Data Visualization with Processing  (Toronto)                                          66
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website (Vancouver Edition)     65
Introduction to Web Design in Toronto                                                                      64
Nov 15th: Creative Coding and Data Visualization with Processing in Toronto                                63
Intro to HTML & CSS (Vancouver Edition)                                                                    57
Name: Event Name, dtype: int64

Let's find how many people participated in "National Learn to Code Day" events. Once again, no loops required!

In [10]:
national = events['Event Name'].str.contains('National Learn to Code Day')
n_national = national.sum()
print(str(n_national) + ' attended National Learn to Code Day')
799 attended National Learn to Code Day

And now let's find the totals for "Kids Learning Code" plus "Girls Learning Code" events, and any youth-oriented "National Learn to Code Day" events:

In [11]:
kids = events['Event Name'].str.contains('Kids Learning Code')
girls = events['Event Name'].str.contains('Girls Learning Code')
youth = kids | girls
print(str(youth.sum()) + ' attended Kids Learning Code or Girls Learning Code events')

kids_national = events.loc[national, 'Event Name'].str.contains('Kids Learning Code')
girls_national = events.loc[national, 'Event Name'].str.contains('Girls Learning Code')
youth_national = kids_national | girls_national

print(str(youth_national.sum()) + ' attended youth-oriented National Learn to Code Day')
1575 attended Kids Learning Code or Girls Learning Code events
71 attended youth-oriented National Learn to Code Day


That covers the exercises we did in the workshop. Now let's see how we could use pandas to explore the data further.

When you were viewing the 'llc-workshop-data.csv' file, you may have noticed there were different categories of participants in the "Ticket Type" column. Let's see the 20 most common participant categories, and the total number of participants in each:

In [12]:
events['Ticket Type'].value_counts().head(20)
Out[12]:
Yes, I'd like to attend!                   3671
Yes, I'd like to mentor!                   1155
Register a Girl & Parent/Guardian           522
Register a girl                             261
Register to mentor                          198
Register a boy!                             172
Register a girl!                            150
Register a boy                               90
Make a Donation!                             35
Yes, I'd like to register a teen girl!       34
Register a girl (Bring Your Own Laptop)      20
Ladies Learning Code Alumni                  17
Register a boy! (Bring Your Own Laptop)      15
Yes, I'd like to volunteer!                  15
Register!                                    14
Register to Mentor                           10
Register a boy (Use Our Laptop)              10
Register a girl (Use Our Laptop)             10
Register a Girl + Parent/Guardian             9
Yes I'd like to volunteer!                    9
Name: Ticket Type, dtype: int64

Let's suppose we only want to count National Learn to Code Day participants in the "Yes, I'd like to attend!" category. Here is that count:

In [13]:
ticket = "Yes, I'd like to attend!"
attend = events['Ticket Type'] == ticket
national_attend = national & attend
print(str(national_attend.sum()) + ' attended National Learn to Code Day with a ticket type: ' + ticket)
524 attended National Learn to Code Day with a ticket type: Yes, I'd like to attend!

What about patterns over time? Let's see how many participants there were each month:

In [14]:
events['date'] = pandas.to_datetime(events['Date Attending'])
events['month'] = events['date'].dt.month
monthly = events[['Quantity', 'month']].groupby('month').sum()
monthly = monthly.rename(columns = {'Quantity' : '# Participants'})
monthly
Out[14]:
month  # Participants
1      509
2      649
3      505
4      708
5      435
6      425
7      261
8      155
9      981
10     482
11     1101
12     276

It would be nice to see this data as a graph. We can do this with just a couple of lines of code. The %matplotlib inline line is a special command that tells Jupyter notebook to display graphs inline in the notebook.

In [15]:
%matplotlib inline

monthly.plot.bar(title='Monthly Participants');

How about monthly totals for the categories we looked at earlier?

In [16]:
categories = ['National Learn to Code Day', 'Kids Learning Code', 'Girls Learning Code']
monthly_cat = pandas.DataFrame(index=range(1, 13), columns=categories)
monthly_cat.index.name = 'month'
for category in categories:
    data = events[events['Event Name'].str.contains(category)]
    monthly_cat[category] = data[['Quantity', 'month']].groupby('month').sum()['Quantity']
monthly_cat = monthly_cat.fillna(0)
monthly_cat
Out[16]:
month  National Learn to Code Day  Kids Learning Code  Girls Learning Code
1      0.0                         18.0                14.0
2      0.0                         59.0                34.0
3      0.0                         47.0                25.0
4      0.0                         61.0                24.0
5      0.0                         50.0                70.0
6      0.0                         53.0                49.0
7      0.0                         13.0                0.0
8      0.0                         0.0                 0.0
9      798.0                       62.0                73.0
10     0.0                         31.0                37.0
11     1.0                         128.0               603.0
12     0.0                         87.0                37.0

Looks pretty good, except you might notice something strange... National Learn to Code Day was in September, but there is one participant listed for it in November. What's going on? With pandas we can easily investigate:

In [17]:
wonky = events['Event Name'].str.contains('National Learn to Code Day')
wonky = wonky & (events['month'] == 11)
events[wonky]
Out[17]:
Event Name Event ID Order # Order Date Quantity Ticket Type Attendee # Date Attending Order Type Gender How did you hear about this event? date month
5279 National Learn to Code Day: Intro to HTML & CS... 13209711603 356187119 2014-10-10 1 Register to mentor 450472457 2014-11-08 Free Order NaN Other 2014-11-08 11

Sure enough, here is one participant listed for a National Learn to Code Day event, but with "Date Attending" as November 8. Let's check the rows that have the same Event ID, which should correspond to the same specific event. Here are the first 10 of those rows:

In [18]:
event_id = int(events.loc[wonky, 'Event ID'])
events[events['Event ID'] == event_id].head(10)
Out[18]:
Event Name Event ID Order # Order Date Quantity Ticket Type Attendee # Date Attending Order Type Gender How did you hear about this event? date month
5274 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 352026737 2014-10-02 1 Register a Girl & Parent/Guardian 445287897 2014-11-08 Free Order NaN Other 2014-11-08 11
5275 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 353034505 2014-10-05 1 Register a Girl & Parent/Guardian 446552415 2014-11-08 Free Order NaN From a Friend 2014-11-08 11
5276 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 353034505 2014-10-05 1 Register a Girl & Parent/Guardian 446552419 2014-11-08 Free Order NaN Other 2014-11-08 11
5277 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 354093537 2014-10-07 1 Register a Girl & Parent/Guardian 447869453 2014-11-08 Free Order NaN From a Friend 2014-11-08 11
5278 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 354987441 2014-10-08 1 Register a Girl & Parent/Guardian 448978279 2014-11-08 Free Order NaN From a Friend 2014-11-08 11
5279 National Learn to Code Day: Intro to HTML & CS... 13209711603 356187119 2014-10-10 1 Register to mentor 450472457 2014-11-08 Free Order NaN Other 2014-11-08 11
5280 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 356406519 2014-10-11 1 Register to mentor 450742683 2014-11-08 Free Order NaN Other 2014-11-08 11
5281 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 356875759 2014-10-12 1 Register a Girl & Parent/Guardian 451322811 2014-11-08 Free Order NaN Other 2014-11-08 11
5282 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 357160335 2014-10-13 1 Register a Girl & Parent/Guardian 451665465 2014-11-08 Free Order NaN From a Friend 2014-11-08 11
5283 Girls Learning Code Day: Intro to HTML & CSS i... 13209711603 357322717 2014-10-13 1 Register a Girl & Parent/Guardian 451864891 2014-11-08 Free Order NaN Other 2014-11-08 11

We can see that in row 5279, the event is listed as a "National Learn to Code Day" event, but the other entries for this event ID are listed as a "Girls Learning Code Day" event, suggesting an error might have somehow crept in when the data was entered. Since pandas allows us to quickly summarize the data from many different angles, such as monthly totals above, it becomes much easier to spot inconsistencies and possible errors.

Now let's check out these monthly totals by category as a bar chart:

In [19]:
monthly_cat.plot.bar(title='Monthly Participants');

You may have noticed in the code above that we didn't include any commands to specify the x-axis label or the names of the series in the legend. When you create a graph in pandas, it automatically labels things for you, based on the row and column names of your data table. You can change the labels if you want (for example, make the x-axis label "Month" instead of "month"), but when you're first exploring your data it's really handy to be able to quickly generate graphs with all the labels automatically created for you.
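
For example, here's one way we could re-draw the chart with a customized x-axis label (pandas bar plots return a matplotlib axes object that we can tweak):

# Re-draw the monthly chart, then change the x-axis label from 'month' to 'Month'
ax = monthly.plot.bar(title='Monthly Participants')
ax.set_xlabel('Month');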

How about the number of participants at each event? Here's what that looks like:

In [20]:
att_by_event = events.groupby('Event ID').sum()['Quantity']
print(str(att_by_event.max()) + ' attended the biggest event')
print(str(att_by_event.min()) + ' attended the smallest event')
print(str(round(att_by_event.mean(), 1)) + ' was the average attendance per event')
169 attended the biggest event
1 attended the smallest event
33.1 was the average attendance per event

Now let's look at totals by gender. Looking at participants in all the events, we see that they are about 32% female, 6% male, and 62% not listed.

In [21]:
genders = events['Gender'].value_counts(dropna=False).to_frame(name='# Participants')
genders = genders.set_index(genders.index.fillna('Not listed'))
genders['% Participants'] = 100 * genders['# Participants'] / genders['# Participants'].sum()
genders
Out[21]:
            # Participants  % Participants
Not listed  4024            62.031756
Female      2100            32.372437
Male        363             5.595807

And there are many other things we could do to explore this data further. To dive even deeper into pandas and explore some of its more advanced features, you can check out part 2 of this demo.

Hopefully this section has given you an idea of whether pandas would be a good tool to use in your own data analysis!


5. Data visualization demo

There are many, many different data visualization libraries for making graphs and images in Python. In this section, I'll show examples of the kinds of graphs you can make with pandas (which uses matplotlib behind the scenes for creating graphs) and another popular library, seaborn, which is also built on matplotlib. For many more examples of visualizations that can be created with these libraries, you can scroll through the images in these galleries:

The above libraries are just a tiny fraction of the data visualization world in Python, which is a whole ecosystem of its own. Another important cluster of libraries within this ecosystem are libraries for creating interactive web-based plots. Two popular ones are bokeh and plotly, but there are many others.

For a more complete picture of the different visualization tools available in Python, check out this great talk by Jake VanderPlas.

This section is intended only as a demo, rather than as a lesson or tutorial, so don't worry about understanding any of the code for now. For each bit of code, just check out the description above and the output below, to get a sense of what the code is doing.

We'll start out by importing some of the libraries we'll use in this section, and renaming them to shorthand names (np, plt, pd, and sns). These are standard shorthands that you'll likely see throughout online tutorials and documentation for these libraries.

In [22]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Next we read in some data from part 2 of the pandas demo, which gives a breakdown by city of the Canada Learning Code events in 2014:

In [23]:
events_by_city = pd.read_csv('data/llc-workshop-attendance-by-city.csv', index_col=0)
events_by_city.head()
Out[23]:
Barrie Calgary Edmonton Fredericton Halifax Hamilton Kitchener/Waterloo London Montreal N/A ... Saskatoon St. John's Sydney Toronto Vancouver Victoria Waterloo Whitehorse Winnipeg Total
Event Name (standardized)
Intro to HTML & CSS 0.0 40.0 81.0 0.0 0.0 30.0 51.0 28.0 53.0 0.0 ... 57.0 36.0 0.0 668.0 57.0 0.0 0.0 0.0 37.0 1150.0
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website 29.0 41.0 43.0 17.0 50.0 34.0 0.0 0.0 51.0 0.0 ... 45.0 0.0 0.0 169.0 65.0 81.0 35.0 0.0 0.0 705.0
WordPress for Beginners 28.0 0.0 38.0 0.0 28.0 0.0 0.0 0.0 37.0 0.0 ... 0.0 38.0 0.0 295.0 52.0 0.0 0.0 0.0 0.0 568.0
Girls Learning Code Day: Intro to HTML & CSS! (ages 8-13) 29.0 0.0 0.0 31.0 45.0 26.0 0.0 18.0 38.0 0.0 ... 29.0 0.0 0.0 0.0 33.0 79.0 30.0 0.0 54.0 454.0
Intro to JavaScript 0.0 19.0 0.0 27.0 0.0 24.0 0.0 0.0 51.0 0.0 ... 26.0 0.0 0.0 188.0 46.0 0.0 0.0 0.0 0.0 413.0

5 rows × 23 columns

Here's a stacked bar chart of the breakdown by city for the five most popular Canada Learning Code events in 2014 (from the example data we worked with in the previous section):

In [24]:
top_five = events_by_city.head(5).drop('Total', axis=1).T
top_five.plot.bar(stacked=True, figsize=(10, 4))
plt.ylabel('# Participants')
plt.legend(loc='upper left', frameon=False);

We can show the data as a horizontal stacked bar chart instead. Let's also sort the cities by the total number of participants across these top 5 events, and add some gridlines.

In [25]:
sorted_top_five = top_five.copy()
sorted_top_five['Total'] = sorted_top_five.sum(axis=1)
sorted_top_five = sorted_top_five.sort_values('Total', ascending=True).drop('Total', axis=1)
with sns.axes_style('darkgrid'):
    sorted_top_five.plot.barh(stacked=True, figsize=(10, 7))
    plt.xlabel('# Participants')
    plt.legend(loc='center right')

Here's an example of a line chart, using some fake timeseries data:

In [26]:
dates = pd.date_range('2017-01-01', '2017-12-31')
ts = pd.DataFrame(np.random.randn(len(dates), 3), index=dates, columns=['A', 'B', 'C'])
ts = ts.cumsum()
ts.plot(figsize=(8, 3));

We can also plot these as separate line charts in subplots:

In [27]:
ts.plot(figsize=(8, 6), subplots=True);

Here's an example of creating a plot in the seaborn library, which can quickly generate graphs for many common statistical analyses.

We'll use a famous data set, Fisher's iris data, which is a set of measurements of the petals and sepals of three species of iris flowers. This data set is available as a built-in example that can be loaded from seaborn.

In [28]:
iris = sns.load_dataset('iris')
iris.head()
Out[28]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Let's look at distributions and relationships between measurements. With a single line of code, we can create a set of scatter plots and histograms for all four variables:

In [29]:
sns.pairplot(iris, hue="species", size=2);

seaborn also provides a variety of pre-configured styles that can be applied to quickly change the look of a graph or a whole collection of graphs. Here's an example of applying a different style to the above graph:

In [30]:
palette = sns.color_palette('deep')
with sns.axes_style('darkgrid'):
    sns.pairplot(iris, hue="species", size=2, palette=palette);

The example graphs in this section could easily be customized in many other ways. To get a sense of these customizations, and of the other kinds of graphs that can be created with these libraries, check out the galleries linked at the start of this section.

We've seen a very small sampling of Python's data visualization capabilities. There are many more libraries you can explore, which have been developed for a wide variety of purposes, such as maps and geographic data, 3-dimensional visualizations, interactive graphs for web pages, and more!


6. Additional resources

When you've become comfortable with the fundamentals of Python and want to start learning numpy, matplotlib, pandas and other data analysis libraries, there are many resources available to choose from. Of the sites I listed in Section 1, DataCamp and DataQuest both offer online courses covering these topics in depth, and there are many other online learning platforms with in-depth courses on these topics as well. You can also search for tutorials for each library to find many helpful resources.

If you'd like to learn more about the Python data analysis ecosystem, such as how all the pieces fit together, which tools might be useful for you, how Python evolved into the data crunching powerhouse that it is today, and some of the challenges for new users navigating the Python world, I highly recommend this fantastic keynote talk by Jake VanderPlas from the PyData Seattle 2017 conference.

Another great resource from Jake VanderPlas is his YouTube video series "Reproducible Data Analysis in Jupyter". Part 1 and part 2 give a really nice demonstration of how to use pandas and Jupyter notebook to analyze patterns in a dataset of hourly counts of bike trips across the Fremont Bridge in Seattle. If you're at a more intermediate or advanced level with Python, you might want to check out the rest of the videos in the 10-part series for all sorts of great additional tips and tricks. If you're familiar with principal component analysis and unsupervised clustering, check out part 10 for a very neat analysis showing how the bike trip data can be used to distinguish regular weekdays from weekends, holidays, and even a big snow storm!

For some more background on why I think Jupyter is so awesome for data analysis, you can check out this great talk by Fernando Perez, who was the original creator of these tools and now leads a large team of developers at Berkeley, working on all kinds of cool new features for the Jupyter project. I especially like his philosophy about "human-centered" interactive software for scientific computing and data analysis, and the ethical importance of free, open-source tools to make science accessible to folks all around the world who may not all have budgets for expensive proprietary software.

I like learning from books, and when I was learning Python I found this book by Wes McKinney, the creator of pandas, very helpful. It has a companion site with data and Jupyter notebooks corresponding to all the examples in the book, so you can work through them yourself.

If you really want to geek out, check out the Zen of Python here or type import this into the IPython shell or PyCharm console.

Hopefully this guide will help you decide if Python is a good fit for the work you're doing, and if so, will help you navigate the world of Python data analysis and give you some ideas of your next steps and path forward. If you have any questions, comments, suggestions, or any other feedback, please send me an email—I'd love to hear from you. Good luck in your Python journey!
