Pandas Demo: Part 2

Advanced features and fancy stuff

Jennifer Walker | jenfly (at) gmail (dot) com

This document is part 2 of a pandas demo started here. Jupyter notebook source code and data are available in the Github repo.

Please email me with any questions, comments, or suggestions. I'd love to hear from you!

Example 2 cont'd: Deeper Analysis of Canada Learning Code Events

Continuing from Example 2 in part 1 of this demo, we first import the pandas library and read the data from 'llc-workshop-data.csv' into a frame. Then we parse the dates into months, so we can later compute event attendance totals by month. Finally, we rename the "Quantity" column to a more useful "# Participants".

In [1]:
import pandas
%matplotlib inline

events = pandas.read_csv('data/llc-workshop-data.csv')
events['date'] = pandas.to_datetime(events['Date Attending'])
events['month'] = events['date'].dt.month
events = events.rename(columns={'Quantity' : '# Participants'})
events.head()
Out[1]:
Event Name Event ID Order # Order Date # Participants Ticket Type Attendee # Date Attending Order Type Gender How did you hear about this event? date month
0 Introduction to HTML & CSS in Toronto 9849231316 236882194 2013-12-16 1 Yes, I'd like to attend! 300683796 2014-01-08 PayPal Completed Female NaN 2014-01-08 1
1 Introduction to HTML & CSS in Toronto 9849231316 236888382 2013-12-16 1 Yes, I'd like to attend! 300691338 2014-01-08 PayPal Completed Female NaN 2014-01-08 1
2 Introduction to HTML & CSS in Toronto 9849231316 236916392 2013-12-16 1 Yes, I'd like to attend! 300726210 2014-01-08 PayPal Completed Female NaN 2014-01-08 1
3 Introduction to HTML & CSS in Toronto 9849231316 237225952 2013-12-18 1 Yes, I'd like to attend! 301102696 2014-01-08 PayPal Completed Female NaN 2014-01-08 1
4 Introduction to HTML & CSS in Toronto 9849231316 238323753 2013-12-25 1 Yes, I'd like to attend! 302451151 2014-01-08 PayPal Completed Female NaN 2014-01-08 1
Now, let's get even more fancy-schmancy to showcase some of the superpowers of Python + pandas

When you were viewing the 'llc-workshop-data.csv' file, you may have noticed that most of the names in the "Event Name" column included the name of the city, in a variety of different formats (e.g. "in Toronto", "(Toronto Edition)", "(Toronto)", " - Toronto"). This situation is typical in real world data—it's often messy and inconsistent, with two or more useful pieces of information smushed into one and no obvious way to extract those parts out of the whole.

It would be great to know the city associated with each event and analyze our data on a city-by-city basis. Ideally, we might have a separate file which lists additional information for each event, including the city name. In that case, we could use pandas to easily merge that data with the data we've been working with. However, we often don't have all the extra data we might want, so in that situation it would be nice to somehow extract the city names from the event names. But how would we even do this in a spreadsheet? It is possible, but it would be quite excruciating.

In Python, however, this sort of task is much, much easier. It requires knowledge of some fairly advanced concepts, but after you've learned those concepts it's pretty straightforward to apply them to a situation like this one. In the code below, I'm using "regular expressions" from a library called re to look for text patterns that allow me to find and extract the city names. If you'd like to learn more about regular expressions, you can check out this tutorial and this handy cheatsheet.

The code below may look overwhelming, but as before, don't worry about trying to understand the details of it—this type of text wrangling is a pretty specialized niche of the data analysis ecosystem and may not be relevant to your work (but for any of you who do need to parse information from text data, it's nice to know that this sort of thing can be done in Python). The code is taking all the event names in our data, looking for the various patterns associated with the city names ("in Toronto", "(Toronto)", etc.) and then putting the city name and a standardized version of the event name (minus the city) into two new columns of our data table.

In [2]:
import re

def parse_event_name(name):
    """Return a pandas Series with event name parsed into event and city"""
    event, city = name, None
    
    # Small workaround to keep some patterns from being mistaken for cities
    event = event.replace('(Responsive Design)', '[Responsive Design]')
    event = event.replace(' - Holiday Edition', '[Holiday Edition]')

    # Check for cities listed as '(Vancouver Edition)', '(Vancouver)', '(Winnipeg Launch)', etc.
    matches = re.findall('\([A-Z][a-z]+.*\)', event)
    if len(matches) > 0:
        event = event.replace(matches[0], '').strip()      
        city = matches[0].replace('(', '').replace(')', '')
        city = city.replace('Edition', '').replace('Launch', '').strip()
        
    # Check for cities listed as 'in Vancouver', 'in Toronto', etc.
    matches = re.findall(' in [A-z][a-z]+', event)
    if len(matches) > 0:
        event = event.replace(matches[0], '').strip()
        city = matches[0].replace(' in ', '').strip()
   
    # Check for cities listed as ' - Vancouver', ' - Toronto', etc.
    matches = re.findall(' - [A-z][a-z]+', event)
    if len(matches) > 0:
        event = event.replace(matches[0], '').strip()
        city = matches[0].replace(' - ', '').strip()
    
    # Some events are named 'Intro to' and others are 'Introduction to'
    # Let's standardize them to all use the same convention
    event = event.replace('Introduction to', 'Intro to')
    
    # Return the output as a pandas Series
    output = pandas.Series({'Event Name' : name,
                            'Event Name (standardized)' : event,
                            'City' : city})
    return output

# Apply the above function to the entire "Event Name" column of our data table,
# to parse all the event names, and add columns to our data table with the 
# standardized event name and the city name
df = events['Event Name'].apply(parse_event_name).fillna('N/A')
events = events.join(df[['Event Name (standardized)', 'City']], how='outer')
events[['Event Name', 'Event Name (standardized)', 'City']].head()
Out[2]:
Event Name Event Name (standardized) City
0 Introduction to HTML & CSS in Toronto Intro to HTML & CSS Toronto
1 Introduction to HTML & CSS in Toronto Intro to HTML & CSS Toronto
2 Introduction to HTML & CSS in Toronto Intro to HTML & CSS Toronto
3 Introduction to HTML & CSS in Toronto Intro to HTML & CSS Toronto
4 Introduction to HTML & CSS in Toronto Intro to HTML & CSS Toronto

Now we have a column of city names, and with the code below we can easily check that the city names make sense, and see the number of participants for each city. When I was writing the parse_event_name function above, I started with just one pattern to look for the city name, and then used the code below to see what other patterns were in the data, and update my parse_event_name function with those patterns, building up my code with an iterative approach. pandas makes it easy to build up your code in this way, checking for each inconsistency or messiness in your data, and adding a few lines of code to account for it and standardize the data into a consistent format.

In [3]:
city_counts = events['City'].value_counts(dropna=False)
city_counts
Out[3]:
Toronto               2669
Vancouver              500
N/A                    427
Calgary                418
Montreal               337
Edmonton               308
Victoria               296
Ottawa                 264
Halifax                216
Saskatoon              176
Hamilton               171
Barrie                 168
Fredericton            115
Winnipeg                91
Waterloo                75
St. John's              74
London                  67
Kitchener/Waterloo      51
Saint John              23
Sydney                  15
Whitehorse              14
Regina                  12
Name: City, dtype: int64

It looks like the code is working the way we want it to, giving us a correct list of city names. For any event name where parse_event_name couldn't find a city, we've set the city to "N/A". In the code below, we can check all those events and make sure we didn't miss anything. This was another step I used in building up my parse_event_name function.

In [4]:
no_city = list(set(events.loc[events['City'] == 'N/A', 'Event Name']))
for name in no_city:
    print(name)
Kids Learning Code: Inventors After-School Program
Girls Learning Code: Generative Art with Processing
Kids Learning Code: Intro to Processing and Arduino for Teens
Kids Learning Code: Webmaking With HTML & CSS for 6-8 year olds
Kids Learning Code PA Day: HTML & CSS with Mozilla Thimble
Girls Learning Code: Intro to Arduino
Kids Learning Code PA Day: Game Design with Scratch
Kids Learning Code: Making Online Games with Scratch
Kids Learning Code PA Day Introduction to HTML & CSS
Kids Learning Code: Graphics for Games with Pixlr
Kids Learning Code: Intro to Programming with Python
Kids Learning Code: Intro to Arduino for Father's Day
Kids Learning Code: Generative Art with Processing for Mother's Day
Kids Learning Code: Intro to Python For Teens
Kids Learning Code: Graphic Design with Pixlr
Kids Learning Code at Williamsburg Public School (ages 8-13)
Kids Learning Code: Intro to Toy Hacking
Girls Learning Code: CSS with Sublime Text for 9-12 year olds
Girls Learning Code: Intro to Programming with Ruby For Teens
Kids Learning Code PA Day: Stop Motion Animation
Kid Learning Code: Intro to Photoshop For Teens at George Brown College
Girls Learning Code: Digital Drawing at George Brown College
Kids Learning Code at Nelson Mandela (ages 8-13)
Kids Learning Code: After School Program
Kids Learning Code: Introduction to Video Creation for 6-8 year olds
Kids Learning Code: Intro to Photo Editing
Kids Learning Code: Introduction to Makey Makey
Kids Learning Code: Game Design with Scratch ages 9-12

Looks like everything is working the way we want it to! The event names above are all missing the city name, so it is correct to list the city as "N/A" for these events.

Now that we've extracted the city names and standardized the event names, we can see the number of participants across all cities, for each event:

In [5]:
events['Event Name (standardized)'].value_counts()
Out[5]:
Intro to HTML & CSS                                                                               1150
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website                 705
WordPress for Beginners                                                                            568
Girls Learning Code Day: Intro to HTML & CSS! (ages 8-13)                                          454
Intro to JavaScript                                                                                413
Intro to Python                                                                                    154
Intro to Photoshop                                                                                 146
Intro to HTML5 & Responsive Design                                                                 130
CSS Fundamentals for Beginners                                                                     121
CSS Fundamentals                                                                                   112
Intro to Web Design                                                                                106
Intro to Web Design with Photoshop                                                                 106
Girls Learning Code Mother- Daughter HTML & CSS                                                     99
Intro to HTML & CSS: Building a One Page Website                                                    88
Intro to Ruby                                                                                       82
Weekday Intro to HTML & CSS: Building a Multi-Page Website                                          78
Intro to jQuery                                                                                     78
Intro to HTML5 and Responsive Design                                                                74
Intro to Photoshop & Illustrator                                                                    68
Creative Coding and Data Visualization with Processing                                              66
Nov 15th: Creative Coding and Data Visualization with Processing                                    63
Intro to Drupal                                                                                     59
Intro to Photoshop and Illustrator                                                                  51
Girls Learning Code: Intro to Arduino                                                               50
Intro to Mobile Web                                                                                 49
July 21 & 23: WordPress for Beginners                                                               47
Ladies Learning Code Instructor & Mentor Appreciation Night                                         47
Girls Learning Code Day (ages 8-13): Intro to HTML & CSS                                            46
Girls Learning Code Day: Intro to HTML & CSS for Teens! (ages 13-17)                                43
Intro to Mobile Web --> HTML5 and Responsive Design                                                 38
                                                                                                  ... 
Kids Learning Code: Intro to Video Creation for 6-8 year olds                                       16
Intro to Mobile Web --> HTML5 & Responsive Design                                                   16
Kids Learning Code: Graphic Design with Pixlr                                                       16
Kids Learning Code: Intro to Programming with Python                                                15
Kids Learning Code (ages 6-8): Photo Editing with Pixlr                                             15
Girls Learning Code Day: Intro to HTML & CSS, NS! (ages 8-13)                                       15
Kid Learning Code: Intro to Photoshop For Teens at George Brown College                             15
Girls Learning Code: Generative Art with Processing                                                 14
Kids Learning Code: Intro to Processing and Arduino for Teens                                       14
Girls Learning Code Day: Intro to HTML & CSS! (ages 8-13 & 13-17)                                   14
National Learn to Code Day 2014 Kids Learning Code Edition for Teens 13-16 Intro to HTML & CSS      14
Girls Learning Code (ages 9-12): Intro to Ruby                                                      14
Intro to Mobile Web/ Responsive Web Design                                                          13
Girls Learning Code (ages 9-12): Storytelling with Multimedia                                       12
Kids Learning Code (ages 13-16): Sound Design with Reaper and Fmod Studio                           11
Kids Learning Code Game Design                                                                      11
Girls Learning Code: Intro to Programming with Ruby For Teens                                       11
Kids Learning Code: After School Program                                                            10
Girls Learning Code: CSS with Sublime Text for 9-12 year olds                                       10
Kids Learning Code: Generative Art with Processing for Mother's Day                                  9
Kids Learning Code: Intro to Arduino for Father's Day                                                8
Kids Learning Code: Webmaking With HTML & CSS for 6-8 year olds                                      7
Kids Learning Code: Game Design with Scratch ages 9-12                                               6
Girls Learning Code Intro to HTML & CSS for Teens                                                    6
Kids Learning Code: Intro to Photo Editing                                                           6
Kids Learning Code: Inventors After-School Program                                                   5
Girls Learning Code: Intro to HTML & CSS for Teens                                                   5
Family Information Night                                                                             1
National Learn to Code Day: Intro to HTML & CSS! (ages 8-13)                                         1
Kids Learning Code PA Day: Stop Motion Animation                                                     1
Name: Event Name (standardized), Length: 97, dtype: int64

You may recall from part 1 of the demo, that we found a total of 799 participants for National Learn to Code Day events. In the table above, we have a total of 705 participants for the event "National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website". The numbers don't quite match up, suggesting that there are multiple different event names in the "National Learn to Code Day" category. Let's explore the data to see the different "flavours" of National Learn to Code Day events, and the number of participants in each:

In [6]:
national = events['Event Name'].str.contains('National Learn to Code Day')
national_counts = events.loc[national, 'Event Name (standardized)'].value_counts()
print(national_counts)
print('\n*** TOTAL for National Learn to Code Day: ' + str(national_counts.sum()) + ' ***')
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website                705
National Learn to Code Day 2014 Kids Learning Code 9-12 Edition Intro to HTML & CSS                32
National Learn to Code Day: Girls Learning Code                                                    25
CSS Fundamentals on National Learn to Code Day                                                     22
National Learn to Code Day 2014 Kids Learning Code Edition for Teens 13-16 Intro to HTML & CSS     14
National Learn to Code Day: Intro to HTML & CSS! (ages 8-13)                                        1
Name: Event Name (standardized), dtype: int64

*** TOTAL for National Learn to Code Day: 799 ***

We can see that there were 6 different events for National Learn to Code Day, and the grand total attendance for these 6 events matches with our previous findings.

Now let's analyze the data on a city-by-city basis to extract even more insights from it. As an example, we can create a pivot table of the attendance for each event, sorted from highest total attendance to lowest, and broken down by city. Here are the first 5 rows of that pivot table:

In [7]:
events_by_city = events.pivot_table(values='# Participants', 
                                    index='Event Name (standardized)',
                                    columns='City', aggfunc='sum')
events_by_city['Total'] = events_by_city.sum(axis=1)
events_by_city = events_by_city.sort_values('Total', ascending=False).fillna(0)
events_by_city.head()
Out[7]:
City Barrie Calgary Edmonton Fredericton Halifax Hamilton Kitchener/Waterloo London Montreal N/A ... Saskatoon St. John's Sydney Toronto Vancouver Victoria Waterloo Whitehorse Winnipeg Total
Event Name (standardized)
Intro to HTML & CSS 0.0 40.0 81.0 0.0 0.0 30.0 51.0 28.0 53.0 0.0 ... 57.0 36.0 0.0 668.0 57.0 0.0 0.0 0.0 37.0 1150.0
National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website 29.0 41.0 43.0 17.0 50.0 34.0 0.0 0.0 51.0 0.0 ... 45.0 0.0 0.0 169.0 65.0 81.0 35.0 0.0 0.0 705.0
WordPress for Beginners 28.0 0.0 38.0 0.0 28.0 0.0 0.0 0.0 37.0 0.0 ... 0.0 38.0 0.0 295.0 52.0 0.0 0.0 0.0 0.0 568.0
Girls Learning Code Day: Intro to HTML & CSS! (ages 8-13) 29.0 0.0 0.0 31.0 45.0 26.0 0.0 18.0 38.0 0.0 ... 29.0 0.0 0.0 0.0 33.0 79.0 30.0 0.0 54.0 454.0
Intro to JavaScript 0.0 19.0 0.0 27.0 0.0 24.0 0.0 0.0 51.0 0.0 ... 26.0 0.0 0.0 188.0 46.0 0.0 0.0 0.0 0.0 413.0

5 rows × 23 columns

Let's save events_by_city to a .csv file so we can use it elsewhere:

In [8]:
events_by_city.to_csv('data/llc-workshop-attendance-by-city.csv')

Here's a sorted breakdown by city for the most popular event, "Intro to HTML & CSS":

In [9]:
events_by_city.loc['Intro to HTML & CSS'].sort_values(ascending=False)
Out[9]:
City
Total                 1150.0
Toronto                668.0
Edmonton                81.0
Vancouver               57.0
Saskatoon               57.0
Montreal                53.0
Kitchener/Waterloo      51.0
Calgary                 40.0
Winnipeg                37.0
St. John's              36.0
Hamilton                30.0
London                  28.0
Regina                  12.0
N/A                      0.0
Ottawa                   0.0
Saint John               0.0
Sydney                   0.0
Victoria                 0.0
Halifax                  0.0
Fredericton              0.0
Waterloo                 0.0
Whitehorse               0.0
Barrie                   0.0
Name: Intro to HTML & CSS, dtype: float64

Here's another view of the data—a pivot table of number of participants per month, per city:

In [10]:
monthly_city = events.pivot_table(values='# Participants', index='City',
                                  columns='month', aggfunc='sum')
monthly_city = monthly_city.fillna(0)
monthly_city
Out[10]:
month 1 2 3 4 5 6 7 8 9 10 11 12
City
Barrie 28.0 0.0 16.0 0.0 20.0 13.0 0.0 0.0 46.0 0.0 29.0 16.0
Calgary 30.0 28.0 52.0 40.0 19.0 36.0 47.0 0.0 77.0 19.0 49.0 21.0
Edmonton 81.0 37.0 38.0 25.0 0.0 40.0 0.0 0.0 43.0 17.0 27.0 0.0
Fredericton 0.0 27.0 0.0 20.0 20.0 0.0 0.0 0.0 17.0 0.0 31.0 0.0
Halifax 0.0 28.0 0.0 26.0 0.0 21.0 0.0 0.0 50.0 25.0 66.0 0.0
Hamilton 0.0 30.0 24.0 42.0 15.0 0.0 0.0 0.0 34.0 0.0 26.0 0.0
Kitchener/Waterloo 0.0 0.0 0.0 51.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
London 0.0 39.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0 18.0 0.0
Montreal 53.0 51.0 49.0 43.0 15.0 37.0 0.0 0.0 51.0 0.0 38.0 0.0
N/A 32.0 92.0 47.0 85.0 50.0 53.0 13.0 0.0 0.0 5.0 18.0 32.0
Ottawa 52.0 5.0 0.0 16.0 40.0 0.0 32.0 0.0 45.0 0.0 74.0 0.0
Regina 0.0 0.0 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Saint John 0.0 0.0 0.0 15.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0
Saskatoon 57.0 0.0 19.0 0.0 0.0 26.0 0.0 0.0 45.0 0.0 29.0 0.0
St. John's 0.0 36.0 0.0 24.0 0.0 14.0 0.0 0.0 0.0 0.0 0.0 0.0
Sydney 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 15.0 0.0
Toronto 124.0 219.0 217.0 268.0 225.0 146.0 169.0 155.0 367.0 246.0 374.0 159.0
Vancouver 52.0 57.0 31.0 53.0 21.0 21.0 0.0 0.0 90.0 45.0 130.0 0.0
Victoria 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 81.0 88.0 79.0 48.0
Waterloo 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 35.0 0.0 30.0 0.0
Whitehorse 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14.0 0.0
Winnipeg 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 37.0 54.0 0.0

That's it for part 2 of the demo! To continue where you left off in part 1, and visualize some of the above data in graphs, click here.