This document is part 2 of a pandas
demo started here. Jupyter notebook source code and data are available in the GitHub repo.
Please email me with any questions, comments, or suggestions. I'd love to hear from you!
Continuing from Example 2 in part 1 of this demo, we first import the pandas
library and read the data from 'llc-workshop-data.csv'
into a DataFrame. Then we parse the dates into months, so we can later compute event attendance totals by month. Finally, we rename the "Quantity" column to the more descriptive "# Participants".
import pandas
%matplotlib inline
events = pandas.read_csv('data/llc-workshop-data.csv')
events['date'] = pandas.to_datetime(events['Date Attending'])
events['month'] = events['date'].dt.month
events = events.rename(columns={'Quantity' : '# Participants'})
events.head()
When you were viewing the 'llc-workshop-data.csv'
file, you may have noticed that most of the names in the "Event Name" column included the name of the city, in a variety of different formats (e.g. "in Toronto", "(Toronto Edition)", "(Toronto)", " - Toronto"). This situation is typical of real-world data: it's often messy and inconsistent, with two or more useful pieces of information smushed into one and no obvious way to extract those parts from the whole.
It would be great to know the city associated with each event and analyze our data on a city-by-city basis. Ideally, we might have a separate file which lists additional information for each event, including the city name. In that case, we could use pandas
to easily merge that data with the data we've been working with. However, we often don't have all the extra data we might want, so in that situation it would be nice to somehow extract the city names from the event names. But how would we even do this in a spreadsheet? It is possible, but it would be quite excruciating.
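To give a quick taste of what that merge would look like, here is a minimal sketch. The lookup table below is entirely made up (in practice it would come from a separate CSV of event details); the point is just that pandas can combine two tables on a shared column in one line:

```python
import pandas

# Hypothetical lookup table -- in a real project this would be read from
# a separate CSV listing extra details for each event
event_info = pandas.DataFrame({
    'Event Name': ['Intro to HTML & CSS in Toronto', 'Intro to Python (Vancouver)'],
    'City': ['Toronto', 'Vancouver'],
})

events_sketch = pandas.DataFrame({
    'Event Name': ['Intro to HTML & CSS in Toronto', 'Intro to Python (Vancouver)'],
    'Quantity': [40, 25],
})

# A left merge keeps every row of events_sketch and pulls in the matching city
merged = events_sketch.merge(event_info, on='Event Name', how='left')
print(merged)
```

With a `how='left'` merge, events missing from the lookup table would simply get a blank (NaN) city rather than being dropped.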
In Python, however, this sort of task is much, much easier. It requires knowledge of some fairly advanced concepts, but after you've learned those concepts it's pretty straightforward to apply them to a situation like this one. In the code below, I'm using "regular expressions" from a library called re
to look for text patterns that allow me to find and extract the city names. If you'd like to learn more about regular expressions, you can check out this tutorial and this handy cheatsheet.
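Before diving into the full function, here is a tiny standalone example of the idea (the event name below is just an illustration): `re.findall` searches a string for every occurrence of a pattern, and parentheses inside the pattern capture just the part we want.

```python
import re

# Look for a literal '(' and ')' (escaped with backslashes) wrapping a word
# that starts with a capital letter; the inner parentheses form a capture
# group, so findall returns only the city name itself
matches = re.findall(r'\(([A-Z][a-z]+)\)', 'Intro to HTML & CSS (Toronto)')
print(matches)  # ['Toronto']
```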
The code below may look overwhelming, but as before, don't worry about trying to understand the details of it—this type of text wrangling is a pretty specialized niche of the data analysis ecosystem and may not be relevant to your work (but for any of you who do need to parse information from text data, it's nice to know that this sort of thing can be done in Python). The code is taking all the event names in our data, looking for the various patterns associated with the city names ("in Toronto", "(Toronto)", etc.) and then putting the city name and a standardized version of the event name (minus the city) into two new columns of our data table.
import re

def parse_event_name(name):
    """Return a pandas Series with the event name parsed into event and city"""
    event, city = name, None
    # Small workaround to keep some patterns from being mistaken for cities
    event = event.replace('(Responsive Design)', '[Responsive Design]')
    event = event.replace(' - Holiday Edition', '[Holiday Edition]')
    # Check for cities listed as '(Vancouver Edition)', '(Vancouver)', '(Winnipeg Launch)', etc.
    matches = re.findall(r'\([A-Z][a-z]+.*\)', event)
    if len(matches) > 0:
        event = event.replace(matches[0], '').strip()
        city = matches[0].replace('(', '').replace(')', '')
        city = city.replace('Edition', '').replace('Launch', '').strip()
    # Check for cities listed as 'in Vancouver', 'in Toronto', etc.
    matches = re.findall(r' in [A-Z][a-z]+', event)
    if len(matches) > 0:
        event = event.replace(matches[0], '').strip()
        city = matches[0].replace(' in ', '').strip()
    # Check for cities listed as ' - Vancouver', ' - Toronto', etc.
    matches = re.findall(r' - [A-Z][a-z]+', event)
    if len(matches) > 0:
        event = event.replace(matches[0], '').strip()
        city = matches[0].replace(' - ', '').strip()
    # Some events are named 'Intro to' and others 'Introduction to';
    # standardize them all to the same convention
    event = event.replace('Introduction to', 'Intro to')
    # Return the output as a pandas Series
    output = pandas.Series({'Event Name': name,
                            'Event Name (standardized)': event,
                            'City': city})
    return output
# Apply the above function to the entire "Event Name" column of our data table,
# to parse all the event names, and add columns to our data table with the
# standardized event name and the city name
df = events['Event Name'].apply(parse_event_name).fillna('N/A')
events = events.join(df[['Event Name (standardized)', 'City']], how='outer')
events[['Event Name', 'Event Name (standardized)', 'City']].head()
Now we have a column of city names, and with the code below we can easily check that they make sense and see the number of participants for each city. When I was writing the parse_event_name
function above, I started with just one pattern to look for the city name, then used the code below to see what other patterns were in the data and updated parse_event_name
to handle them, building up my code iteratively. pandas
makes it easy to work this way: check for each inconsistency or bit of messiness in your data, then add a few lines of code to account for it and standardize the data into a consistent format.
city_counts = events['City'].value_counts(dropna=False)
city_counts
It looks like the code is working the way we want it to, giving us a correct list of city names. For any event name where parse_event_name
couldn't find a city, we've set the city to "N/A". In the code below, we can check all those events and make sure we didn't miss anything. This was another step I used in building up my parse_event_name
function.
no_city = list(set(events.loc[events['City'] == 'N/A', 'Event Name']))
for name in no_city:
    print(name)
Looks like everything is working the way we want it to! The event names above are all missing the city name, so it is correct to list the city as "N/A" for these events.
Now that we've extracted the city names and standardized the event names, we can see the number of participants across all cities, for each event:
events['Event Name (standardized)'].value_counts()
You may recall from part 1 of the demo that we found a total of 799 participants for National Learn to Code Day events. In the table above, we have a total of 705 participants for the event "National Learn to Code Day 2014 Intro to HTML & CSS: Building a Multi-Page Website". The numbers don't quite match up, suggesting that there are multiple different event names in the "National Learn to Code Day" category. Let's explore the data to see the different "flavours" of National Learn to Code Day events, and the number of participants in each:
national = events['Event Name'].str.contains('National Learn to Code Day')
national_counts = events.loc[national, 'Event Name (standardized)'].value_counts()
print(national_counts)
print('\n*** TOTAL for National Learn to Code Day: ' + str(national_counts.sum()) + ' ***')
We can see that there were 6 different events for National Learn to Code Day, and the grand total attendance for these 6 events matches our previous findings.
Now let's analyze the data on a city-by-city basis to extract even more insights from it. As an example, we can create a pivot table of the attendance for each event, sorted from highest total attendance to lowest, and broken down by city. Here are the first 5 rows of that pivot table:
events_by_city = events.pivot_table(values='# Participants',
index='Event Name (standardized)',
columns='City', aggfunc='sum')
events_by_city['Total'] = events_by_city.sum(axis=1)
events_by_city = events_by_city.sort_values('Total', ascending=False).fillna(0)
events_by_city.head()
Let's save events_by_city
to a .csv file so we can use it elsewhere:
events_by_city.to_csv('data/llc-workshop-attendance-by-city.csv')
Here's a sorted breakdown by city for the most popular event, "Intro to HTML & CSS":
events_by_city.loc['Intro to HTML & CSS'].sort_values(ascending=False)
Here's another view of the data—a pivot table of number of participants per month, per city:
monthly_city = events.pivot_table(values='# Participants', index='City',
columns='month', aggfunc='sum')
monthly_city = monthly_city.fillna(0)
monthly_city
That's it for part 2 of the demo! To continue where you left off in part 1, and visualize some of the above data in graphs, click here.