Please email me with any questions, comments, or suggestions. I'd love to hear from you!
This guide is primarily geared towards folks who are already doing a lot of data analysis in Excel or Google Sheets and are considering switching over to Python for these tasks. The objectives are:
If you're just learning Python for fun, or as an introduction to coding in general, you might want to check out the resources listed in Section 1, but for now you probably want to stick with the PyCharm setup that we used in the workshop, rather than installing any new programs. And if your primary goal with Python is more traditional software development (such as web apps, etc.) rather than data analysis, the approach I suggest in Section 3 also might not be the best setup for you, and you'll want to explore other resources that are more oriented towards traditional software development. If you're an avid data cruncher, then read on!
There are many online Python courses which are great for beginners. Below are a few popular ones—the first two are oriented specifically towards data analysis. These courses are either free or have some free content you can sample.
The free resources below are geared towards scientists and are a bit faster paced, but still start with the basics:
For data analysis, you'll want to develop a good understanding of Python fundamentals up to about an intermediate level. This would include delving deeper into all the topics we covered in the workshop (e.g. variables, types, if statements and logic, lists, for loops, dictionaries, libraries, and reading from files) and additional topics such as: string operations, tuples, sets, namespaces, user-defined functions, file output, modules, and debugging. Depending on your Python needs, you might also want to learn about other topics such as packaging, classes, error handling, and virtual environments.
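For instance, here's a tiny sketch (using made-up names) that combines a few of these topics—a user-defined function, string operations, and file output:
# A small user-defined function that uses string operations
def format_name(first, last):
    return last.upper() + ', ' + first.title()

names = [('ada', 'lovelace'), ('grace', 'hopper')]

# File output: write each formatted name on its own line
with open('names.txt', 'w') as outfile:
    for first, last in names:
        outfile.write(format_name(first, last) + '\n')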
As we saw in the workshop, Python comes with many handy built-in functions, like print(), type(), and int(). It also comes with some handy built-in libraries that we can use—in your workshop slides, see the ones titled "Using a Library". A library (also called a "package") is a collection of Python code that is grouped together and given a name (e.g. csv), and after we import a library (e.g. import csv) we can then use all the pieces of code that are inside of it, such as functions (e.g. csv.DictReader()).
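For example, reading a file with the built-in csv library might look roughly like this (a minimal sketch, assuming a hypothetical 'example.csv' file with a column named 'City'):
import csv

# csv.DictReader gives us each row as a dictionary keyed by column name
with open('example.csv') as infile:
    for row in csv.DictReader(infile):
        print(row['City'])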
But if you're doing some serious data crunching, you're going to need additional libraries that don't automatically come with the Python that was already on your computer (if you're on a Mac) or which you downloaded from https://www.python.org/ (if you're on Windows). These are called 3rd party libraries, and they're free and available for anyone to use. There are many libraries that are useful for data analysis; three of the main ones are numpy, matplotlib, and pandas.
numpy is a numerical library that efficiently performs calculations on arrays of data. An array is a collection of variables of the same type (such as integer, float, string, etc.) that are organized into one or more dimensions. A 1-dimensional array is like a single column or row in a spreadsheet. A 2-dimensional array is like a table of rows and columns in a spreadsheet. Pretty much every data-related Python library uses numpy behind the scenes—it's the heart of number crunching in Python.
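As a quick illustrative sketch (with made-up numbers), here's what 1-dimensional and 2-dimensional numpy arrays look like:
import numpy as np

# A 1-dimensional array, like a single column in a spreadsheet
column = np.array([1.5, 2.0, 3.25])

# A 2-dimensional array, like a small table of rows and columns
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(column.mean())   # calculations apply to the whole array at once
print(table.shape)     # (2, 3): 2 rows, 3 columns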
matplotlib is a data visualization library, for making graphs such as line, bar, and scatter charts, histograms, heatmaps, contour plots, and many others. It gives the user very fine control to customize all aspects of the layout and style of each graph. matplotlib is the oldest data visualization library in Python, and very widely used. Many newer visualization libraries use matplotlib behind the scenes, and build upon its features to create additional types of plots (e.g. geographic maps, advanced statistical graphs, etc.) or provide convenient sets of pre-defined styles for different aesthetic effects.
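To give you a feel for it, here's a minimal matplotlib sketch (with made-up data) that draws a simple line chart:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

plt.plot(x, y, marker='o')    # line chart with a marker at each point
plt.xlabel('x values')
plt.ylabel('y values')
plt.title('A simple line chart')
plt.show()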
pandas is a library for working with tabular (spreadsheet-like) data. It combines the number crunching power of numpy with the visualization power of matplotlib to reproduce pretty much all the spreadsheet functionality of Excel and Google Sheets, plus a whole lot more. Compared to working in spreadsheets, pandas makes it much easier to handle messy and missing data, merge data from many different files, filter data to work with many different subsets, quickly summarize data in graphs and pivot tables with just a few lines of code, analyze timeseries data in an intuitive way, and much more.
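As a rough sketch of the flavour of pandas (assuming a hypothetical 'sales.csv' file with 'region' and 'amount' columns), a filter-and-summarize task might look like this:
import pandas

sales = pandas.read_csv('sales.csv')              # read a spreadsheet-like table
west = sales[sales['region'] == 'West']           # filter to a subset of rows
print(west['amount'].mean())                      # average amount in that subset
print(sales.groupby('region')['amount'].sum())    # summarize by group, like a pivot table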
Python combined with 3rd party libraries like these (and many others) creates a powerful, full-featured ecosystem for data analysis. The numpy, matplotlib, and pandas libraries all require a solid understanding of Python basics, so I recommend using some of the resources from Section 1, or any others you find helpful, to keep learning and practicing until you feel comfortable with Python basics at a beginner level, before diving into the larger Python data analysis ecosystem. However, even if you're just starting out with Python, you can do your future self a favour by setting up your computer with a development environment you can use now as you're starting out and throughout the rest of your journey into the larger ecosystem. In the next section, I'll walk you through the steps for this setup.
I always say Python is like a "choose your own adventure" because there are many different ways of doing everything—from installing it on your computer, to choosing a program for editing and running code (e.g. PyCharm vs. Jupyter), to choosing which commands and functions you use to accomplish a task. This makes it very flexible and easy to do things exactly the way you want, but it can also be overwhelming when you're just starting out and don't know where to go next!
To use the Python data analysis ecosystem, you need to install 3rd party libraries such as numpy, matplotlib, and pandas in such a way that Python can find them. This can be quite tricky, and there are a few different ways of doing it. Here I will be suggesting one approach, which is widely used in the data analysis, data science, and scientific computing communities, but this certainly isn't the only way to do things. You may find as you explore and experiment with different options that another approach works better for you, so don't feel obligated to do everything the way I've suggested here.
All the software I recommend here is available for free.
If data analysis is the main reason you're using Python, I highly recommend installing Anaconda. This free software comes with everything you need to get started with data analysis, so that whenever you're ready to dive in and start learning numpy, matplotlib, pandas, etc., you can do so without having to worry about the details of how to find, install, and manage all these libraries. Anaconda is a "Python distribution", which means it includes the official Python language, like we used in the workshop, plus some extra stuff:
- Hundreds of pre-installed 3rd party libraries, including the main data analysis libraries such as numpy, matplotlib, and pandas.
- Tools for interactive coding, such as JupyterLab and the Jupyter notebook.
- A package and environment manager (called conda) that you can use if you need to install any additional libraries and/or manage virtual environments.

When you download Anaconda, you'll need to select either the latest Python 3 version (currently Python 3.6) or Python 2.7—here's the first choice in the choose your own adventure! Either one is fine. I recommend selecting the latest Python 3 unless you already know that you need Python 2.7 for some reason. You can always install another Python version later if you need it, and multiple versions can co-exist peacefully in virtual environments in Anaconda without interfering with each other.
For details on installing Anaconda on your computer, please see the setup instructions from my PyLadies workshop. Once you've gone through this process to set up your computer with Anaconda, the great thing is that you'll have all the main data analysis libraries immediately available whenever you're ready to dive into them. You won't need to go searching around for the libraries you need and installing them yourself—they're already installed and ready for you!
In the workshop, we used the Python console in PyCharm to type in code and run it on the fly (interactive coding). This console is okay for occasional interactive coding, but it is quite limited. Data analysis is often a very interactive process: read in data from a file, do some calculations, show the results in a table or figure, and based on what you see in the results, perform additional calculations, find new directions to explore, and so on. For this type of workflow, you might find JupyterLab an easier environment to work in compared to PyCharm.
JupyterLab is a development environment specifically designed for data analysis and scientific computing, with many enhanced features to make interactive coding much easier, and it comes pre-loaded with Anaconda, so you don't need to install any additional software to start using it.
One of the most important Jupyter tools is the Jupyter notebook, where you can run Python code and also include formatted text, equations, graphs, images, and other content. It's a very handy tool in data analysis, especially when you're exploring your data and when sharing your results with others. This guide was written as a Jupyter notebook!
You can edit and run Jupyter notebooks within JupyterLab. The JupyterLab environment also includes a CSV file viewer, text editor, and other useful tools. I'm putting together a quick-start tutorial to help folks quickly get up and running with JupyterLab and Jupyter notebooks—it's currently under construction, but please check the link in early September 2018! You can also edit and run Jupyter notebooks from a standalone app, as demonstrated in this tutorial.
If you'd like to use PyCharm with your newly installed Anaconda version of Python, you can follow the instructions in this guide. Once you've gone through this process to set up PyCharm with Anaconda, you'll have access to all the 3rd party libraries from Anaconda when you run code within PyCharm.
As I've mentioned, one of the main libraries for data analysis is pandas. Here I'll show you examples of the cool things you can do with pandas. This section is intended only as a demo, rather than as a lesson or tutorial, so don't worry about understanding any of the code for now. For each bit of code, you can just check out the description above and the output below, to get a sense of what the code is doing.
Hopefully this demo will help you decide whether pandas is a library that might be useful to you in your work. As a caveat, there is a significant learning curve to get from Python beginner to pandas user, but going through this learning curve is likely to be a worthwhile investment if you're in any or all of the following situations:
The examples in this demo will start by reproducing the analysis we did in the workshop, to give you a sense of some of the key features of pandas and the ways that it could make your life easier. Then I will show some deeper analyses of the workshop data, getting into fairly advanced features of Python and pandas in a separate part 2, to show how powerful they are and give you an idea of what's possible with these tools (especially if you're already familiar with pivot tables and other advanced spreadsheet features). If the examples start to get too advanced or difficult to follow, don't get discouraged or overwhelmed, just skip them! You can get a lot of value out of the basic features of the pandas library, and you may not have any need for the more advanced stuff.
This demo uses a program called Jupyter notebook to combine text descriptions, code, graphs, and other output into a single "notebook". Jupyter notebook comes pre-installed with Anaconda—hurray! For more information on Jupyter notebook, check out Section 6.
So, without further ado, let's take the data that we used in our workshop, and see how pandas can help us analyze it.
First we import pandas and create a variable chapters_data that stores the data from 'llc-chapters.csv'. This variable is a type that is available in pandas, called a DataFrame. When we display it here in the notebook (by putting the variable name chapters_data as the last line in the code blurb below) we can see that it's like a table from a spreadsheet or a .csv file, with data stored in rows and columns.
import pandas
chapters_data = pandas.read_csv('data/llc-chapters.csv')
chapters_data
How big is our table? We can look at its shape to see that there are 17 rows and 3 columns:
nrows, ncols = chapters_data.shape
print('There are ' + str(nrows) + ' chapters and ' + str(ncols) + ' columns of data.')
We can look at a single row of our table:
first_row = chapters_data.iloc[0].to_dict()
first_row
And we can look at a single column. The numbers on the left correspond to the row numbers, called the "index" (remember that Python indexes start counting from 0).
leaders = chapters_data['Chapter Lead(s)']
leaders
As we did in exercise 5, we can find which chapters have co-leaders, count them, and display the city names. With pandas we can do this with a few lines of code and we don't need any for loops!
coleads = leaders.str.contains('&')
n_coleads = coleads.sum()
print('There are ' + str(n_coleads) + ' chapters with co-leads. These chapters are:')
chapters_data.loc[coleads, 'City']
Not needing a for loop actually becomes really important when you're working with big data files. Python loops are fine for small data sets like the ones we explored in the workshop, but for much bigger data they are too slow. When you use pandas, which uses numpy behind the scenes, the library code is optimized to work much faster and more efficiently with large amounts of data than a plain Python loop.
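As a rough illustration of the difference (not the workshop data, just a sketch), here's the same sum written as a plain Python loop and as a vectorized numpy operation:
import numpy as np

values = list(range(1000000))

# Plain Python loop: processes the numbers one at a time
total = 0
for v in values:
    total = total + v

# Vectorized numpy version: the looping happens inside fast, optimized code
arr = np.array(values)
total_fast = arr.sum()

print(total, total_fast)    # both give the same answer; numpy is much faster on big arrays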
Here we have a slightly larger data set ('llc-workshop-data.csv') and we can really start to see the power of pandas!
First we read the data into a variable events and see that it has 6487 rows and 11 columns.
events = pandas.read_csv('data/llc-workshop-data.csv')
events.shape
Let's look at the first few rows of our data table. Anywhere you see "NaN" in the table, that stands for "not a number" and indicates an empty cell in the table (missing data).
events.head()
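As a quick aside (just a sketch of one handy trick), pandas can also count the missing values in each column for us:
events.isnull().sum()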
We can see that our data encompasses events from Jan 8, 2014 to Dec 13, 2014:
print('First event ' + events['Date Attending'].min())
print('Last event ' + events['Date Attending'].max())
Each row of the events table corresponds to a participant in an event, so for a single event there are many rows which will all have the same value in the "Event Name" column. With a single line of code, and no for loops, we can obtain a sorted list of all the event names and the number of participants for each event name!
Let's check out the 20 event names with the most participants. Many of these events occurred multiple times during the year, so what we're looking at is the grand total over the year, for each event name.
events['Event Name'].value_counts().head(20)
Let's find how many people participated in "National Learn to Code Day" events. Once again, no loops required!
national = events['Event Name'].str.contains('National Learn to Code Day')
n_national = national.sum()
print(str(n_national) + ' attended National Learn to Code Day')
And now let's find the totals for "Kids Learning Code" plus "Girls Learning Code" events, and any youth-oriented "National Learn to Code Day" events:
kids = events['Event Name'].str.contains('Kids Learning Code')
girls = events['Event Name'].str.contains('Girls Learning Code')
youth = kids | girls
print(str(youth.sum()) + ' attended Kids Learning Code or Girls Learning Code events')
kids_national = events.loc[national, 'Event Name'].str.contains('Kids Learning Code')
girls_national = events.loc[national, 'Event Name'].str.contains('Girls Learning Code')
youth_national = kids_national | girls_national
print(str(youth_national.sum()) + ' attended youth-oriented National Learn to Code Day')
Now let's use pandas to explore the data further. When you were viewing the 'llc-workshop-data.csv' file, you may have noticed there were different categories of participants in the "Ticket Type" column. Let's see the 20 most common participant categories, and the total number of participants in each:
events['Ticket Type'].value_counts().head(20)
Let's suppose we only want to count National Learn to Code Day participants in the "Yes, I'd like to attend!" category. Here is that count:
ticket = "Yes, I'd like to attend!"
attend = events['Ticket Type'] == ticket
national_attend = national & attend
print(str(national_attend.sum()) + ' attended National Learn to Code Day with a ticket type: ' + ticket)
What about patterns over time? Let's see how many participants there were each month:
events['date'] = pandas.to_datetime(events['Date Attending'])
events['month'] = events['date'].dt.month
monthly = events[['Quantity', 'month']].groupby('month').sum()
monthly = monthly.rename(columns = {'Quantity' : '# Participants'})
monthly
It would be nice to see this data as a graph. We can do this with just a couple of lines of code. The %matplotlib inline line is a special command that tells Jupyter notebook to display graphs inline in the notebook.
%matplotlib inline
monthly.plot.bar(title='Monthly Participants');
How about monthly totals for the categories we looked at earlier?
categories = ['National Learn to Code Day', 'Kids Learning Code', 'Girls Learning Code']
monthly_cat = pandas.DataFrame(index=range(1, 13), columns=categories)
monthly_cat.index.name = 'month'
for category in categories:
    data = events[events['Event Name'].str.contains(category)]
    monthly_cat[category] = data[['Quantity', 'month']].groupby('month').sum()['Quantity']
monthly_cat = monthly_cat.fillna(0)
monthly_cat
Looks pretty good, except you might notice something strange... National Learn to Code Day was in September, but there is one participant listed for it in November. What's going on? With pandas we can easily investigate:
wonky = events['Event Name'].str.contains('National Learn to Code Day')
wonky = wonky & (events['month'] == 11)
events[wonky]
Sure enough, here is one participant listed for a National Learn to Code Day event, but with "Date Attending" as November 8. Let's check the rows that have the same Event ID, which should correspond to the same specific event. Here are the first 10 of those rows:
event_id = int(events.loc[wonky, 'Event ID'])
events[events['Event ID'] == event_id].head(10)
We can see that in row 5279, the event is listed as a "National Learn to Code Day" event, but the other entries for this event ID are listed as a "Girls Learning Code Day" event, suggesting an error might have somehow crept in when the data was entered. Since pandas allows us to quickly summarize the data from many different angles, such as the monthly totals above, it becomes much easier to spot inconsistencies and possible errors.
Now let's check out these monthly totals by category as a bar chart:
monthly_cat.plot.bar(title='Monthly Participants');
You may have noticed in the code above that we didn't include any commands to specify the x-axis label or the names of the series in the legend. When you create a graph in pandas, it automatically labels things for you, based on the row and column names of your data table. You can change the labels if you want (for example, make the x-axis label "Month" instead of "month"), but when you're first exploring your data it's really handy to be able to quickly generate graphs with all the labels automatically created for you.
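For example, if we did want to tweak the automatic labels, it would only take a line or two (a quick sketch using the axes object returned by the plot):
ax = monthly_cat.plot.bar(title='Monthly Participants')
ax.set_xlabel('Month')           # replace the automatic 'month' label
ax.legend(title='Event type');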
How about the number of participants at each event? Here's what that looks like:
att_by_event = events.groupby('Event ID').sum()['Quantity']
print(str(att_by_event.max()) + ' attended the biggest event')
print(str(att_by_event.min()) + ' attended the smallest event')
print(str(round(att_by_event.mean(), 1)) + ' was the average attendance per event')
Now let's look at totals by gender. Looking at participants in all the events, we see that they are about 32% female, 6% male, and 62% not listed.
genders = events['Gender'].value_counts(dropna=False).to_frame(name='# Participants')
genders = genders.set_index(genders.index.fillna('Not listed'))
genders['% Participants'] = 100 * genders['# Participants'] / genders['# Participants'].sum()
genders
And there are many other things we could do to explore this data further. To dive even deeper into pandas and explore some of its more advanced features, you can check out part 2 of this demo.
Hopefully this section has given you an idea of whether pandas would be a good tool to use in your own data analysis!
There are many, many different data visualization libraries for making graphs and images in Python. In this section, I'll show examples of the kinds of graphs you can make with pandas (which uses matplotlib behind the scenes for creating graphs) and another popular library, seaborn, which is also built on matplotlib. For many more examples of visualizations that can be created in these libraries, you can scroll through the images in these galleries:
The above libraries are just a tiny fraction of the data visualization world in Python, which is a whole ecosystem of its own. Another important cluster of libraries within this ecosystem are libraries for creating interactive web-based plots. Two popular ones are bokeh and plotly, but there are many others.
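As a tiny taste (a minimal sketch using bokeh's basic plotting interface, with made-up data), an interactive line chart that you can pan and zoom in a web browser only takes a few lines:
from bokeh.plotting import figure, output_file, show

output_file('example.html')    # save the interactive plot as an HTML file
p = figure(title='Example interactive plot', x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)                        # open the plot in a web browser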
For a more complete picture of the different visualization tools available in Python, check out this great talk by Jake VanderPlas.
This section is intended only as a demo, rather than as a lesson or tutorial, so don't worry about understanding any of the code for now. For each bit of code, just check out the description above and the output below, to get a sense of what the code is doing.
We'll start out by importing some of the libraries we'll use in this section, and renaming them to shorthand names (np, plt, pd, and sns). These are standard shorthands that you'll likely see throughout online tutorials and documentation for these libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Next we read in some data from part 2 of the pandas demo, which gives a breakdown by city of the Canada Learning Code events in 2014:
events_by_city = pd.read_csv('data/llc-workshop-attendance-by-city.csv', index_col=0)
events_by_city.head()
Here's a stacked bar chart of the breakdown by city for the five most popular Canada Learning Code events in 2014 (from the example data we worked with in the previous section):
top_five = events_by_city.head(5).drop('Total', axis=1).T
top_five.plot.bar(stacked=True, figsize=(10, 4))
plt.ylabel('# Participants')
plt.legend(loc='upper left', frameon=False);
We can show the data as a horizontal stacked bar chart instead. Let's also sort the cities by the total number of participants across these top 5 events, and add some gridlines.
sorted_top_five = top_five.copy()
sorted_top_five['Total'] = sorted_top_five.sum(axis=1)
sorted_top_five = sorted_top_five.sort_values('Total', ascending=True).drop('Total', axis=1)
with sns.axes_style('darkgrid'):
    sorted_top_five.plot.barh(stacked=True, figsize=(10, 7))
    plt.xlabel('# Participants')
    plt.legend(loc='center right')
Here's an example of a line chart, using some fake timeseries data:
dates = pd.date_range('2017-01-01', '2017-12-31')
ts = pd.DataFrame(np.random.randn(len(dates), 3), index=dates, columns=['A', 'B', 'C'])
ts = ts.cumsum()
ts.plot(figsize=(8, 3));
We can also plot these as separate line charts in subplots:
ts.plot(figsize=(8, 6), subplots=True);
Here's an example of creating a plot in the seaborn library, which can quickly generate graphs for many common statistical analyses.
We'll use a famous data set, Fisher's iris data, which is a set of measurements of the petals and sepals of three species of Iris flowers. This data is available as a built-in example that can be loaded from seaborn.
iris = sns.load_dataset('iris')
iris.head()
Let's look at distributions and relationships between measurements. With a single line of code, we can create a set of scatter plots and histograms for all four variables:
sns.pairplot(iris, hue="species", size=2);
seaborn also provides a variety of pre-configured styles that can be applied to quickly change the look of a graph or a whole collection of graphs. Here's an example of applying a different style to the above graph:
palette = sns.color_palette('deep')
with sns.axes_style('darkgrid'):
    sns.pairplot(iris, hue="species", size=2, palette=palette);
The example graphs in this section could easily be customized in many other ways. To get a sense of these customizations, and of the other kinds of graphs that can be created with these libraries, check out the galleries linked at the start of this section.
We've seen a very small sampling of Python's data visualization capabilities. There are many more libraries you can explore, which have been developed for a wide variety of purposes, such as maps and geographic data, 3-dimensional visualizations, interactive graphs for web pages, and more!
When you've become comfortable with the fundamentals of Python and want to start learning numpy, matplotlib, pandas, and other data analysis libraries, there are many resources available to choose from. Of the sites I listed in Section 1, DataCamp and DataQuest both offer online courses covering these topics in depth, and there are many other online learning platforms with in-depth courses on these topics as well. You can also search for tutorials for each library to find many helpful resources.
If you'd like to learn more about the Python data analysis ecosystem, such as how all the pieces fit together, which tools might be useful for you, how Python evolved into the data crunching powerhouse that it is today, and some of the challenges for new users navigating the Python world, I highly recommend this fantastic keynote talk by Jake VanderPlas from the PyData Seattle 2017 conference.
Another great resource from Jake VanderPlas is his Youtube video series "Reproducible Data Analysis in Jupyter". Part 1 and part 2 give a really nice demonstration of how to use pandas and Jupyter notebook to analyze patterns in a dataset of hourly counts of bike trips across the Fremont bridge in Seattle. If you're at a more intermediate or advanced level with Python, you might want to check out the rest of the videos in the 10-part series for all sorts of great additional tips and tricks. If you're familiar with principal component analysis and unsupervised clustering, check out part 10 for a very neat analysis showing how the bike trip data can be used to distinguish regular weekdays from weekends, holidays, and even a big snow storm!
For some more background on why I think Jupyter is so awesome for data analysis, you can check out this great talk by Fernando Perez, who was the original creator of these tools and now leads a large team of developers at Berkeley, working on all kinds of cool new features for the Jupyter project. I especially like his philosophy about "human-centered" interactive software for scientific computing and data analysis, and the ethical importance of free, open-source tools to make science accessible to folks all around the world who may not all have budgets for expensive proprietary software.
I like learning from books and when I was learning Python I found this book, by Wes McKinney, creator of pandas, very helpful. It has a companion site with data and Jupyter notebooks corresponding to all the examples in the book, so you can work through them yourself.
If you really want to geek out, check out the Zen of Python here or type import this into the IPython shell or PyCharm console.
Hopefully this guide will help you decide if Python is a good fit for the work you're doing, and if so, will help you navigate the world of Python data analysis and give you some ideas of your next steps and path forward. If you have any questions, comments, suggestions, or any other feedback, please send me an email—I'd love to hear from you. Good luck in your Python journey!