Please email me with any questions, comments, or suggestions. I'd love to hear from you!
This guide is primarily geared towards folks who are already doing a lot of data analysis in Excel or Google Sheets and are considering switching over to Python for these tasks. The objectives are:
If you're just learning Python for fun, or as an introduction to coding in general, you might want to check out the resources listed in Section 1, but for now you probably want to stick with the PyCharm setup that we used in the workshop, rather than installing any new programs. And if your primary goal with Python is more traditional software development (such as web apps, etc.) rather than data analysis, the approach I suggest in Section 3 also might not be the best setup for you, and you'll want to explore other resources that are more oriented towards traditional software development. If you're an avid data cruncher, then read on!
There are many online Python courses which are great for beginners. Below are a few popular ones—the first two are oriented specifically towards data analysis. These courses are either free or have some free content you can sample.
The free resources below are geared towards scientists and are a bit faster paced, but still start with the basics:
For data analysis, you'll want to develop a good understanding of Python fundamentals up to about an intermediate level. This would include delving deeper into all the topics we covered in the workshop (e.g. variables, types, if statements and logic, lists, for loops, dictionaries, libraries, and reading from files) and additional topics such as: string operations, tuples, sets, namespaces, user-defined functions, file output, modules, and debugging. Depending on your Python needs, you might also want to learn about other topics such as packaging, classes, error handling, and virtual environments.
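For instance, here's a tiny sketch (using made-up names) that combines a few of these topics—a user-defined function, string operations, and file output:
# A small user-defined function that uses string operations
def format_name(first, last):
    return last.upper() + ', ' + first.title()

names = [('ada', 'lovelace'), ('grace', 'hopper')]

# File output: write each formatted name on its own line
with open('names.txt', 'w') as outfile:
    for first, last in names:
        outfile.write(format_name(first, last) + '\n')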
As we saw in the workshop, Python comes with many handy built-in functions, like print(), type(), and int(). It also comes with some handy built-in libraries that we can use—in your workshop slides, see the ones titled "Using a Library". A library (also called a "package") is a collection of Python code that is grouped together and given a name (e.g. csv), and after we import a library (e.g. import csv) we can then use all the pieces of code that are inside of it, such as functions (e.g. csv.DictReader()).
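For example, reading a file with the built-in csv library might look roughly like this (a minimal sketch, assuming a hypothetical 'example.csv' file with a column named 'City'):
import csv

# csv.DictReader gives us each row as a dictionary keyed by column name
with open('example.csv') as infile:
    for row in csv.DictReader(infile):
        print(row['City'])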
But if you're doing some serious data crunching, you're going to need additional libraries that don't automatically come with the Python that was already on your computer (if you're on a Mac) or which you downloaded from https://www.python.org/ (if you're on Windows). These are called 3rd party libraries, and they're free and available for anyone to use. There are many libraries that are useful for data analysis; three of the main ones are numpy, matplotlib, and pandas.
numpy is a numerical library that efficiently performs calculations on arrays of data. An array is a collection of variables of the same type (such as integer, float, string, etc.) that are organized into one or more dimensions. A 1-dimensional array is like a single column or row in a spreadsheet. A 2-dimensional array is like a table of rows and columns in a spreadsheet. Pretty much every data-related Python library uses numpy behind the scenes—it's the heart of number crunching in Python.
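As a quick illustrative sketch (with made-up numbers), here's what 1-dimensional and 2-dimensional numpy arrays look like:
import numpy as np

# A 1-dimensional array, like a single column in a spreadsheet
column = np.array([1.5, 2.0, 3.25])

# A 2-dimensional array, like a small table of rows and columns
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(column.mean())   # calculations apply to the whole array at once
print(table.shape)     # (2, 3): 2 rows, 3 columns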
matplotlib is a data visualization library, for making graphs such as line, bar, and scatter charts, histograms, heatmaps, contour plots, and many others. It gives the user very fine control to customize all aspects of the layout and style of each graph. matplotlib is the oldest data visualization library in Python, and very widely used. Many newer visualization libraries use matplotlib behind the scenes, and build upon its features to create additional types of plots (e.g. geographic maps, advanced statistical graphs, etc.) or provide convenient sets of pre-defined styles for different aesthetic effects.
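To give you a feel for it, here's a minimal matplotlib sketch (with made-up data) that draws a simple line chart:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

plt.plot(x, y, marker='o')    # line chart with a marker at each point
plt.xlabel('x values')
plt.ylabel('y values')
plt.title('A simple line chart')
plt.show()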
pandas is a library for working with tabular (spreadsheet-like) data. It combines the number crunching power of numpy with the visualization power of matplotlib to reproduce pretty much all the spreadsheet functionality of Excel and Google Sheets, plus a whole lot more. Compared to working in spreadsheets, pandas makes it much easier to handle messy and missing data, merge data from many different files, filter data to work with many different subsets, quickly summarize data in graphs and pivot tables with just a few lines of code, analyze timeseries data in an intuitive way, and much more.
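As a rough sketch of the flavour of pandas (assuming a hypothetical 'sales.csv' file with 'region' and 'amount' columns), a filter-and-summarize task might look like this:
import pandas

sales = pandas.read_csv('sales.csv')              # read a spreadsheet-like table
west = sales[sales['region'] == 'West']           # filter to a subset of rows
print(west['amount'].mean())                      # average amount in that subset
print(sales.groupby('region')['amount'].sum())    # summarize by group, like a pivot table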
Python combined with 3rd party libraries like these (and many others) creates a powerful, full-featured ecosystem for data analysis. The numpy, matplotlib, and pandas libraries all require a solid understanding of Python basics, so I recommend using some of the resources from Section 1, or any others you find helpful, to keep learning and practicing until you feel comfortable with Python basics at a beginner level, before diving into the larger Python data analysis ecosystem. However, even if you're just starting out with Python, you can do your future self a favour by setting up your computer with a development environment you can use now as you're starting out and throughout the rest of your journey into the larger ecosystem. In the next section, I'll walk you through the steps for this setup.
I always say Python is like a "choose your own adventure" because there are many different ways of doing everything—from installing it on your computer, to choosing a program for editing and running code (e.g. PyCharm vs. Jupyter), to choosing which commands and functions you use to accomplish a task. This makes it very flexible and easy to do things exactly the way you want, but it can also be overwhelming when you're just starting out and don't know where to go next!
To use the Python data analysis ecosystem, you need to install 3rd party libraries such as numpy, matplotlib, and pandas in such a way that Python can find them. This can be quite tricky, and there are a few different ways of doing it. Here I will be suggesting one approach, which is widely used in the data analysis, data science, and scientific computing communities, but this certainly isn't the only way to do things. You may find as you explore and experiment with different options that another approach works better for you, so don't feel obligated to do everything the way I've suggested here.
All the software I recommend here is available for free.
If data analysis is the main reason you're using Python, I highly recommend installing Anaconda. This free software comes with everything you need to get started with data analysis, so that whenever you're ready to dive in and start learning numpy, matplotlib, pandas, etc., you can do so without having to worry about the details of how to find, install, and manage all these libraries. Anaconda is a "Python distribution", which means it includes the official Python language, like we used in the workshop, plus some extra stuff:
- Hundreds of pre-installed 3rd party libraries, including the main data analysis libraries such as numpy, matplotlib, and pandas.
- Tools for interactive coding, such as JupyterLab and the Jupyter notebook.
- A package and environment manager (called conda) that you can use if you need to install any additional libraries and/or manage virtual environments.

When you download Anaconda, you'll need to select either the latest Python 3 version (currently Python 3.6) or Python 2.7—here's the first choice in the choose your own adventure! Either one is fine. I recommend selecting the latest Python 3 unless you already know that you need Python 2.7 for some reason. You can always install another Python version later if you need it, and multiple versions can co-exist peacefully in virtual environments in Anaconda without interfering with each other.
For details on installing Anaconda on your computer, please see the setup instructions from my PyLadies workshop. Once you've gone through this process to set up your computer with Anaconda, the great thing is that you'll have all the main data analysis libraries immediately available whenever you're ready to dive into them. You won't need to go searching around for the libraries you need and installing them yourself—they're already installed and ready for you!
In the workshop, we used the Python console in PyCharm to type in code and run it on the fly (interactive coding). This console is okay for occasional interactive coding, but it is quite limited. Data analysis is often a very interactive process: read in data from a file, do some calculations, show the results in a table or figure, and based on what you see in the results, perform additional calculations, find new directions to explore, and so on. For this type of workflow, you might find JupyterLab an easier environment to work in compared to PyCharm.
JupyterLab is a development environment specifically designed for data analysis and scientific computing, with many enhanced features to make interactive coding much easier, and it comes pre-loaded with Anaconda, so you don't need to install any additional software to start using it.
One of the most important Jupyter tools is the Jupyter notebook, where you can run Python code and also include formatted text, equations, graphs, images, and other content. It's a very handy tool in data analysis, especially when you're exploring your data and when sharing your results with others. This guide was written as a Jupyter notebook!
You can edit and run Jupyter notebooks within JupyterLab. The JupyterLab environment also includes a CSV file viewer, text editor, and other useful tools. I'm putting together a quick-start tutorial to help folks quickly get up and running with JupyterLab and Jupyter notebooks—it's currently under construction, but please check the link in early September 2018! You can also edit and run Jupyter notebooks from a standalone app, as demonstrated in this tutorial.
If you'd like to use PyCharm with your newly installed Anaconda version of Python, you can follow the instructions in this guide. Once you've gone through this process to set up PyCharm with Anaconda, you'll have access to all the 3rd party libraries from Anaconda when you run code within PyCharm.
As I've mentioned, one of the main libraries for data analysis is pandas. Here I'll show you examples of the cool things you can do with pandas. This section is intended only as a demo, rather than as a lesson or tutorial, so don't worry about understanding any of the code for now. For each bit of code, you can just check out the description above and the output below, to get a sense of what the code is doing.
Hopefully this demo will help you decide whether pandas is a library that might be useful to you in your work. As a caveat, there is a significant learning curve to get from Python beginner to pandas user, but going through this learning curve is likely to be a worthwhile investment if you're in any or all of the following situations:
The examples in this demo will start by reproducing the analysis we did in the workshop, to give you a sense of some of the key features of pandas and the ways that it could make your life easier. Then I will show some deeper analyses of the workshop data, getting into fairly advanced features of Python and pandas in a separate part 2, to show how powerful they are and give you an idea of what's possible with these tools (especially if you're already familiar with pivot tables and other advanced spreadsheet features). If the examples start to get too advanced or difficult to follow, don't get discouraged or overwhelmed, just skip them! You can get a lot of value out of the basic features of the pandas library, and you may not have any need for the more advanced stuff.
This demo uses a program called Jupyter notebook to combine text descriptions, code, graphs, and other output into a single "notebook". Jupyter notebook comes pre-installed with Anaconda—hurray! For more information on Jupyter notebook, check out Section 6.
So, without further ado, let's take the data that we used in our workshop, and see how pandas can help us analyze it.
First we import pandas and create a variable chapters_data that stores the data from 'llc-chapters.csv'. This variable is a type that is available in pandas, called a DataFrame. When we display it here in the notebook (by putting the variable name chapters_data as the last line in the code blurb below) we can see that it's like a table from a spreadsheet or a .csv file, with data stored in rows and columns.
import pandas
chapters_data = pandas.read_csv('data/llc-chapters.csv')
chapters_data
How big is our table? We can look at its shape to see that there are 17 rows and 3 columns:
nrows, ncols = chapters_data.shape
print('There are ' + str(nrows) + ' chapters and ' + str(ncols) + ' columns of data.')
We can look at a single row of our table:
first_row = chapters_data.iloc[0].to_dict()
first_row
And we can look at a single column. The numbers on the left correspond to the row numbers, called the "index" (remember that Python indexes start counting from 0).
leaders = chapters_data['Chapter Lead(s)']
leaders
As we did in exercise 5, we can find which chapters have co-leaders, count them, and display the city names. With pandas we can do this with a few lines of code and we don't need any for loops!
coleads = leaders.str.contains('&')
n_coleads = coleads.sum()
print('There are ' + str(n_coleads) + ' chapters with co-leads. These chapters are:')
chapters_data.loc[coleads, 'City']
Not needing a for loop actually becomes really important when you're working with big data files. Python loops are fine for small data sets like the ones we explored in the workshop, but for much bigger data they are too slow. When you use pandas, which uses numpy behind the scenes, the library code is optimized to work much faster and more efficiently with large amounts of data than a plain Python loop.
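As a rough illustration of the difference (not the workshop data, just a sketch), here's the same sum written as a plain Python loop and as a vectorized numpy operation:
import numpy as np

values = list(range(1000000))

# Plain Python loop: processes the numbers one at a time
total = 0
for v in values:
    total = total + v

# Vectorized numpy version: the looping happens inside fast, optimized code
arr = np.array(values)
total_fast = arr.sum()

print(total, total_fast)    # both give the same answer; numpy is much faster on big arrays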
Here we have a slightly larger data set ('llc-workshop-data.csv') and we can really start to see the power of pandas!
First we read the data into a variable events and see that it has 6487 rows and 11 columns.
events = pandas.read_csv('data/llc-workshop-data.csv')
events.shape
Let's look at the first few rows of our data table. Anywhere you see "NaN" in the table, that stands for "not a number" and indicates an empty cell in the table (missing data).
events.head()
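As a quick aside (just a sketch of one handy trick), pandas can also count the missing values in each column for us:
events.isnull().sum()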
We can see that our data encompasses events from Jan 8, 2014 to Dec 13, 2014:
print('First event ' + events['Date Attending'].min())
print('Last event ' + events['Date Attending'].max())
Each row of the events table corresponds to a participant in an event, so for a single event there are many rows which will all have the same value in the "Event Name" column. With a single line of code, and no for loops, we can obtain a sorted list of all the event names and the number of participants for each event name!
Let's check out the 20 event names with the most participants. Many of these events occurred multiple times during the year, so what we're looking at is the grand total over the year, for each event name.
events['Event Name'].value_counts().head(20)
Let's find how many people participated in "National Learn to Code Day" events. Once again, no loops required!
national = events['Event Name'].str.contains('National Learn to Code Day')
n_national = national.sum()
print(str(n_national) + ' attended National Learn to Code Day')
And now let's find the totals for "Kids Learning Code" plus "Girls Learning Code" events, and any youth-oriented "National Learn to Code Day" events:
kids = events['Event Name'].str.contains('Kids Learning Code')
girls = events['Event Name'].str.contains('Girls Learning Code')
youth = kids | girls
print(str(youth.sum()) + ' attended Kids Learning Code or Girls Learning Code events')
kids_national = events.loc[national, 'Event Name'].str.contains('Kids Learning Code')
girls_national = events.loc[national, 'Event Name'].str.contains('Girls Learning Code')
youth_national = kids_national | girls_national
print(str(youth_national.sum()) + ' attended youth-oriented National Learn to Code Day')
Now let's use pandas to explore the data further. When you were viewing the 'llc-workshop-data.csv' file, you may have noticed there were different categories of participants in the "Ticket Type" column. Let's see the 20 most common participant categories, and the total number of participants in each:
events['Ticket Type'].value_counts().head(20)
Let's suppose we only want to count National Learn to Code Day participants in the "Yes, I'd like to attend!" category. Here is that count:
ticket = "Yes, I'd like to attend!"
attend = events['Ticket Type'] == ticket
national_attend = national & attend
print(str(national_attend.sum()) + ' attended National Learn to Code Day with a ticket type: ' + ticket)
What about patterns over time? Let's see how many participants there were each month:
events['date'] = pandas.to_datetime(events['Date Attending'])
events['month'] = events['date'].dt.month
monthly = events[['Quantity', 'month']].groupby('month').sum()
monthly = monthly.rename(columns = {'Quantity' : '# Participants'})
monthly
It would be nice to see this data as a graph. We can do this with just a couple of lines of code. The %matplotlib inline line is a special command that tells Jupyter notebook to display graphs inline in the notebook.
%matplotlib inline
monthly.plot.bar(title='Monthly Participants');
How about monthly totals for the categories we looked at earlier?
categories = ['National Learn to Code Day', 'Kids Learning Code', 'Girls Learning Code']
monthly_cat = pandas.DataFrame(index=range(1, 13), columns=categories)
monthly_cat.index.name = 'month'
for category in categories:
    data = events[events['Event Name'].str.contains(category)]
    monthly_cat[category] = data[['Quantity', 'month']].groupby('month').sum()['Quantity']
monthly_cat = monthly_cat.fillna(0)
monthly_cat
Looks pretty good, except you might notice something strange... National Learn to Code Day was in September, but there is one participant listed for it in November. What's going on? With pandas we can easily investigate:
wonky = events['Event Name'].str.contains('National Learn to Code Day')
wonky = wonky & (events['month'] == 11)
events[wonky]
Sure enough, here is one participant listed for a National Learn to Code Day event, but with "Date Attending" as November 8. Let's check the rows that have the same Event ID, which should correspond to the same specific event. Here are the first 10 of those rows:
event_id = int(events.loc[wonky, 'Event ID'])
events[events['Event ID'] == event_id].head(10)
We can see that in row 5279, the event is listed as a "National Learn to Code Day" event, but the other entries for this event ID are listed as a "Girls Learning Code Day" event, suggesting an error might have somehow crept in when the data was entered. Since pandas allows us to quickly summarize the data from many different angles, such as the monthly totals above, it becomes much easier to spot inconsistencies and possible errors.
Now let's check out these monthly totals by category as a bar chart:
monthly_cat.plot.bar(title='Monthly Participants');
You may have noticed in the code above that we didn't include any commands to specify the x-axis label or the names of the series in the legend. When you create a graph in pandas, it automatically labels things for you, based on the row and column names of your data table. You can change the labels if you want (for example, make the x-axis label "Month" instead of "month"), but when you're first exploring your data it's really handy to be able to quickly generate graphs with all the labels automatically created for you.
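For example, if we did want to tweak the automatic labels, it would only take a line or two (a quick sketch using the axes object returned by the plot):
ax = monthly_cat.plot.bar(title='Monthly Participants')
ax.set_xlabel('Month')           # replace the automatic 'month' label
ax.legend(title='Event type');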
How about the number of participants at each event? Here's what that looks like:
att_by_event = events.groupby('Event ID').sum()['Quantity']
print(str(att_by_event.max()) + ' attended the biggest event')
print(str(att_by_event.min()) + ' attended the smallest event')
print(str(round(att_by_event.mean(), 1)) + ' was the average attendance per event')
Now let's look at totals by gender. Looking at participants in all the events, we see that they are about 32% female, 6% male, and 62% not listed.
genders = events['Gender'].value_counts(dropna=False).to_frame(name='# Participants')
genders = genders.set_index(genders.index.fillna('Not listed'))
genders['% Participants'] = 100 * genders['# Participants'] / genders['# Participants'].sum()
genders
And there are many other things we could do to explore this data further. To dive even deeper into pandas and explore some of its more advanced features, you can check out part 2 of this demo.
Hopefully this section has given you an idea of whether pandas would be a good tool to use in your own data analysis!
There are many, many different data visualization libraries for making graphs and images in Python. In this section, I'll show examples of the kinds of graphs you can make with pandas (which uses matplotlib behind the scenes for creating graphs) and another popular library, seaborn, which is also built on matplotlib. For many more examples of visualizations that can be created in these libraries, you can scroll through the images in these galleries:
The above libraries are just a tiny fraction of the data visualization world in Python, which is a whole ecosystem of its own. Another important cluster of libraries within this ecosystem are libraries for creating interactive web-based plots. Two popular ones are bokeh and plotly, but there are many others.
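As a tiny taste (a minimal sketch using bokeh's basic plotting interface, with made-up data), an interactive line chart that you can pan and zoom in a web browser only takes a few lines:
from bokeh.plotting import figure, output_file, show

output_file('example.html')    # save the interactive plot as an HTML file
p = figure(title='Example interactive plot', x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)                        # open the plot in a web browser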
For a more complete picture of the different visualization tools available in Python, check out this great talk by Jake VanderPlas.
This section is intended only as a demo, rather than as a lesson or tutorial, so don't worry about understanding any of the code for now. For each bit of code, just check out the description above and the output below, to get a sense of what the code is doing.
We'll start out by importing some of the libraries we'll use in this section, and renaming them to shorthand names (np, plt, pd, and sns). These are standard shorthands that you'll likely see throughout online tutorials and documentation for these libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Next we read in some data from part 2 of the pandas demo, which gives a breakdown by city of the Canada Learning Code events in 2014:
events_by_city = pd.read_csv('data/llc-workshop-attendance-by-city.csv', index_col=0)
events_by_city.head()
Here's a stacked bar chart of the breakdown by city for the five most popular Canada Learning Code events in 2014 (from the example data we worked with in the previous section):
top_five = events_by_city.head(5).drop('Total', axis=1).T
top_five.plot.bar(stacked=True, figsize=(10, 4))
plt.ylabel('# Participants')
plt.legend(loc='upper left', frameon=False);
We can show the data as a horizontal stacked bar chart instead. Let's also sort the cities by the total number of participants across these top 5 events, and add some gridlines.
sorted_top_five = top_five.copy()
sorted_top_five['Total'] = sorted_top_five.sum(axis=1)
sorted_top_five = sorted_top_five.sort_values('Total', ascending=True).drop('Total', axis=1)
with sns.axes_style('darkgrid'):
    sorted_top_five.plot.barh(stacked=True, figsize=(10, 7))
    plt.xlabel('# Participants')
    plt.legend(loc='center right')
Here's an example of a line chart, using some fake timeseries data:
dates = pd.date_range('2017-01-01', '2017-12-31')
ts = pd.DataFrame(np.random.randn(len(dates), 3), index=dates, columns=['A', 'B', 'C'])
ts = ts.cumsum()
ts.plot(figsize=(8, 3));
We can also plot these as separate line charts in subplots:
ts.plot(figsize=(8, 6), subplots=True);
Here's an example of creating a plot in the seaborn library, which can quickly generate graphs for many common statistical analyses.
We'll use a famous data set, Fisher's iris data, which is a set of measurements of the petals and sepals of three species of Iris flowers. This data is available as a built-in example that can be loaded from seaborn.
iris = sns.load_dataset('iris')
iris.head()
Let's look at distributions and relationships between measurements. With a single line of code, we can create a set of scatter plots and histograms for all four variables:
sns.pairplot(iris, hue="species", size=2);
seaborn also provides a variety of pre-configured styles that can be applied to quickly change the look of a graph or a whole collection of graphs. Here's an example of applying a different style to the above graph:
palette = sns.color_palette('deep')
with sns.axes_style('darkgrid'):
    sns.pairplot(iris, hue="species", size=2, palette=palette);
The example graphs in this section could easily be customized in many other ways. To get a sense of these customizations, and of the other kinds of graphs that can be created with these libraries, check out the galleries linked at the start of this section.
We've seen a very small sampling of Python's data visualization capabilities. There are many more libraries you can explore, which have been developed for a wide variety of purposes, such as maps and geographic data, 3-dimensional visualizations, interactive graphs for web pages, and more!
When you've become comfortable with the fundamentals of Python and want to start learning numpy, matplotlib, pandas, and other data analysis libraries, there are many resources available to choose from. Of the sites I listed in Section 1, DataCamp and DataQuest both offer online courses covering these topics in depth, and there are many other online learning platforms with in-depth courses on these topics as well. You can also search for tutorials for each library to find many helpful resources.
If you'd like to learn more about the Python data analysis ecosystem, such as how all the pieces fit together, which tools might be useful for you, how Python evolved into the data crunching powerhouse that it is today, and some of the challenges for new users navigating the Python world, I highly recommend this fantastic keynote talk by Jake VanderPlas from the PyData Seattle 2017 conference.
Another great resource from Jake VanderPlas is his Youtube video series "Reproducible Data Analysis in Jupyter". Part 1 and part 2 give a really nice demonstration of how to use pandas and Jupyter notebook to analyze patterns in a dataset of hourly counts of bike trips across the Fremont bridge in Seattle. If you're at a more intermediate or advanced level with Python, you might want to check out the rest of the videos in the 10-part series for all sorts of great additional tips and tricks. If you're familiar with principal component analysis and unsupervised clustering, check out part 10 for a very neat analysis showing how the bike trip data can be used to distinguish regular weekdays from weekends, holidays, and even a big snow storm!
For some more background on why I think Jupyter is so awesome for data analysis, you can check out this great talk by Fernando Perez, who was the original creator of these tools and now leads a large team of developers at Berkeley, working on all kinds of cool new features for the Jupyter project. I especially like his philosophy about "human-centered" interactive software for scientific computing and data analysis, and the ethical importance of free, open-source tools to make science accessible to folks all around the world who may not all have budgets for expensive proprietary software.
I like learning from books and when I was learning Python I found this book, by Wes McKinney, creator of pandas, very helpful. It has a companion site with data and Jupyter notebooks corresponding to all the examples in the book, so you can work through them yourself.
If you really want to geek out, check out the Zen of Python here or type import this into the IPython shell or PyCharm console.
Hopefully this guide will help you decide if Python is a good fit for the work you're doing, and if so, will help you navigate the world of Python data analysis and give you some ideas of your next steps and path forward. If you have any questions, comments, suggestions, or any other feedback, please send me an email—I'd love to hear from you. Good luck in your Python journey!