Intro to Data Analysis in Python

Software setup instructions
Lessons:
- 0 - Intro to Jupyter
- 1 - Intro to Pandas
- 2 - Intro to Data Visualization
Syntax summary
Solutions to exercises
You can download all the workshop materials as a zipped folder or clone/fork the GitHub repo.
Additional resources

Tentative Schedule

1:00 Intro to Jupyter (35 min)
1:35 Intro to Pandas (20 min)
1:55 Break (10 min)
2:05 Intro to Pandas cont'd (45 min)
2:50 Break (10 min)
3:00 Intro to Visualization (60 min)

Lesson 0: Intro to Jupyter

What is Jupyter?

For this workshop, we will be using Python via Jupyter
You can think of Python like a car’s engine, while Jupyter is like a car’s dashboard
- Python is the programming language that runs computations
- Jupyter is an integrated development environment (IDE) that provides an interface by adding convenient features and tools

[Image: engine]

Jupyter Notebooks

Code, plots, formatted text, equations, etc. in a single document
Run Python code interactively
Also supports R, Julia, Perl, and over 100 other languages (and counting!)

Notebooks are great for exploration and for documenting your workflow
Many options for sharing notebooks in human readable format:
- Share online with nbviewer.jupyter.org
- If you use GitHub, any notebooks you upload are automatically rendered on the site
- Convert to HTML, PDF, etc. with nbconvert

Classic Jupyter Notebook vs. JupyterLab

We'll be using JupyterLab
This tutorial may be a helpful reference as you're finding your way around JupyterLab

Getting Started

Let's open JupyterLab and create our first Jupyter notebook! Two options:

Working online on Syzygy
Working locally on your computer

What if I don’t like where my current working directory is?

[Image: working directory]

Illustration by Allison Horst

Navigating the file system in JupyterLab
Creating new directories (folders)
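
The working directory can also be checked and changed from a code cell. The following is a minimal sketch using only the standard library; moving to the parent folder is just an illustration, and in practice you would pass whichever folder you want to work in.

```python
import os
from pathlib import Path

# Where is Python currently looking for files?
start_dir = Path.cwd()
print("Current working directory:", start_dir)

# Move up one level (illustrative target; any existing folder works)
os.chdir(start_dir.parent)
print("Now working in:", Path.cwd())

# Return to where we started
os.chdir(start_dir)
print("Back in:", Path.cwd())
```

Relative file paths in a notebook are resolved against this directory, which is why it matters where the notebook is saved and opened from.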

Organizing Projects

It's good practice to keep all the files for a project in one folder, and use sub-folders to keep things organized.

Let's create a new folder for this workshop and call it python-beginner
Within this folder, let's create two sub-folders:
- data
- figures
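
In the workshop these folders are created through the JupyterLab file browser, but the same layout could also be built from code. Below is a rough sketch with `pathlib`; the helper name `make_project_folders` is hypothetical, and the demo runs in a temporary directory so that it does not touch your real files.

```python
import tempfile
from pathlib import Path


def make_project_folders(parent):
    """Create python-beginner/ with data/ and figures/ inside `parent`."""
    project = Path(parent) / "python-beginner"
    for sub in ("data", "figures"):
        # parents=True also creates python-beginner itself;
        # exist_ok=True makes it safe to run the cell more than once
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project


# Demo in a throwaway directory; pass Path.cwd() (or any folder) to create it for real
with tempfile.TemporaryDirectory() as tmp:
    project = make_project_folders(tmp)
    created = sorted(p.name for p in project.iterdir())
    print(created)  # ['data', 'figures']
```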

Create a New Notebook

Navigate to your python-beginner folder
Create a new untitled notebook
- Note the .ipynb extension (comes from "IPython notebook", the name used before the project was renamed Jupyter to reflect multi-language support)
- Rename the notebook to "workshop.ipynb"
Notebooks auto-save periodically, or you can manually save
You can open a previously saved notebook by clicking on it in the Files sidebar

Working with Notebooks

A notebook consists of a series of "cells":

Code cells: execute snippets of code and display the output
Markdown cells: formatted text, equations, images, and more

By default, a new cell is always a code cell.
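
Behind the scenes, a .ipynb file is a JSON document that stores this list of cells. The sketch below builds a tiny, hand-written notebook structure to show the idea. The cell contents are made up for illustration, and in practice JupyterLab writes this file for you.

```python
import json

# A minimal notebook: one Markdown cell followed by one code cell
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My first notebook"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not run yet
            "outputs": [],
            "source": ["print('Hello world!')"],
        },
    ],
}

# Round-trip through JSON text, as happens when a notebook is saved and re-opened
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)  # ['markdown', 'code']
```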

Code Cells

To run a code cell, click in it and press Shift-Enter or press the Run button on the toolbar

In [1]:

print('Hello world!')

Hello world!

In [2]:

2 + 2

Out[2]:

4

Some handy features:

Auto-complete
Viewing documentation
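
In JupyterLab, pressing Tab after typing part of a name offers completions, and appending `?` to a name (for example `print?`) displays its documentation. The `?` syntax only works inside Jupyter/IPython, so the sketch below shows roughly equivalent lookups in plain Python; `str.upper` is just an arbitrary example.

```python
import inspect

# Read a function's documentation (similar to typing `str.upper?` in a notebook)
doc = inspect.getdoc(str.upper)
print(doc)

# List string methods starting with "up" (similar to what Tab completion offers)
matches = [name for name in dir(str) if name.startswith("up")]
print(matches)  # ['upper']
```

The built-in `help()` function prints the same documentation in a longer format.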

Markdown Cells

In Markdown cells, you can write plain text or add formatting and other elements with Markdown. These include headers, bold text, italic text, hyperlinks, equations such as $A=\pi r^2$, inline code such as print('Hello world!'), bulleted lists, and more.

To create a Markdown cell, select an empty cell and change the cell type from "Code" to "Markdown" in the dropdown menu on the toolbar
To run a Markdown cell, press Shift-Enter or the Run button on the toolbar
To edit a Markdown cell, you need to double-click inside it

Other Notebook Basics

Organizing cells: insert, delete, cut/copy/paste, move up/down, split, merge
Running all cells or selected cell(s)
Restarting and interrupting the kernel
Caveat: Notebooks are nonlinear and running cells out of order can sometimes lead to unexpected results
- It's good practice to periodically restart the kernel and run all cells, making sure that everything works as expected when you run the whole notebook from top to bottom
Closing vs. shutting down a notebook: closing the tab leaves the kernel process running in the background, while shutting down stops it
Re-opening a notebook after shutdown
- All the code output from the previous kernel session is preserved, but variables are no longer in memory until you re-run the cells
Clear output of all cells or selected cell(s)
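
To make the caveat about running cells out of order concrete, here is a small simulation. It is only a sketch: the three "cells" are stored as strings and run with `exec` in a shared namespace that stands in for the kernel's memory, and the names `cells` and `run` are invented for this illustration.

```python
# Three notebook "cells", each a snippet of code
cells = {
    "A": "x = 1",
    "B": "x = x + 1",
    "C": "y = x * 10",
}


def run(order):
    """Run cells in the given order in a fresh namespace (like a restarted kernel)."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["y"]


top_to_bottom = run(["A", "B", "C"])     # x ends up as 2, so y is 20
b_run_twice = run(["A", "B", "B", "C"])  # x ends up as 3, so y is 30
print(top_to_bottom, b_run_twice)
```

The notebook looks identical in both cases, but the result differs, which is why restarting the kernel and running all cells from top to bottom is a useful check.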

Interactivity vs. Automation

For a great example of how an interactive workflow in a Jupyter notebook can progress into automation with libraries/scripts, check out Jake VanderPlas' blog post Reproducible Data Analysis in Jupyter.

Python Data Science Ecosystem

The Python libraries for data science are developed and maintained by external "3rd party" development teams

Python core + 3rd party libraries = ecosystem
To install and manage 3rd party libraries, you need to use a package manager such as conda (which comes with Anaconda/Miniconda)
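
Because these libraries are not part of core Python, it can help to check whether they are available before starting. The sketch below uses only the standard library and looks for the three libraries named in this workshop; it reports what it finds without importing them.

```python
from importlib.util import find_spec

# Third-party libraries used in this workshop
libraries = ["pandas", "seaborn", "plotly"]

# find_spec returns None when a top-level package cannot be found
installed = {name: find_spec(name) is not None for name in libraries}

for name, ok in installed.items():
    print(f"{name}: {'installed' if ok else 'not found'}")
```

If a library is reported as not found, it can typically be added with the package manager, for example `conda install pandas`, though the exact command depends on how your environment was set up.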

Some of the libraries in the Python data science ecosystem:

[Image: Python data science ecosystem]

From The Unexpected Effectiveness of Python in Science (Jake VanderPlas)

In this workshop, we'll be using pandas to work with tabular data and will give a brief introduction to data visualization with the seaborn and plotly libraries.