Syntax Summary

Pandas

Importing `pandas` Library

import pandas

In general, it’s good practice to collect all your import commands together and put them at the start of the notebook.

DataFrames and Series

Data in pandas is organized into DataFrames and Series.

DataFrame: 2-dimensional array, like a table in a spreadsheet
Series: 1-dimensional array, like a single column or row in a spreadsheet
- Each individual column or row of a DataFrame is represented as a Series

Reading a CSV File

To read a CSV file and store it as a DataFrame variable:

df = pandas.read_csv('some_cool_data.csv')

Missing data in a DataFrame or Series is represented as NaN (“not a number”).
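
As a rough illustration of how a missing value shows up after reading, the sketch below reads CSV text from memory instead of a file. The CSV content and the variable names `csv_text` and `missing_counts` are made up for this example.

```python
import io
import pandas

# Hypothetical CSV content; the second data row has no value for 'population'
csv_text = "country,population\nCanada,38\nAtlantis,\n"

# read_csv accepts a file path or a file-like object such as StringIO
df = pandas.read_csv(io.StringIO(csv_text))

# isna() marks missing entries as True; sum() then counts them per column
missing_counts = df.isna().sum()

print(df)
print(missing_counts)
```

The empty field is read as NaN, so `missing_counts` reports one missing value in the 'population' column.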

Saving to a CSV File

To save a DataFrame to a CSV file:

df.to_csv('cool_output.csv', index=False)

To include the DataFrame’s index as a column in the CSV file, omit the index=False keyword argument.
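
To see what the index keyword changes, one option is to call to_csv() without a file name, in which case it returns the CSV text as a string. The small DataFrame here is an assumption for illustration only.

```python
import pandas

# Small made-up DataFrame with the default index 0, 1
df = pandas.DataFrame({'X': [10, 20]})

# With no file name, to_csv returns the CSV text instead of writing a file
without_index = df.to_csv(index=False)
with_index = df.to_csv()

print(without_index)  # header 'X', then one value per line
print(with_index)     # an extra first column holds the index labels
```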

Quick and Easy Summaries of a DataFrame


General Overview
Number of rows and columns; names, data types, and non-null counts for each column; memory usage	`df.info()`
Useful Attributes
Number of rows and columns (rows first, columns second)	`df.shape`
Names and data types of each column	`df.dtypes`
Just the names of each column	`df.columns`
Rows at a Glance
First `n` rows (default 5)	`df.head(n)`
Last `n` rows (default 5)	`df.tail(n)`
A random sampling of `n` rows (default 1)	`df.sample(n)`
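
A short sketch of a few of these attributes and methods, using a made-up six-row DataFrame (the column names and values are assumptions):

```python
import pandas

# Made-up DataFrame with 6 rows and 2 columns
df = pandas.DataFrame({'X': [1, 2, 3, 4, 5, 6],
                       'Y': ['a', 'b', 'c', 'd', 'e', 'f']})

# shape is an attribute (no parentheses): rows first, columns second
n_rows, n_cols = df.shape

# columns holds the column names; list() turns them into a plain list
column_names = list(df.columns)

first_two = df.head(2)     # first 2 rows
last_default = df.tail()   # last 5 rows, since n defaults to 5

print(n_rows, n_cols)
print(column_names)
```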

Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:

df.describe()

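Mean value of each numerical column:

df.mean(numeric_only=True)

And similarly for other summary statistics: df.min(), df.max(), df.median(), df.std(). The numeric_only=True keyword argument skips non-numerical columns; without it, recent versions of pandas (2.0 and later) raise an error when a statistic such as the mean cannot be computed for a text column.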
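
A minimal sketch of these summary methods on a small, made-up DataFrame (the column names and values are assumptions for illustration). Passing numeric_only=True keeps the text column out of the calculation.

```python
import pandas

# Made-up DataFrame with one text column and two numerical columns
df = pandas.DataFrame({'name': ['a', 'b', 'c'],
                       'X': [1.0, 2.0, 6.0],
                       'Y': [10, 20, 30]})

# Mean of each numerical column; 'name' is left out
means = df.mean(numeric_only=True)

# describe() summarizes only the numerical columns by default
summary = df.describe()

print(means)
print(summary)
```

Both results are indexed by label: `means` is a Series with one entry per numerical column, and `summary` is a DataFrame with one row per statistic.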

Selecting Columns

Single Columns

Each column of a DataFrame is a Series.

series_X = df['X']

Most DataFrame methods can be applied to a Series, for example:

df['X'].head()
df['X'].max()

Multiple Columns

Use a list of column names to select several columns of a DataFrame, in a specified order:

df_subset = df[['E', 'A', 'C']]
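
The difference between single and double brackets can be checked directly. This is a sketch on a made-up DataFrame; the names `single`, `subset`, and `one_col_frame` are hypothetical.

```python
import pandas

# Made-up DataFrame with three columns
df = pandas.DataFrame({'A': [1, 2], 'C': [3, 4], 'E': [5, 6]})

single = df['A']            # a column name in single brackets gives a Series
subset = df[['E', 'A']]     # a list of names gives a DataFrame, in the listed order
one_col_frame = df[['A']]   # a one-item list still gives a DataFrame

print(type(single).__name__)
print(list(subset.columns))
print(type(one_col_frame).__name__)
```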

Selecting Rows

Extracting rows from a DataFrame or Series based on a criterion:

1. Create a filter (Boolean Series) using a comparison operator or another function (such as the isin() method)
2. Use the filter to extract the desired rows from the DataFrame

Example:

world_long_life = world[world['life_expectancy'] > 82] 
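
The two steps can also be written out separately, which may make the filter easier to inspect. The sketch below uses a tiny stand-in for the `world` DataFrame; its rows and the names `is_long_life` and `in_south` are made up for illustration.

```python
import pandas

# Tiny made-up stand-in for the 'world' DataFrame
world = pandas.DataFrame({
    'country': ['A', 'B', 'C'],
    'region': ['North', 'South', 'North'],
    'life_expectancy': [83.1, 79.5, 82.4],
})

# Step 1: build the filter, a Boolean Series with one True/False per row
is_long_life = world['life_expectancy'] > 82

# Step 2: use the filter to keep only the rows marked True
world_long_life = world[is_long_life]

# isin() builds a filter from a list of allowed values
in_south = world[world['region'].isin(['South'])]

print(is_long_life.tolist())
print(world_long_life)
print(in_south)
```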

Selecting Rows and Columns

To select both rows and columns, use .loc[<rows>, <columns>]:

canada_pop = world.loc[world['country'] == 'Canada', 
                       ['country', 'year', 'population']
                      ]
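
A runnable sketch of the same pattern on a small, made-up `world` DataFrame (the rows are assumptions). It also shows that passing a single column name instead of a list returns a Series.

```python
import pandas

# Made-up stand-in for the 'world' DataFrame
world = pandas.DataFrame({
    'country': ['Canada', 'Canada', 'Chile'],
    'year': [2010, 2015, 2015],
    'population': [34, 36, 18],
    'region': ['Americas', 'Americas', 'Americas'],
})

# Rows: a Boolean filter. Columns: a list of column names.
canada_pop = world.loc[world['country'] == 'Canada',
                       ['country', 'year', 'population']]

# A single column name (not a list) returns a Series
canada_years = world.loc[world['country'] == 'Canada', 'year']

print(canada_pop)
print(canada_years.tolist())
```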

Creating New Columns

df['Double X'] = 2 * df['X']
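
New columns can also combine several existing columns, since arithmetic on Series works row by row. A small sketch, with made-up data and a hypothetical column name 'X plus Y':

```python
import pandas

# Made-up DataFrame with two numerical columns
df = pandas.DataFrame({'X': [1, 2, 3], 'Y': [10, 20, 30]})

# Multiply every value in a column by a constant
df['Double X'] = 2 * df['X']

# Add two columns together, one row at a time
df['X plus Y'] = df['X'] + df['Y']

print(df)
```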

Unique Values & Counting

For a column df['A'] which contains many repeated values (such as categories), some useful summary methods are:


Unique values	`df['A'].unique()`
Number of unique values	`df['A'].nunique()`
Counts of each unique value	`df['A'].value_counts()`

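Note: The unique() method can only be applied to a Series (not a DataFrame). A DataFrame does have a value_counts() method, but it counts unique rows (combinations of values across columns) rather than the values of a single column.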
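
A quick sketch of the three methods on a made-up column of categories (the values are assumptions for illustration):

```python
import pandas

# Made-up column with repeated category values
df = pandas.DataFrame({'A': ['red', 'blue', 'red', 'red']})

unique_values = df['A'].unique()   # distinct values, in order of first appearance
n_unique = df['A'].nunique()       # how many distinct values there are
counts = df['A'].value_counts()    # Series of counts, largest count first

print(list(unique_values))
print(n_unique)
print(counts)
```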

Sorting

Sorting a DataFrame based on the values in the column 'B':

df.sort_values('B')

To sort in descending order, use the keyword argument ascending=False.
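
The sketch below, on made-up data, shows both sort directions. Note that sort_values() returns a new, sorted DataFrame and leaves the original unchanged, so the result needs to be assigned to a variable to be kept.

```python
import pandas

# Made-up DataFrame to sort by column 'B'
df = pandas.DataFrame({'A': ['x', 'y', 'z'], 'B': [2, 3, 1]})

sorted_asc = df.sort_values('B')                    # smallest 'B' first
sorted_desc = df.sort_values('B', ascending=False)  # largest 'B' first

print(sorted_asc)
print(sorted_desc)
print(df)  # the original row order is unchanged
```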

Aggregation

For basic aggregation operations, use the groupby() method chained with an aggregation method (e.g., sum(), mean(), min(), max(), etc.)

Use as_index=False keyword argument to keep the grouping variable as a regular column rather than the index

For example, to find the total of column 'population' grouped by column 'region':

world_2015.groupby('region', as_index=False)['population'].sum()

You can also group by multiple columns:

world_2015.groupby(['region', 'income_group'], as_index=False)['population'].sum()

For more complex aggregations, you can use the agg() method:

Specify a list of aggregation statistics, for example:

world_2015.groupby('region', as_index=False)['population'].agg(['sum', 'min', 'max'])

Use a dictionary to specify different aggregation statistics for different columns, for example:

agg_dict = {'population' : 'sum', 
            'life_expectancy' : ['min', 'max']}
world.groupby('region', as_index=False).agg(agg_dict)
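
To make the shape of the results concrete, here is a sketch on a tiny made-up stand-in for `world_2015` (the rows are assumptions). The dictionary example is run with the default index here, so the grouping variable becomes the row label. When a dictionary value is a list, the result has two-level column labels of the form (column, statistic).

```python
import pandas

# Tiny made-up stand-in for the 'world_2015' DataFrame
world_2015 = pandas.DataFrame({
    'region': ['North', 'North', 'South'],
    'population': [10, 20, 5],
    'life_expectancy': [80.0, 70.0, 75.0],
})

# One row per region, with 'region' kept as a regular column
totals = world_2015.groupby('region', as_index=False)['population'].sum()

# Different statistics for different columns; 'region' becomes the index
agg_dict = {'population': 'sum',
            'life_expectancy': ['min', 'max']}
stats = world_2015.groupby('region').agg(agg_dict)

print(totals)
print(stats)

# Select one result by row label and (column, statistic) pair
north_min = stats.loc['North', ('life_expectancy', 'min')]
print(north_min)
```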

Data Visualization

Simple Plots with Pandas

Bar plot:

region_pop.plot(x='region', y='pop_millions', kind='bar');

Scatter plot:

world_2015.plot(x='gdp_per_capita', y='life_expectancy', kind='scatter');

Saving a plot:

ax = region_pop.plot(x='region', y='pop_millions', kind='bar');
ax.get_figure().savefig('figures/region_populations.png', bbox_inches='tight')

Statistical Plots with Seaborn

Import library:

import seaborn as sns

Switch to seaborn default aesthetics:

sns.set_theme()

Example scatter plot with axes customization:

g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region',
                size='pop_millions', sizes=(40, 400), alpha=0.8)
g.set(xscale='log', title='Life Expectancy vs. GDP per Capita in 2015');

Example scatter plot with facets:

income_order = ['Low', 'Lower middle', 'Upper middle', 'High']
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', col='region',
                col_wrap=3, height=3, hue='income_group', hue_order=income_order)
g.set(xscale='log');

Example line plot with aggregation and facets:

# errorbar=None turns off the error band (seaborn versions before 0.12 used ci=None)
sns.relplot(data=world, x='year', y='pop_millions', hue='income_group', hue_order=income_order,
            style='income_group', kind='line', estimator='sum', errorbar=None, col='region',
            col_wrap=3, height=3);

Example bar plot with aggregation:

g = sns.catplot(data=world_2015, x='region', y='life_expectancy', kind='bar', aspect=1.5)
g.set(title='Mean Life Expectancy by Region in 2015');