
Syntax Summary

Pandas

Importing pandas Library

import pandas

In general, it’s good practice to collect all your import statements together at the start of the notebook.

DataFrames and Series

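Data in pandas is organized into DataFrames (two-dimensional tables with labeled rows and columns) and Series (one-dimensional labeled arrays, such as a single column of a table).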
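As a minimal sketch of the two structures, the following builds a small DataFrame from a dictionary. The column names and values are made up for illustration.

```python
import pandas

# Hypothetical example data: each dictionary key becomes a column
df = pandas.DataFrame({
    'country': ['Canada', 'Japan', 'Kenya'],
    'population': [35.9, 127.1, 46.1],  # made-up values, in millions
})

# A DataFrame is a two-dimensional table; each of its columns is a Series
print(type(df))
print(type(df['population']))
```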

Reading a CSV File

To read a CSV file and store it as a DataFrame variable:

df = pandas.read_csv('some_cool_data.csv')

Missing data in a DataFrame or Series is represented as NaN (“not a number”).
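To see how a missing value shows up, here is a small sketch that reads CSV text from an in-memory string instead of a file. The data is invented for the example.

```python
import io
import pandas

# Hypothetical CSV text standing in for a file on disk; the second row has no 'y' value
csv_text = "x,y\n1,10\n2,\n3,30\n"
df = pandas.read_csv(io.StringIO(csv_text))

# The empty field is read as NaN, which makes the 'y' column floating point
print(df)
print(df['y'].isna())  # True where the value is missing
print(df['y'].mean())  # NaN values are skipped by default
```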

Saving to a CSV File

To save a DataFrame to a CSV file:

df.to_csv('cool_output.csv', index=False)
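The index=False argument leaves the row labels out of the output. The sketch below compares the two behaviors, writing to in-memory buffers rather than files; the data is made up.

```python
import io
import pandas

df = pandas.DataFrame({'x': [1, 2], 'y': [3, 4]})  # made-up data

# Write to in-memory buffers (instead of files) to compare the two outputs
with_index = io.StringIO()
without_index = io.StringIO()
df.to_csv(with_index)                  # default: row labels written as an extra first column
df.to_csv(without_index, index=False)  # row labels omitted

print(with_index.getvalue())
print(without_index.getvalue())
```

Omitting the index is usually what you want when the row labels are just the default 0, 1, 2, ... and carry no information.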

Quick and Easy Summaries of a DataFrame

   
General Overview
Number of rows and columns; names, data types, and non-null counts for each column; memory usage: df.info()

Useful Attributes
Number of rows and columns (rows first, columns second): df.shape
Names and data types of each column: df.dtypes
Just the names of each column: df.columns

Rows at a Glance
First n rows (default 5): df.head(n)
Last n rows (default 5): df.tail(n)
A random sampling of n rows (default 1): df.sample(n)
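Attributes such as shape and columns are written without parentheses, while methods such as head() and tail() are called with them. A short sketch on made-up data:

```python
import pandas

df = pandas.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'value': [1.5, 2.5, 3.5, 4.5],
})  # made-up data

# Attributes are accessed without parentheses
n_rows, n_cols = df.shape
print(n_rows, n_cols)
print(list(df.columns))

# Methods are called with parentheses
print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```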

Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:

df.describe()

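Mean value of each numerical column:

df.mean(numeric_only=True)

And similarly for other summary statistics: df.min(), df.max(), df.median(), df.std(). In recent pandas versions (2.0 and later), these raise an error on non-numerical columns unless you pass numeric_only=True or select the numerical columns first.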
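A small sketch of these statistics on a DataFrame that mixes text and numerical columns. The data is made up, and numeric_only=True restricts the calculation to the numerical columns.

```python
import pandas

df = pandas.DataFrame({
    'country': ['Canada', 'Japan', 'Kenya'],
    'life_expectancy': [82.0, 84.0, 62.0],
    'population': [36.0, 127.0, 46.0],
})  # made-up values

# Restrict the calculation to numerical columns
means = df.mean(numeric_only=True)
print(means)

# Equivalent approach: select the numerical columns first
print(df[['life_expectancy', 'population']].mean())
```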

Selecting Columns

Single Columns

Each column of a DataFrame is a Series.

series_X = df['X']

Most DataFrame methods can be applied to a Series, for example:

df['X'].head()
df['X'].max()

Multiple Columns

Use a list of column names to select several columns of a DataFrame, in a specified order:

df_subset = df[['E', 'A', 'C']]
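The difference between single and double brackets is easy to miss, so here is a brief sketch on made-up data showing what each form returns.

```python
import pandas

df = pandas.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})  # made-up data

single = df['A']         # a single name gives a Series
subset = df[['C', 'A']]  # a list of names gives a DataFrame, in the order listed
one_col = df[['A']]      # a one-item list still gives a DataFrame

print(type(single))
print(list(subset.columns))
print(type(one_col))
```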

Selecting Rows

To extract the rows of a DataFrame or Series that satisfy a condition, index it with a boolean expression:

Example:

world_long_life = world[world['life_expectancy'] > 82] 
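The filter works in two steps, sketched below on a tiny stand-in for the world DataFrame (the rows and values here are invented). The last line shows one way to combine conditions.

```python
import pandas

# Hypothetical stand-in for the 'world' DataFrame
world = pandas.DataFrame({
    'country': ['Canada', 'Japan', 'Kenya', 'Italy'],
    'region': ['Americas', 'Asia', 'Africa', 'Europe'],
    'life_expectancy': [82.2, 83.8, 62.1, 82.7],
})

# Step 1: the comparison produces a boolean Series, one True/False per row
mask = world['life_expectancy'] > 82
print(mask)

# Step 2: indexing with the mask keeps only the rows where it is True
world_long_life = world[mask]
print(world_long_life)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
long_life_not_asia = world[(world['life_expectancy'] > 82) & (world['region'] != 'Asia')]
print(long_life_not_asia)
```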

Selecting Rows and Columns

To select both rows and columns, use .loc[<rows>, <columns>]:

canada_pop = world.loc[world['country'] == 'Canada', 
                       ['country', 'year', 'population']
                      ]
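A runnable version of this pattern, using a small invented stand-in for the world DataFrame. A bare colon in either position selects everything along that axis.

```python
import pandas

# Hypothetical stand-in for the 'world' DataFrame
world = pandas.DataFrame({
    'country': ['Canada', 'Canada', 'Japan'],
    'year': [2010, 2015, 2015],
    'population': [34.0, 35.9, 127.1],
    'life_expectancy': [81.3, 82.2, 83.8],
})

# Rows where country is Canada, restricted to three columns
canada_pop = world.loc[world['country'] == 'Canada', ['country', 'year', 'population']]
print(canada_pop)

# A colon selects all rows (or all columns)
all_rows_two_cols = world.loc[:, ['country', 'year']]
print(all_rows_two_cols)
```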

Creating New Columns

df['Double X'] = 2 * df['X']
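Arithmetic on a column is applied element by element, and two columns can be combined the same way. A short sketch with made-up data (the 'X plus Y' column name is only for illustration):

```python
import pandas

df = pandas.DataFrame({'X': [1, 2, 3], 'Y': [10, 20, 30]})  # made-up data

# Each row of the new column is computed from the same row of the existing columns
df['Double X'] = 2 * df['X']
df['X plus Y'] = df['X'] + df['Y']

print(df)
```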

Unique Values & Counting

For a column df['A'] which contains many repeated values (such as categories), some useful summary methods are:

   
Unique values: df['A'].unique()
Number of unique values: df['A'].nunique()
Counts of each unique value: df['A'].value_counts()

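Note: The unique() method can only be applied to a Series (not a DataFrame). Calling value_counts() on a DataFrame (pandas 1.1 and later) counts unique rows rather than the values of a single column.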
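A quick sketch of the three methods on a made-up categorical column:

```python
import pandas

df = pandas.DataFrame({'A': ['red', 'blue', 'red', 'green', 'red', 'blue']})  # made-up data

# Unique values, in order of first appearance
print(df['A'].unique())

# How many distinct values there are
print(df['A'].nunique())

# Count of each value, most frequent first
counts = df['A'].value_counts()
print(counts)
```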

Sorting

Sorting a DataFrame based on the values in the column 'B':

df.sort_values('B')

To sort in descending order, use the keyword argument ascending=False.
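Note that sort_values returns a new, sorted DataFrame and leaves the original unchanged, so assign the result if you want to keep it. A sketch on made-up data:

```python
import pandas

df = pandas.DataFrame({'A': ['x', 'y', 'z'], 'B': [2, 3, 1]})  # made-up data

ascending = df.sort_values('B')                    # smallest 'B' first
descending = df.sort_values('B', ascending=False)  # largest 'B' first

print(ascending)
print(descending)
print(df)  # the original row order is unchanged
```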

Aggregation

For basic aggregation operations, use the groupby() method chained with an aggregation method (e.g., sum(), mean(), min(), max()).

For example, to find the sum totals of column 'population' grouped by column 'region':

world_2015.groupby('region', as_index=False)['population'].sum()

You can also group by multiple columns:

world_2015.groupby(['region', 'income_group'], as_index=False)['population'].sum()
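To make the grouping concrete, here is a sketch on a four-row stand-in for world_2015 (invented values). With as_index=False, the grouping columns come back as ordinary columns rather than as the index.

```python
import pandas

# Hypothetical stand-in for the 'world_2015' DataFrame
world_2015 = pandas.DataFrame({
    'region': ['Asia', 'Asia', 'Europe', 'Europe'],
    'income_group': ['High', 'Low', 'High', 'High'],
    'population': [100, 50, 60, 10],
})

# One row per region, with the population summed within each region
by_region = world_2015.groupby('region', as_index=False)['population'].sum()
print(by_region)

# One row per (region, income_group) combination that occurs in the data
by_both = world_2015.groupby(['region', 'income_group'], as_index=False)['population'].sum()
print(by_both)
```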

For more complex aggregations, you can use the agg method:

Specify a list of aggregation statistics, for example:

world_2015.groupby('region', as_index=False)['population'].agg(['sum', 'min', 'max'])

Use a dictionary to specify different aggregation statistics for different columns, for example:

agg_dict = {'population' : 'sum', 
            'life_expectancy' : ['min', 'max']}
world.groupby('region', as_index=False).agg(agg_dict)
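A sketch of the dictionary form on invented data. When any dictionary value is a list, the result has two-level column labels (column name, statistic), so individual result columns are selected with a tuple.

```python
import pandas

# Hypothetical stand-in for the 'world' DataFrame
world = pandas.DataFrame({
    'region': ['Asia', 'Asia', 'Europe', 'Europe'],
    'population': [100, 50, 60, 10],
    'life_expectancy': [80.0, 70.0, 82.0, 78.0],
})

agg_dict = {'population': 'sum',
            'life_expectancy': ['min', 'max']}
result = world.groupby('region', as_index=False).agg(agg_dict)
print(result)

# Column labels are (column, statistic) pairs
print(result.columns.tolist())
print(result[('life_expectancy', 'min')])
```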

Data Visualization

Simple Plots with Pandas

Bar plot:

region_pop.plot(x='region', y='pop_millions', kind='bar');

Scatter plot:

world_2015.plot(x='gdp_per_capita', y='life_expectancy', kind='scatter');

Saving a plot:

ax = region_pop.plot(x='region', y='pop_millions', kind='bar');
ax.get_figure().savefig('figures/region_populations.png', bbox_inches='tight')
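The savefig call does not create missing folders, so the example above assumes a figures folder already exists. One way to make sure it does, sketched with the standard library:

```python
from pathlib import Path

# 'figures' matches the folder name used in the example above
fig_dir = Path('figures')
fig_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists

# Build the output path; pass it (or str(out_path)) to savefig
out_path = fig_dir / 'region_populations.png'
print(out_path)
```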

Statistical Plots with Seaborn

Import library:

import seaborn as sns

Switch to seaborn default aesthetics:

sns.set_theme()

Example scatter plot with axes customization:

g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region',
                size='pop_millions', sizes=(40, 400), alpha=0.8)
g.set(xscale='log', title='Life Expectancy vs. GDP per Capita in 2015');

Example scatter plot with facets:

income_order = ['Low', 'Lower middle', 'Upper middle', 'High']
g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', col='region',
                col_wrap=3, height=3, hue='income_group', hue_order=income_order)
g.set(xscale='log');

Example line plot with aggregation and facets:

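# errorbar=None turns off the confidence band (seaborn 0.12 and later; older versions use ci=None)
sns.relplot(data=world, x='year', y='pop_millions', hue='income_group', hue_order=income_order,
            style='income_group', kind='line', estimator='sum', errorbar=None, col='region',
            col_wrap=3, height=3);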

Example bar plot with aggregation:

g = sns.catplot(data=world_2015, x='region', y='life_expectancy', kind='bar', aspect=1.5)
g.set(title='Mean Life Expectancy by Region in 2015');

Saving a plot:

g = sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='region',
                size='pop_millions', sizes=(40, 400), alpha=0.8)
g.set(xscale='log', title='Life Expectancy vs. GDP per Capita in 2015');
g.savefig('figures/life_exp_vs_gdp_percap.png')

Interactive Plots with Plotly

Import Plotly Express:

import plotly.express as px

Example scatter plot:

px.scatter(data_frame=world_2015, x='gdp_per_capita', y='life_expectancy', color='region',
           size='pop_millions', size_max=30, log_x=True, hover_data=['country'],
           title='Life Expectancy vs. GDP per Capita in 2015')

Example scatter plot with facets:

px.scatter(data_frame=world_2015, x='gdp_per_capita', y='life_expectancy', 
           facet_col='region', facet_col_wrap=3,
           color='income_group', 
           category_orders={'income_group' : ['Low', 'Lower middle', 'Upper middle', 'High']},
           log_x=True, hover_data=['country'],
           title='Life Expectancy vs. GDP per Capita in 2015')

Example scatter plot with animation:

px.scatter(data_frame=world, x='gdp_per_capita', y='life_expectancy', color='region',
           size='pop_millions', size_max=30, log_x=True, range_y=(20, 90),
           title='Life Expectancy vs. GDP per Capita (1950-2015)',
           animation_frame='year')

Saving a figure:

fig = px.scatter(data_frame=world_2015, x='gdp_per_capita', y='life_expectancy', color='region',
                 size='pop_millions', size_max=30, log_x=True, hover_data=['country'],
                 title='Life Expectancy vs. GDP per Capita in 2015')

# Save to HTML (interactive)
fig.write_html('figures/plotly_life_exp_vs_gdp_percap.html')

# Save as a static PNG image (requires the kaleido package to be installed)
fig.write_image('figures/plotly_life_exp_vs_gdp_percap.png')
