Solutions to Exercises

Exercise 1.1¶

a) Initial setup (you can skip to part b if you've already done this):

Import the pandas library
Use pandas.read_csv() to read data from 'https://raw.githubusercontent.com/jenfly/datajam-python/master/data/gapminder.csv' and store it in a DataFrame called world.

In [1]:

import pandas

world = pandas.read_csv('https://raw.githubusercontent.com/jenfly/datajam-python/master/data/gapminder.csv')

In [2]:

world.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2492 entries, 0 to 2491
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                2492 non-null   object 
 1   year                   2492 non-null   int64  
 2   population             2492 non-null   int64  
 3   region                 2492 non-null   object 
 4   sub_region             2492 non-null   object 
 5   income_group           2492 non-null   object 
 6   life_expectancy        2492 non-null   float64
 7   gdp_per_capita         2492 non-null   int64  
 8   children_per_woman     2492 non-null   float64
 9   child_mortality        2492 non-null   float64
 10  pop_density            2492 non-null   float64
 11  years_in_school_men    1780 non-null   float64
 12  years_in_school_women  1780 non-null   float64
dtypes: float64(6), int64(3), object(4)
memory usage: 253.2+ KB

In [3]:

world.describe()

Out[3]:

	year	population	life_expectancy	gdp_per_capita	children_per_woman	child_mortality	pop_density	years_in_school_men	years_in_school_women
count	2492.00000	2.492000e+03	2492.000000	2492.000000	2492.000000	2492.000000	2492.000000	1780.000000	1780.000000
mean	1982.50000	2.667661e+07	62.567135	10502.302167	4.353752	105.670024	118.014013	7.687713	6.959556
std	20.15969	1.037838e+08	11.518029	15478.942158	2.049655	98.615582	372.055683	3.242840	3.932678
min	1950.00000	2.500000e+04	23.800000	247.000000	1.120000	2.200000	0.502000	0.900000	0.210000
25%	1965.00000	1.730000e+06	54.200000	1930.000000	2.360000	24.000000	14.275000	5.097500	3.600000
50%	1982.50000	5.270000e+06	64.650000	4925.000000	4.340000	72.400000	44.850000	7.635000	6.990000
75%	2000.00000	1.600000e+07	71.600000	12700.000000	6.300000	167.000000	108.000000	10.100000	10.000000
max	2015.00000	1.400000e+09	83.800000	178000.000000	8.870000	473.000000	7910.000000	15.300000	15.700000
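
The info() output above shows that years_in_school_men and years_in_school_women have only 1780 non-null values out of 2492 rows. Here is a minimal sketch of how such gaps could be counted directly. The `toy` table and its values are made up for illustration and are not Gapminder data.

```python
import pandas

# Hypothetical toy table standing in for `world`; all values are made up.
toy = pandas.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [1950, 1955, 1950, 1955],
    'years_in_school_men': [float('nan'), 3.1, float('nan'), float('nan')],
})

# isna() marks each missing entry as True; sum() counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same two method calls on world should report the missing entries in the two schooling columns.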

Exercise 1.2¶

a) Create a new DataFrame called americas which contains the rows of world where the region is "Americas" and has the following columns: country, year, sub_region, income_group, pop_density.

In [4]:

americas = world.loc[world['region'] == 'Americas',
                     ['country', 'year', 'sub_region', 'income_group', 'pop_density']
                    ]

In [5]:

americas.head(20)

Out[5]:

	country	year	sub_region	income_group	pop_density
56	Antigua and Barbuda	1950	Latin America and the Caribbean	High	105.00
57	Antigua and Barbuda	1955	Latin America and the Caribbean	High	120.00
58	Antigua and Barbuda	1960	Latin America and the Caribbean	High	126.00
59	Antigua and Barbuda	1965	Latin America and the Caribbean	High	138.00
60	Antigua and Barbuda	1970	Latin America and the Caribbean	High	152.00
61	Antigua and Barbuda	1975	Latin America and the Caribbean	High	163.00
62	Antigua and Barbuda	1980	Latin America and the Caribbean	High	167.00
63	Antigua and Barbuda	1985	Latin America and the Caribbean	High	159.00
64	Antigua and Barbuda	1990	Latin America and the Caribbean	High	152.00
65	Antigua and Barbuda	1995	Latin America and the Caribbean	High	167.00
66	Antigua and Barbuda	2000	Latin America and the Caribbean	High	190.00
67	Antigua and Barbuda	2005	Latin America and the Caribbean	High	203.00
68	Antigua and Barbuda	2010	Latin America and the Caribbean	High	215.00
69	Antigua and Barbuda	2015	Latin America and the Caribbean	High	227.00
70	Argentina	1950	Latin America and the Caribbean	High	6.27
71	Argentina	1955	Latin America and the Caribbean	High	6.92
72	Argentina	1960	Latin America and the Caribbean	High	7.53
73	Argentina	1965	Latin America and the Caribbean	High	8.14
74	Argentina	1970	Latin America and the Caribbean	High	8.76
75	Argentina	1975	Latin America and the Caribbean	High	9.53

In [6]:

americas.tail(20)

Out[6]:

	country	year	sub_region	income_group	pop_density
2388	Uruguay	1990	Latin America and the Caribbean	High	17.80
2389	Uruguay	1995	Latin America and the Caribbean	High	18.40
2390	Uruguay	2000	Latin America and the Caribbean	High	19.00
2391	Uruguay	2005	Latin America and the Caribbean	High	19.00
2392	Uruguay	2010	Latin America and the Caribbean	High	19.30
2393	Uruguay	2015	Latin America and the Caribbean	High	19.60
2422	Venezuela	1950	Latin America and the Caribbean	Upper middle	6.22
2423	Venezuela	1955	Latin America and the Caribbean	Upper middle	7.66
2424	Venezuela	1960	Latin America and the Caribbean	Upper middle	9.24
2425	Venezuela	1965	Latin America and the Caribbean	Upper middle	11.10
2426	Venezuela	1970	Latin America and the Caribbean	Upper middle	13.10
2427	Venezuela	1975	Latin America and the Caribbean	Upper middle	15.10
2428	Venezuela	1980	Latin America and the Caribbean	Upper middle	17.40
2429	Venezuela	1985	Latin America and the Caribbean	Upper middle	19.80
2430	Venezuela	1990	Latin America and the Caribbean	Upper middle	22.50
2431	Venezuela	1995	Latin America and the Caribbean	Upper middle	25.20
2432	Venezuela	2000	Latin America and the Caribbean	Upper middle	27.80
2433	Venezuela	2005	Latin America and the Caribbean	Upper middle	30.40
2434	Venezuela	2010	Latin America and the Caribbean	Upper middle	32.90
2435	Venezuela	2015	Latin America and the Caribbean	Upper middle	35.30

c) Use the unique() method on the country column to display the list of unique countries in the americas DataFrame.

In [7]:

americas['country'].unique()

Out[7]:

array(['Antigua and Barbuda', 'Argentina', 'Bahamas', 'Barbados',
       'Belize', 'Bolivia', 'Brazil', 'Canada', 'Chile', 'Colombia',
       'Costa Rica', 'Cuba', 'Dominican Republic', 'Ecuador',
       'El Salvador', 'Grenada', 'Guatemala', 'Guyana', 'Haiti',
       'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 'Paraguay',
       'Peru', 'Suriname', 'Trinidad and Tobago', 'United States',
       'Uruguay', 'Venezuela'], dtype=object)
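
The selection in part a) combines a boolean row mask with a list of column names inside .loc. The sketch below shows the same pattern on a small made-up table (the `toy` DataFrame and its values are assumptions for illustration only).

```python
import pandas

# Hypothetical toy table standing in for `world`; all values are made up.
toy = pandas.DataFrame({
    'country': ['Canada', 'Chile', 'Kenya', 'Canada'],
    'year': [2010, 2010, 2010, 2015],
    'region': ['Americas', 'Americas', 'Africa', 'Americas'],
    'pop_density': [3.8, 23.0, 74.0, 4.0],
})

# Boolean mask: True for the rows we want to keep.
in_americas = toy['region'] == 'Americas'

# .loc[rows, columns] selects rows by the mask and columns by name.
subset = toy.loc[in_americas, ['country', 'year', 'pop_density']]

# The original row labels are kept, so the index can have gaps.
print(subset)
print(subset['country'].unique())
```

This is why the index of americas starts at 56 and jumps from 2393 to 2422 in the output above: the rows keep their labels from world.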

Exercise 1.3¶

For this exercise we're working with the original DataFrame world (containing all years and all countries).

a) Initial setup (you can skip to part b if you've already done this):

Compute the population in millions: divide world['population'] by 1e6 and assign the result to world['pop_millions'].

In [8]:

world['pop_millions'] = world['population'] / 1e6

In [9]:

world.groupby('year', as_index=False)['pop_millions'].sum()

Out[9]:

	year	pop_millions
0	1950	2521.5914
1	1955	2755.4391
2	1960	3014.5238
3	1965	3317.6620
4	1970	3676.8109
5	1975	4052.1130
6	1980	4428.6840
7	1985	4841.1945
8	1990	5294.2122
9	1995	5714.3521
10	2000	6101.9393
11	2005	6495.9793
12	2010	6918.4071
13	2015	7345.2106
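
The groupby call above splits the rows by year, sums pop_millions within each year, and returns one row per year. Here is a minimal sketch of the same steps on a made-up table (the `toy` DataFrame and its numbers are illustrative assumptions).

```python
import pandas

# Hypothetical toy table: two countries observed in two years.
toy = pandas.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1950, 1950, 1955, 1955],
    'pop_millions': [1.5, 2.5, 2.0, 3.0],
})

# as_index=False keeps 'year' as an ordinary column instead of the index.
totals = toy.groupby('year', as_index=False)['pop_millions'].sum()
print(totals)
```

Without as_index=False, the result would be a Series indexed by year rather than a two-column DataFrame.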

Exercise 2.1¶

a) Initial setup (you can skip to part b if you've already done this):

Import the seaborn library and give it the nickname sns
Call sns.set_theme() (turns on seaborn styling)
Create a DataFrame called world_2015 which contains only the rows of world where the column year is equal to 2015.

In [10]:

import seaborn as sns

sns.set_theme()

world_2015 = world[world['year'] == 2015]

b) Use relplot() to create a scatter plot of life_expectancy vs. gdp_per_capita from world_2015, in which the points are coloured by income_group.

In [11]:

sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='income_group');

In [12]:

sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='income_group',
            aspect=1.5);

Exercise 2.2¶

In [13]:

income_order = ['Low', 'Lower middle', 'Upper middle', 'High']

b) Use relplot() to create a plot similar to the previous example, but plotting life_expectancy on the y-axis instead of pop_millions and aggregating with the mean instead of the sum.

We want to aggregate with the mean instead of the sum, so you'll need to use the keyword argument estimator='mean'.
Other aspects of the plot are the same as the previous example: use the world DataFrame, put year on the x-axis, map income_group to line colour and style, pass income_order as the hue_order argument, and facet on region.

In [14]:

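# ci=None turns off the confidence band around each line.
# In seaborn 0.12 and later, ci is deprecated; errorbar=None has the same effect.
sns.relplot(data=world, x='year', y='life_expectancy', hue='income_group',
            hue_order=income_order, style='income_group', kind='line',
            estimator='mean', ci=None, col='region',
            col_wrap=3, height=3);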
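
For the lines drawn in this plot, estimator='mean' amounts to averaging life_expectancy over all rows that share the same year within each region and income group. The sketch below reproduces that aggregation with pandas alone, on a made-up table (the `toy` DataFrame and its values are assumptions for illustration), which can be handy for checking the numbers behind a line.

```python
import pandas

# Hypothetical toy table: two high-income countries and one low-income
# country in the same region, observed in two years. Values are made up.
toy = pandas.DataFrame({
    'region': ['Americas'] * 6,
    'income_group': ['High', 'High', 'High', 'High', 'Low', 'Low'],
    'year': [2010, 2010, 2015, 2015, 2010, 2015],
    'life_expectancy': [70.0, 80.0, 72.0, 82.0, 50.0, 52.0],
})

# One mean per (region, income_group, year): one point on each plotted line.
grouped = toy.groupby(['region', 'income_group', 'year'], as_index=False)
means = grouped['life_expectancy'].mean()
print(means)
```

A group that contains a single country simply returns that country's values, so one unusual data point shows up directly in the line.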

Bonus: Do you spot anything strange in the subplot for the "Americas" region? How could you investigate this using the techniques we learned in the Intro to Pandas lesson?

There appears to be an outlier in the Americas low-income group in 2010. To investigate, we can look at the rows of world for this group to see if anything jumps out:

In [15]:

world[(world['income_group'] == 'Low') & (world['region'] == 'Americas')]

Out[15]:

	country	year	population	region	sub_region	income_group	life_expectancy	gdp_per_capita	children_per_woman	child_mortality	pop_density	years_in_school_men	years_in_school_women	pop_millions
938	Haiti	1950	3220000	Americas	Latin America and the Caribbean	Low	35.3	2290	6.31	352.0	117.0	NaN	NaN	3.22
939	Haiti	1955	3510000	Americas	Latin America and the Caribbean	Low	38.0	2260	6.30	313.0	128.0	NaN	NaN	3.51
940	Haiti	1960	3870000	Americas	Latin America and the Caribbean	Low	40.9	2300	6.32	286.0	140.0	NaN	NaN	3.87
941	Haiti	1965	4270000	Americas	Latin America and the Caribbean	Low	43.6	2010	6.18	264.0	155.0	NaN	NaN	4.27
942	Haiti	1970	4710000	Americas	Latin America and the Caribbean	Low	45.6	1980	5.76	241.0	171.0	2.93	1.73	4.71
943	Haiti	1975	5140000	Americas	Latin America and the Caribbean	Low	48.3	2260	5.64	215.0	187.0	3.34	2.05	5.14
944	Haiti	1980	5690000	Americas	Latin America and the Caribbean	Low	50.4	2860	6.06	190.0	206.0	3.81	2.42	5.69
945	Haiti	1985	6380000	Americas	Latin America and the Caribbean	Low	52.5	2480	6.03	166.0	232.0	4.34	2.87	6.38
946	Haiti	1990	7100000	Americas	Latin America and the Caribbean	Low	54.1	2250	5.43	145.0	258.0	4.93	3.40	7.10
947	Haiti	1995	7820000	Americas	Latin America and the Caribbean	Low	55.1	1680	4.89	124.0	284.0	5.59	4.04	7.82
948	Haiti	2000	8550000	Americas	Latin America and the Caribbean	Low	56.8	1740	4.30	105.0	310.0	6.21	4.69	8.55
949	Haiti	2005	9260000	Americas	Latin America and the Caribbean	Low	58.0	1560	3.76	89.8	336.0	6.86	5.39	9.26
950	Haiti	2010	10000000	Americas	Latin America and the Caribbean	Low	32.1	1500	3.33	208.0	363.0	7.52	6.12	10.00
951	Haiti	2015	10700000	Americas	Latin America and the Caribbean	Low	63.9	1650	2.97	68.9	389.0	8.20	6.90	10.70

We can see that the Americas low-income group contains only one country, Haiti, and that its life expectancy dropped sharply in 2010. A devastating earthquake struck Haiti in 2010, so the drop likely reflects these disaster conditions. However, the magnitude of the decrease (from 58.0 to 32.1 years) seems potentially unrealistic and could indicate an issue in how life expectancy was calculated in the Gapminder data.
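
Rather than reading the table by eye, a drop like this could also be flagged programmatically. The sketch below is one possible approach, not part of the original lesson: it computes the change in life expectancy between consecutive rows for each country and keeps the large decreases. The Haiti values are copied from the table above; the 10-year threshold is an arbitrary assumption.

```python
import pandas

# Haiti's life expectancy for 2000-2015, copied from the table above.
haiti = pandas.DataFrame({
    'country': ['Haiti'] * 4,
    'year': [2000, 2005, 2010, 2015],
    'life_expectancy': [56.8, 58.0, 32.1, 63.9],
})

# Change from the previous row within each country (NaN for the first row).
haiti['change'] = haiti.groupby('country')['life_expectancy'].diff()

# Keep rows where life expectancy fell by more than 10 years.
# The threshold of 10 is an arbitrary choice for illustration.
big_drops = haiti[haiti['change'] < -10]
print(big_drops)
```

Applied to the full world DataFrame (sorted by country and year), the same steps would list every country and year with a similarly abrupt decrease.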
