a) Initial setup (you can skip to part b if you've already done this):
- Import the pandas library.
- Use pandas.read_csv() to read data from 'https://raw.githubusercontent.com/jenfly/datajam-python/master/data/gapminder.csv' and store it in a DataFrame called world.
import pandas
world = pandas.read_csv('https://raw.githubusercontent.com/jenfly/datajam-python/master/data/gapminder.csv')
b) Based on the output of world.info(), what data type is the pop_density column?
world.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2492 entries, 0 to 2491
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   country                2492 non-null   object
 1   year                   2492 non-null   int64
 2   population             2492 non-null   int64
 3   region                 2492 non-null   object
 4   sub_region             2492 non-null   object
 5   income_group           2492 non-null   object
 6   life_expectancy        2492 non-null   float64
 7   gdp_per_capita         2492 non-null   int64
 8   children_per_woman     2492 non-null   float64
 9   child_mortality        2492 non-null   float64
 10  pop_density            2492 non-null   float64
 11  years_in_school_men    1780 non-null   float64
 12  years_in_school_women  1780 non-null   float64
dtypes: float64(6), int64(3), object(4)
memory usage: 253.2+ KB
The pop_density column has dtype float64 (64-bit floating point).
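To check one column's type without reading through the full info() output, the dtype can be read directly. A minimal sketch, assuming a small stand-in DataFrame named sample (hypothetical; its few values are copied from rows shown elsewhere on this page) in place of the full world data:

```python
import pandas

# Tiny stand-in for `world`, so the example runs without downloading anything.
sample = pandas.DataFrame({
    'country': ['Argentina', 'Argentina'],
    'year': [1950, 1955],
    'pop_density': [6.27, 6.92],
})

# dtype of a single column
density_dtype = sample['pop_density'].dtype
print(density_dtype)

# dtypes of every column at once (a Series indexed by column name)
print(sample.dtypes)
```

With the real data, world['pop_density'].dtype should report the same type that info() lists for that column.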
c) Based on the output of world.describe(), what are the minimum and maximum years in this data?
world.describe()
| | year | population | life_expectancy | gdp_per_capita | children_per_woman | child_mortality | pop_density | years_in_school_men | years_in_school_women |
---|---|---|---|---|---|---|---|---|---|
count | 2492.00000 | 2.492000e+03 | 2492.000000 | 2492.000000 | 2492.000000 | 2492.000000 | 2492.000000 | 1780.000000 | 1780.000000 |
mean | 1982.50000 | 2.667661e+07 | 62.567135 | 10502.302167 | 4.353752 | 105.670024 | 118.014013 | 7.687713 | 6.959556 |
std | 20.15969 | 1.037838e+08 | 11.518029 | 15478.942158 | 2.049655 | 98.615582 | 372.055683 | 3.242840 | 3.932678 |
min | 1950.00000 | 2.500000e+04 | 23.800000 | 247.000000 | 1.120000 | 2.200000 | 0.502000 | 0.900000 | 0.210000 |
25% | 1965.00000 | 1.730000e+06 | 54.200000 | 1930.000000 | 2.360000 | 24.000000 | 14.275000 | 5.097500 | 3.600000 |
50% | 1982.50000 | 5.270000e+06 | 64.650000 | 4925.000000 | 4.340000 | 72.400000 | 44.850000 | 7.635000 | 6.990000 |
75% | 2000.00000 | 1.600000e+07 | 71.600000 | 12700.000000 | 6.300000 | 167.000000 | 108.000000 | 10.100000 | 10.000000 |
max | 2015.00000 | 1.400000e+09 | 83.800000 | 178000.000000 | 8.870000 | 473.000000 | 7910.000000 | 15.300000 | 15.700000 |
The minimum year is 1950 and the maximum year is 2015.
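The same two numbers can be pulled out directly instead of being read off the describe() table. A minimal sketch, assuming a hypothetical stand-in DataFrame named sample with only a handful of illustrative years:

```python
import pandas

# Illustrative stand-in for world['year']; not the full dataset.
sample = pandas.DataFrame({'year': [1950, 1955, 2010, 2015]})

# agg() with a list of function names returns a Series indexed by those names
year_range = sample['year'].agg(['min', 'max'])
print(year_range)
```

On the real data the equivalent call would be world['year'].agg(['min', 'max']).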
a) Create a new DataFrame called americas which contains the rows of world where the region is "Americas" and has the following columns: country, year, sub_region, income_group, pop_density.
americas = world.loc[
    world['region'] == 'Americas',
    ['country', 'year', 'sub_region', 'income_group', 'pop_density'],
]
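The .loc call above combines two selections: a boolean mask that picks rows and a list of names that picks columns. A minimal sketch of the two pieces, assuming a made-up three-row DataFrame named sample (the rows are illustrative, not taken from the dataset):

```python
import pandas

# Made-up rows, just enough to show the mechanics.
sample = pandas.DataFrame({
    'country': ['Argentina', 'Haiti', 'France'],
    'region': ['Americas', 'Americas', 'Europe'],
    'year': [2015, 2015, 2015],
})

# Step 1: comparing a column to a value gives one True/False per row
mask = sample['region'] == 'Americas'

# Step 2: .loc[rows, columns] keeps the True rows and the listed columns
subset = sample.loc[mask, ['country', 'year']]
print(subset)
```

Note that the selected rows keep their original index labels, which is why the americas output further down starts at index 56 rather than 0.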
b) Use the head() and tail() methods to display the first 20 and last 20 rows.
americas.head(20)
| | country | year | sub_region | income_group | pop_density |
---|---|---|---|---|---|
56 | Antigua and Barbuda | 1950 | Latin America and the Caribbean | High | 105.00 |
57 | Antigua and Barbuda | 1955 | Latin America and the Caribbean | High | 120.00 |
58 | Antigua and Barbuda | 1960 | Latin America and the Caribbean | High | 126.00 |
59 | Antigua and Barbuda | 1965 | Latin America and the Caribbean | High | 138.00 |
60 | Antigua and Barbuda | 1970 | Latin America and the Caribbean | High | 152.00 |
61 | Antigua and Barbuda | 1975 | Latin America and the Caribbean | High | 163.00 |
62 | Antigua and Barbuda | 1980 | Latin America and the Caribbean | High | 167.00 |
63 | Antigua and Barbuda | 1985 | Latin America and the Caribbean | High | 159.00 |
64 | Antigua and Barbuda | 1990 | Latin America and the Caribbean | High | 152.00 |
65 | Antigua and Barbuda | 1995 | Latin America and the Caribbean | High | 167.00 |
66 | Antigua and Barbuda | 2000 | Latin America and the Caribbean | High | 190.00 |
67 | Antigua and Barbuda | 2005 | Latin America and the Caribbean | High | 203.00 |
68 | Antigua and Barbuda | 2010 | Latin America and the Caribbean | High | 215.00 |
69 | Antigua and Barbuda | 2015 | Latin America and the Caribbean | High | 227.00 |
70 | Argentina | 1950 | Latin America and the Caribbean | High | 6.27 |
71 | Argentina | 1955 | Latin America and the Caribbean | High | 6.92 |
72 | Argentina | 1960 | Latin America and the Caribbean | High | 7.53 |
73 | Argentina | 1965 | Latin America and the Caribbean | High | 8.14 |
74 | Argentina | 1970 | Latin America and the Caribbean | High | 8.76 |
75 | Argentina | 1975 | Latin America and the Caribbean | High | 9.53 |
americas.tail(20)
| | country | year | sub_region | income_group | pop_density |
---|---|---|---|---|---|
2388 | Uruguay | 1990 | Latin America and the Caribbean | High | 17.80 |
2389 | Uruguay | 1995 | Latin America and the Caribbean | High | 18.40 |
2390 | Uruguay | 2000 | Latin America and the Caribbean | High | 19.00 |
2391 | Uruguay | 2005 | Latin America and the Caribbean | High | 19.00 |
2392 | Uruguay | 2010 | Latin America and the Caribbean | High | 19.30 |
2393 | Uruguay | 2015 | Latin America and the Caribbean | High | 19.60 |
2422 | Venezuela | 1950 | Latin America and the Caribbean | Upper middle | 6.22 |
2423 | Venezuela | 1955 | Latin America and the Caribbean | Upper middle | 7.66 |
2424 | Venezuela | 1960 | Latin America and the Caribbean | Upper middle | 9.24 |
2425 | Venezuela | 1965 | Latin America and the Caribbean | Upper middle | 11.10 |
2426 | Venezuela | 1970 | Latin America and the Caribbean | Upper middle | 13.10 |
2427 | Venezuela | 1975 | Latin America and the Caribbean | Upper middle | 15.10 |
2428 | Venezuela | 1980 | Latin America and the Caribbean | Upper middle | 17.40 |
2429 | Venezuela | 1985 | Latin America and the Caribbean | Upper middle | 19.80 |
2430 | Venezuela | 1990 | Latin America and the Caribbean | Upper middle | 22.50 |
2431 | Venezuela | 1995 | Latin America and the Caribbean | Upper middle | 25.20 |
2432 | Venezuela | 2000 | Latin America and the Caribbean | Upper middle | 27.80 |
2433 | Venezuela | 2005 | Latin America and the Caribbean | Upper middle | 30.40 |
2434 | Venezuela | 2010 | Latin America and the Caribbean | Upper middle | 32.90 |
2435 | Venezuela | 2015 | Latin America and the Caribbean | Upper middle | 35.30 |
c) Use the unique() method on the country column to display the list of unique countries in the americas DataFrame.
americas['country'].unique()
array(['Antigua and Barbuda', 'Argentina', 'Bahamas', 'Barbados', 'Belize', 'Bolivia', 'Brazil', 'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Cuba', 'Dominican Republic', 'Ecuador', 'El Salvador', 'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 'Paraguay', 'Peru', 'Suriname', 'Trinidad and Tobago', 'United States', 'Uruguay', 'Venezuela'], dtype=object)
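When only the number of distinct values is needed, nunique() avoids counting the array by hand. A minimal sketch, assuming a short hypothetical Series named countries with repeated names:

```python
import pandas

# Hypothetical column with repeats, standing in for americas['country'].
countries = pandas.Series(
    ['Argentina', 'Argentina', 'Uruguay', 'Venezuela', 'Uruguay']
)

# unique() returns the distinct values in order of first appearance
unique_names = countries.unique()

# nunique() returns just how many distinct values there are
n_unique = countries.nunique()
print(unique_names, n_unique)
```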
For this exercise we're working with the original DataFrame world (containing all years and all countries).
a) Initial setup (you can skip to part b if you've already done this):
- Divide world['population'] by 1e6 and assign the result to world['pop_millions'].
world['pop_millions'] = world['population'] / 1e6
b) Group the DataFrame world by year and compute the world total population (in millions) in each year.
world.groupby('year', as_index=False)['pop_millions'].sum()
| | year | pop_millions |
---|---|---|
0 | 1950 | 2521.5914 |
1 | 1955 | 2755.4391 |
2 | 1960 | 3014.5238 |
3 | 1965 | 3317.6620 |
4 | 1970 | 3676.8109 |
5 | 1975 | 4052.1130 |
6 | 1980 | 4428.6840 |
7 | 1985 | 4841.1945 |
8 | 1990 | 5294.2122 |
9 | 1995 | 5714.3521 |
10 | 2000 | 6101.9393 |
11 | 2005 | 6495.9793 |
12 | 2010 | 6918.4071 |
13 | 2015 | 7345.2106 |
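The as_index=False argument is what makes year an ordinary column in the result above; without it, the group keys become the index of a Series. A minimal sketch of the difference, assuming a made-up DataFrame named sample with two placeholder countries and invented populations:

```python
import pandas

# Invented numbers; 'A' and 'B' are placeholder country names.
sample = pandas.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1950, 1950, 1955, 1955],
    'pop_millions': [1.0, 2.5, 1.5, 3.0],
})

# as_index=False: result is a DataFrame with 'year' as a regular column
totals_df = sample.groupby('year', as_index=False)['pop_millions'].sum()

# default (as_index=True): result is a Series indexed by 'year'
totals_series = sample.groupby('year')['pop_millions'].sum()

print(totals_df)
print(totals_series)
```

The DataFrame form is usually more convenient when the result is passed on to a plotting function that expects column names.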
a) Initial setup (you can skip to part b if you've already done this):
- Import the seaborn library and give it the nickname sns.
- Call sns.set_theme() (turns on seaborn styling).
- Create a DataFrame called world_2015 which contains only the rows of world where the column year is equal to 2015.
import seaborn as sns
sns.set_theme()
world_2015 = world[world['year'] == 2015]
b) Use relplot() to create a scatter plot of life_expectancy vs. gdp_per_capita from world_2015, in which the points are coloured by income_group.
sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='income_group');
c) Add the keyword argument aspect=1.5 to the relplot() function call. How does the plot change?
sns.relplot(data=world_2015, x='gdp_per_capita', y='life_expectancy', hue='income_group',
            aspect=1.5);
The aspect ratio changes (the plot becomes wider).
a) Initial setup (you can skip to part b if you've already done this):
- Create a list called income_order which contains the following strings: 'Low', 'Lower middle', 'Upper middle', 'High'
income_order = ['Low', 'Lower middle', 'Upper middle', 'High']
b) Use relplot() to create a plot similar to the previous example, but plotting life_expectancy on the y-axis instead of pop_millions and aggregating with the mean instead of the sum.
- To aggregate with the mean, pass estimator='mean'.
- Everything else stays the same: the world DataFrame, year on the x-axis, income_group mapped to line colour and style, income_order for the hue_order argument, and faceting on region.
# ci=None turns off the confidence band; seaborn 0.12+ prefers errorbar=None
sns.relplot(data=world, x='year', y='life_expectancy', hue='income_group',
            hue_order=income_order, style='income_group', kind='line',
            estimator='mean', ci=None, col='region',
            col_wrap=3, height=3);
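With kind='line' and estimator='mean', each plotted point is the average of the y values that share a year within one region and income group. Roughly the same numbers can be computed in pandas to inspect them as a table. A minimal sketch, assuming a made-up DataFrame named sample (the High rows are invented for illustration):

```python
import pandas

# Invented rows: two High-income countries in 2010, one in 2015, one Low in 2010.
sample = pandas.DataFrame({
    'region': ['Americas'] * 4,
    'income_group': ['High', 'High', 'Low', 'High'],
    'year': [2010, 2010, 2010, 2015],
    'life_expectancy': [70.0, 80.0, 32.1, 76.0],
})

# One mean per (region, income_group, year), i.e. one per plotted point
means = sample.groupby(
    ['region', 'income_group', 'year'], as_index=False
)['life_expectancy'].mean()
print(means)
```

A group that contains a single country, such as the Low row here, simply reports that country's value, so one unusual data point shows up undiluted in the line.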
Bonus: Do you spot anything strange in the subplot for the "Americas" region? How could you investigate this using the techniques we learned in the Intro to Pandas lesson?
There appears to be an outlier in the Americas low income group in 2010. To investigate, we can look at the rows of world for this group, to see if anything jumps out:
world[(world['income_group'] == 'Low') & (world['region'] == 'Americas')]
| | country | year | population | region | sub_region | income_group | life_expectancy | gdp_per_capita | children_per_woman | child_mortality | pop_density | years_in_school_men | years_in_school_women | pop_millions |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
938 | Haiti | 1950 | 3220000 | Americas | Latin America and the Caribbean | Low | 35.3 | 2290 | 6.31 | 352.0 | 117.0 | NaN | NaN | 3.22 |
939 | Haiti | 1955 | 3510000 | Americas | Latin America and the Caribbean | Low | 38.0 | 2260 | 6.30 | 313.0 | 128.0 | NaN | NaN | 3.51 |
940 | Haiti | 1960 | 3870000 | Americas | Latin America and the Caribbean | Low | 40.9 | 2300 | 6.32 | 286.0 | 140.0 | NaN | NaN | 3.87 |
941 | Haiti | 1965 | 4270000 | Americas | Latin America and the Caribbean | Low | 43.6 | 2010 | 6.18 | 264.0 | 155.0 | NaN | NaN | 4.27 |
942 | Haiti | 1970 | 4710000 | Americas | Latin America and the Caribbean | Low | 45.6 | 1980 | 5.76 | 241.0 | 171.0 | 2.93 | 1.73 | 4.71 |
943 | Haiti | 1975 | 5140000 | Americas | Latin America and the Caribbean | Low | 48.3 | 2260 | 5.64 | 215.0 | 187.0 | 3.34 | 2.05 | 5.14 |
944 | Haiti | 1980 | 5690000 | Americas | Latin America and the Caribbean | Low | 50.4 | 2860 | 6.06 | 190.0 | 206.0 | 3.81 | 2.42 | 5.69 |
945 | Haiti | 1985 | 6380000 | Americas | Latin America and the Caribbean | Low | 52.5 | 2480 | 6.03 | 166.0 | 232.0 | 4.34 | 2.87 | 6.38 |
946 | Haiti | 1990 | 7100000 | Americas | Latin America and the Caribbean | Low | 54.1 | 2250 | 5.43 | 145.0 | 258.0 | 4.93 | 3.40 | 7.10 |
947 | Haiti | 1995 | 7820000 | Americas | Latin America and the Caribbean | Low | 55.1 | 1680 | 4.89 | 124.0 | 284.0 | 5.59 | 4.04 | 7.82 |
948 | Haiti | 2000 | 8550000 | Americas | Latin America and the Caribbean | Low | 56.8 | 1740 | 4.30 | 105.0 | 310.0 | 6.21 | 4.69 | 8.55 |
949 | Haiti | 2005 | 9260000 | Americas | Latin America and the Caribbean | Low | 58.0 | 1560 | 3.76 | 89.8 | 336.0 | 6.86 | 5.39 | 9.26 |
950 | Haiti | 2010 | 10000000 | Americas | Latin America and the Caribbean | Low | 32.1 | 1500 | 3.33 | 208.0 | 363.0 | 7.52 | 6.12 | 10.00 |
951 | Haiti | 2015 | 10700000 | Americas | Latin America and the Caribbean | Low | 63.9 | 1650 | 2.97 | 68.9 | 389.0 | 8.20 | 6.90 | 10.70 |
We can see that the Americas low income group contains only one country, Haiti, and that life expectancy dropped sharply in 2010. Child mortality also spikes that year (208, up from 89.8 in 2005). Since a devastating earthquake struck Haiti in 2010, this drop in life expectancy likely reflects the disaster conditions. However, the magnitude of the decrease (from 58 down to 32 years) seems potentially unrealistic and could indicate an issue in how life expectancy was calculated in the Gapminder data.
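Rather than scanning the table by eye, the period-to-period change can be computed and filtered. A minimal sketch, using the last four Haiti rows from the table above; the DataFrame name haiti and the cutoff of -10 years are assumptions chosen for illustration, not part of the original exercise:

```python
import pandas

# Last four Haiti rows from the table above (year and life_expectancy only).
haiti = pandas.DataFrame({
    'country': ['Haiti'] * 4,
    'year': [2000, 2005, 2010, 2015],
    'life_expectancy': [56.8, 58.0, 32.1, 63.9],
})

# Change since the previous row of the same country (NaN for the first row)
haiti['change'] = haiti.groupby('country')['life_expectancy'].diff()

# Hypothetical cutoff: flag drops of more than 10 years between rows
threshold = -10
suspects = haiti[haiti['change'] < threshold]
print(suspects)
```

The same diff() could be applied to the full world DataFrame to look for similar drops in other countries, provided the rows for each country are in year order, as they appear to be in the output shown here.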