17. Pandas - Data Science with Python#
Numpy and numpy arrays are our tool of choice for numeric data that resembles vectors, matrices (and higher dimensional tensors).
Where data is gathered from experiments, and in particular where we want to extract meaning from the combination of different data sources, and where data is often incomplete, the pandas library offers a number of useful tools (and has become a standard tool for data scientists).
In this section, we introduce the basics of Pandas.
In particular, we introduce the two key data types in Pandas: the Series
and the DataFrame
objects.
By convention, the pandas
library is imported under the name pd
(the same way that numpy
is imported under the name np
:
import pandas as pd
17.1. Motivational example (Series)#
Imagine we are working on software for a greengrocer or supermarket, and need to track the number of apples (10), oranges(3) and bananas (22) that are available in the supermarket.
We could use a python list (or a numpy array) to track these numbers:
stock = [10, 3, 22]
However, we would need to remember separately that the entries are in the order of apples, oranges, and bananas. This could be achieved through a second list:
stocklabels = ['apple', 'orange', 'banana']
assert len(stocklabels) == len(stock) # check labels and
# stock are consistent
for label, count in zip(stocklabels, stock):
print(f'{label:10s} : {count:4d}')
apple : 10
orange : 3
banana : 22
The above 2-list solution is a little awkward in two ways: firstly, we have use two lists to describe one set of data (and thus need to be carefuly to update them simulatenously, for example), and secondly, the access to the data given a label is inconvenient: We need to find the index of the label with one list, then use this as the index to the other list, for example
index = stocklabels.index('banana')
bananas = stock[index]
print(f"There are {bananas} bananas [index={index}].")
There are 22 bananas [index=2].
We have come across similar examples in the section on dictionaries, and indeed a dictionary is a more convenient solution:
stock_dic = {'apple': 10,
'orange': 3,
'banana': 22}
The keys of the dictionary contain the stock labels and the values contain the actual values:
stock_dic.keys()
dict_keys(['apple', 'orange', 'banana'])
stock_dic.values()
dict_values([10, 3, 22])
To retrieve (or change) the value for apple
, we use apple
as the key and retrieve the value through the dictionary’s indexing notation:
stock_dic['apple']
10
And we can summarise the stock as follows:
for label in stock_dic:
print(f'{label:10s} : {stock_dic[label]:4d}')
apple : 10
orange : 3
banana : 22
This is a vast improvement over the 2-lists solution: (i) we only maintain one structure, which contains a value for every key - so we don’t need to check that the lists have the same length. (ii) we can access individual elements through the label (using it as a key for the dictionary).
The Pandas Series object address the requirements above. It is similar to a dictionary, but with improvements for the given problem:
the order of the items is maintained
the values have to have the same type (higher execution performance)
a (large) number of convenience functionality, for example to deal with missing data, time series, sorting, plotting, and more
17.2. Pandas Series
#
17.2.1. Stock example - Series
#
We can create a Series
object - for example - from a dictionary:
stock = pd.Series({'apple': 10,
'orange': 3,
'banana': 22})
The default presentation shows the entries one per row, with the label on the left, and the value on the right.
type(stock)
pandas.core.series.Series
stock
apple 10
orange 3
banana 22
dtype: int64
The items on the left are referred to as the index
of the Series, and are available as the index
attribute of the series
object:
stock.index
Index(['apple', 'orange', 'banana'], dtype='object')
type(stock.index)
pandas.core.indexes.base.Index
We can also access the list of values for each item, using the values
attribute:
stock.values
array([10, 3, 22])
Regarding data access, the Series
object behaves like a dictionary:
stock['apple']
10
stock['potato'] = 101 # adding more values
stock['cucumber'] = 1
print(stock)
apple 10
orange 3
banana 22
potato 101
cucumber 1
dtype: int64
stock
apple 10
orange 3
banana 22
potato 101
cucumber 1
dtype: int64
We can plot the data as a bar chart:
%matplotlib inline
# settings for Jupyter book: svg for html version, high-resolution png for pdf
import matplotlib
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'png')
matplotlib.rcParams['figure.dpi'] = 400
stock.plot(kind='bar')
<Axes: >
We can sort the data according to the values in the Series (and then plot to visualise):
stock.sort_values().plot(kind='bar')
<Axes: >
Or sort the index to get alphabetical order of our fruit and vegetables:
stock.sort_index().plot(kind='bar')
<Axes: >
The Series
object has a number of numerical methods available, including mean
and sum
:
stock.sum()
137
stock.mean()
27.4
It also behaves like a sequence in that the len
function returns the number of data points in the Series object:
len(stock)
5
17.2.2. memory usage#
For larger data sets, it might be important to know how many bytes storing the Series costs. The bytes required to store the actual series data are available as
stock.nbytes
40
or from the underlying numpy array directly:
stock.values.nbytes
40
It is 40 bytes, because we have 5 elements stored as int64 (each needing 8 bytes):
stock.dtype
dtype('int64')
The Series object needs additional memory. This can be queried using:
stock.memory_usage()
252
17.2.3. Statistics#
A number of statistical descriptors of the data in the stock
Series object is available using describe()
:
stock.describe()
count 5.000000
mean 27.400000
std 41.955929
min 1.000000
25% 3.000000
50% 10.000000
75% 22.000000
max 101.000000
dtype: float64
As usual, the documentation strings provide documentation (help(stock.describe)
), and the pandas home page (https://pandas.pydata.org
) provides links to the Pandas documentation.
17.3. Create Series from list#
In the example above, we showed how to create a Series from a dictionary where the keys of the dictionary entries served as the index for the Series object.
We can also create a Series from a list, an provide an additional index:
stock = pd.Series([10, 3, 22], index=['apple', 'orange', 'banana'])
stock
apple 10
orange 3
banana 22
dtype: int64
If we omit the index
argument, the Series will assume an integer index:
stock = pd.Series([10, 3, 22])
stock
0 10
1 3
2 22
dtype: int64
However, an index can be added subsequently:
stock.index = ['apple', 'orange', 'banana']
stock
apple 10
orange 3
banana 22
dtype: int64
17.4. Plotting data#
Commonly used plots are easily accessible via the plot()
method of the Series object. We have seen a bar plot above already. The Series.plot()
method accepts an argument kind
such as kind="bar"
, but there is an equivalent method Series.plot.bar()
available.
Further examples:
stock.plot.pie()
<Axes: >
To tailor the plot, we can either get the axis object and modify it subsequently:
ax = stock.plot.pie()
ax.set_aspect(1)
ax.set_ylabel(None);
ax.set_title(None);
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(9, 3))
stock.plot.bar(ax=ax)
ax.set_title("Current stock");
We can also retrieve the data from the series and drive the plotting “manually” ourselves:
import matplotlib.pyplot as plt
names = list(stock.index) # conversion to list not necessary
values = list(stock.values) # conversion to list not necessary
fig, ax = plt.subplots(1, 1, figsize=(9, 3))
ax.bar(names, values)
ax.set_title('Stock');
17.5. Missing values#
“Real” data sets tend to be incomplete. Dealing with missing values is an important topic in data science. The agreement in Pandas is that the special floating point value “NaN” (standing for N
ot a
N
umber) represents missing data points. For example, if we have a table for the stock, but we don’t know the value for apple
, we would replace it with NaN
.
The special Nan
value in Python can be created using float('nan')
or using numpy.nan
if the numpy
module is imported.
stock['apple'] = float('nan')
stock
apple NaN
orange 3.0
banana 22.0
dtype: float64
Note that the dtype
of the stock
Series object has changed from int64
to float64
when we assigned NaN
to apple
: the whole series has been converted to float, because NaN
is only defined for floating point numbers.
(There is a proposal to create a NaN
object as part of pandas - this would overcome the above limitation.)
Assume we need to calculate how many items of stock we have in total using the sum
function:
stock.values
array([nan, 3., 22.])
A common situation is that we have an incomplete Series or DataFrame (which are multiple Series with the same index) and we want to process with our analysis, but treat the missing values in a special way.
stock.sum()
25.0
The above example sum
shows that NaN
values are simply ignored, which can be convenient.
We can also ‘tidy up’ the Series object, by removing all entries that have a NaN
value:
stock.dropna()
orange 3.0
banana 22.0
dtype: float64
17.6. Series data access: explicit and implicit (loc
and iloc
)#
stock = pd.Series({'apple': 10,
'orange': 3,
'banana': 22,
'cucumber' : 1,
'potato' : 110})
stock
apple 10
orange 3
banana 22
cucumber 1
potato 110
dtype: int64
17.6.1. Indexing#
We can access single values through their index as if the stock Series object would be a dictionary:
stock['banana']
22
There is an equivalent and recommended way of using this retrieval using the loc
(for LOCation?) attribute:
stock.loc['banana']
22
For convenience, pandas also (!) allows us to use integer indexing into the Series object. This is called implicit indexing as the series Object doesn’t use integers as the index, but the name of the fruits.
For example, we can also retrieve the value for banana
through its implicit index 2, because it is in row 3 of the Series object (which would need index 2 as we start counting from 0):
stock[2]
/tmp/ipykernel_293/4060357290.py:1: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
stock[2]
22
In this example, this works fine and seems convenient, but can become very confusing if the actual index of the object consists of integers. For that reason, the explicit (and recommended way) of using the indirect indexing is through the iloc
(ImplicitLOCation) attribute:
stock.iloc[2]
22
17.6.2. Slicing#
stock
apple 10
orange 3
banana 22
cucumber 1
potato 110
dtype: int64
We can also slice the Series:
stock['orange':'potato']
orange 3
banana 22
cucumber 1
potato 110
dtype: int64
Or skip every second entry:
stock['orange':'potato':2]
orange 3
cucumber 1
dtype: int64
17.6.3. Data manipulation#
Numerical operations on the series object can be carried for all data values at the same time inthe same way that numpy arrays are processed:
stock - stock.mean()
apple -19.2
orange -26.2
banana -7.2
cucumber -28.2
potato 80.8
dtype: float64
import numpy as np
np.sqrt(stock)
apple 3.162278
orange 1.732051
banana 4.690416
cucumber 1.000000
potato 10.488088
dtype: float64
Where preferred, we can extract the numpy array and work with that:
data = stock.values
type(data)
numpy.ndarray
data - data.mean()
array([-19.2, -26.2, -7.2, -28.2, 80.8])
17.6.4. Import and Export#
Pandas (and its objects Series
and DataFrame
) support export to and import from a number of useful formats.
For example, we can write a Series
object into a comma separated value file:
stock.to_csv('stock.csv', header=False)
#NBVAL_IGNORE_OUTPUT
!cat stock.csv
apple,10
orange,3
banana,22
cucumber,1
potato,110
We can also create a LaTeX representation of the table:
stock.to_latex()
'\\begin{tabular}{lr}\n\\toprule\n & 0 \\\\\n\\midrule\napple & 10 \\\\\norange & 3 \\\\\nbanana & 22 \\\\\ncucumber & 1 \\\\\npotato & 110 \\\\\n\\bottomrule\n\\end{tabular}\n'
We’ll come back to reading from files in the DataFrame
section.
17.7. Data Frame#
17.7.1. Stock Example - DataFrame
#
After having introduced the Series
object above, we will focus on the second important type in pandas: the DataFrame
.
As a first description, we could say that the DataFrame
is similar to a (2d) spreadsheet: it contains rows and columns.
The series object we have studied above is a special case of the DataFrame
, where the DataFrame
has only one column.
We’ll continue with our stock example:
stock
apple 10
orange 3
banana 22
cucumber 1
potato 110
dtype: int64
In addition to tracking how many objects of each type we have stocked, we have a second Series object that provides the price per item at which the item is sold:
price = pd.Series({'apple': 0.55, 'banana': 0.50, 'cucumber' : 0.99, 'potato' : 0.17, 'orange': 1.76})
price
apple 0.55
banana 0.50
cucumber 0.99
potato 0.17
orange 1.76
dtype: float64
The DataFrame
object allows us to treat the two series together. In fact, a convenient way to create the DataFrame
object is to combine a number of series as follows:
shop = pd.DataFrame({'stock' : stock, 'price' : price})
shop
stock | price | |
---|---|---|
apple | 10 | 0.55 |
banana | 22 | 0.50 |
cucumber | 1 | 0.99 |
orange | 3 | 1.76 |
potato | 110 | 0.17 |
Because both Series
objects had the same index
elements, our data is nicely aligned in the DataFrame
with name shop
, even though the data was stored in different order in the price
and stock
.
If one Series is missing a data point, pandas will insert a NaN
entry into that field:
price2 = price.copy()
price2['grapefruit'] = 1.99
price2
apple 0.55
banana 0.50
cucumber 0.99
potato 0.17
orange 1.76
grapefruit 1.99
dtype: float64
pd.DataFrame({'stock' : stock, 'price' : price2})
stock | price | |
---|---|---|
apple | 10.0 | 0.55 |
banana | 22.0 | 0.50 |
cucumber | 1.0 | 0.99 |
grapefruit | NaN | 1.99 |
orange | 3.0 | 1.76 |
potato | 110.0 | 0.17 |
17.7.2. Accessing data in a DataFramea#
shop
stock | price | |
---|---|---|
apple | 10 | 0.55 |
banana | 22 | 0.50 |
cucumber | 1 | 0.99 |
orange | 3 | 1.76 |
potato | 110 | 0.17 |
The data frame has an index which is the same for all columns, and shown in bold in the left most column. We can also ask for it:
shop.index
Index(['apple', 'banana', 'cucumber', 'orange', 'potato'], dtype='object')
Each column has name (here stock
and price
):
shop.columns
Index(['stock', 'price'], dtype='object')
17.7.3. Extracting columns of data#
Using the column names, we can extract one column into a Series object using the index operator ([]
):
shop['stock']
apple 10
banana 22
cucumber 1
orange 3
potato 110
Name: stock, dtype: int64
shop['price']
apple 0.55
banana 0.50
cucumber 0.99
orange 1.76
potato 0.17
Name: price, dtype: float64
17.7.4. Extracting rows of data#
We have two options of extracting a row of data.
First, explicit indexing using the label of the index in that row:
shop.loc['apple'] # single row is returned as series
stock 10.00
price 0.55
Name: apple, dtype: float64
shop.loc['banana':'cucumber'] # multiple rows are returned as DataFrame
stock | price | |
---|---|---|
banana | 22 | 0.50 |
cucumber | 1 | 0.99 |
Second, we can use the implicit indexing (as for Series objects):
shop.iloc[0]
stock 10.00
price 0.55
Name: apple, dtype: float64
shop.iloc[1:3]
stock | price | |
---|---|---|
banana | 22 | 0.50 |
cucumber | 1 | 0.99 |
Warning
Note that there are some inconsistencies here: the explicit slicing with index labels (such as .loc['banana':'cucumber']
) is inclusive of cucumber
, whereas in the implicit slicing (such as .iloc[1:3]
) the row with index 3
is not included.
The behaviour of .loc
is convenient and a good design choice if labels such as strings in our stock
example are used. The behaviour of .iloc
is reflecting the normal Python behaviour.
It is thus understandable how we have arrived at the situation.
17.7.5. Data manipulation with shop
#
The real strength of the DataFrames is that we can continue to process the data conveniently.
For example, we could work out the financial value of the items we have in stock, and add this as an extra column:
shop['value'] = shop['price'] * shop['stock']
shop
stock | price | value | |
---|---|---|---|
apple | 10 | 0.55 | 5.50 |
banana | 22 | 0.50 | 11.00 |
cucumber | 1 | 0.99 | 0.99 |
orange | 3 | 1.76 | 5.28 |
potato | 110 | 0.17 | 18.70 |
Of course we can compute the sum, for example, to estimate the value of the total stock:
shop['value'].sum()
41.47
If, for whatever reason, we want to swap columns with rows, we can transpose
the data frame like a numpy array:
shop.transpose()
apple | banana | cucumber | orange | potato | |
---|---|---|---|---|---|
stock | 10.00 | 22.0 | 1.00 | 3.00 | 110.00 |
price | 0.55 | 0.5 | 0.99 | 1.76 | 0.17 |
value | 5.50 | 11.0 | 0.99 | 5.28 | 18.70 |
17.8. Example: European population 2017#
Here is a second example to demonstrate some use cases of pandas DataFrames.
First, we get the data. It is originally from EUROSTAT (reference “demo_gind”)
#NBVAL_IGNORE_OUTPUT
!wget https://fangohr.github.io/data/eurostat/population2017/eu-pop-2017.csv
--2024-04-14 10:28:17-- https://fangohr.github.io/data/eurostat/population2017/eu-pop-2017.csv
Resolving fangohr.github.io (fangohr.github.io)... 185.199.108.153, 185.199.109.153, 185.199.111.153, ...
Connecting to fangohr.github.io (fangohr.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response...
200 OK
Length: 1087 (1.1K) [text/csv]
Saving to: ‘eu-pop-2017.csv’
eu-pop-2017.csv 0%[ ] 0 --.-KB/s
eu-pop-2017.csv 100%[===================>] 1.06K --.-KB/s in 0s
2024-04-14 10:28:17 (66.2 MB/s) - ‘eu-pop-2017.csv’ saved [1087/1087]
The data source is a comma-separated-value file (CSV), which looks like this:
#NBVAL_IGNORE_OUTPUT
!head eu-pop-2017.csv
geo,pop17,pop18,births,deaths
Belgium ,11351727,11413058,119690,109666
Bulgaria,7101859,7050034,63955,109791
Czechia,10578820,10610055,114405,111443
Denmark,5748769,5781190,61397,53261
Germany,82521653,82850000,785000,933000
Estonia ,1315634,1319133,13784,15543
Ireland,4784383,4838259,62084,30324
Greece,10768193,10738868,88523,124530
Spain,46527039,46659302,390024,421269
Pandas has very strong support of reading files from different formats, including MS Excel, CSV, HDF5 and others. Each reading routine has a number of options to tailor the process.
Many data science projects leave the data in their original files, and use a few lines of Python code to import it.
df = pd.read_csv('eu-pop-2017.csv')
df
geo | pop17 | pop18 | births | deaths | |
---|---|---|---|---|---|
0 | Belgium | 11351727 | 11413058 | 119690 | 109666 |
1 | Bulgaria | 7101859 | 7050034 | 63955 | 109791 |
2 | Czechia | 10578820 | 10610055 | 114405 | 111443 |
3 | Denmark | 5748769 | 5781190 | 61397 | 53261 |
4 | Germany | 82521653 | 82850000 | 785000 | 933000 |
5 | Estonia | 1315634 | 1319133 | 13784 | 15543 |
6 | Ireland | 4784383 | 4838259 | 62084 | 30324 |
7 | Greece | 10768193 | 10738868 | 88523 | 124530 |
8 | Spain | 46527039 | 46659302 | 390024 | 421269 |
9 | France | 66989083 | 67221943 | 767691 | 603141 |
10 | Croatia | 4154212 | 4105493 | 36556 | 53477 |
11 | Italy | 60589445 | 60483973 | 458151 | 649061 |
12 | Cyprus | 854802 | 864236 | 9229 | 5997 |
13 | Latvia | 1950116 | 1934379 | 20828 | 28757 |
14 | Lithuania | 2847904 | 2808901 | 28696 | 40142 |
15 | Luxembourg | 590667 | 602005 | 6174 | 4263 |
16 | Hungary | 9797561 | 9778371 | 94646 | 131877 |
17 | Malta | 460297 | 475701 | 4319 | 3571 |
18 | Netherlands | 17081507 | 17181084 | 169200 | 150027 |
19 | Austria | 8772865 | 8822267 | 87633 | 83270 |
20 | Poland | 37972964 | 37976687 | 401982 | 402852 |
21 | Portugal | 10309573 | 10291027 | 86154 | 109586 |
22 | Romania | 19644350 | 19523621 | 189474 | 260599 |
23 | Slovenia | 2065895 | 2066880 | 20241 | 20509 |
24 | Slovakia | 5435343 | 5443120 | 57969 | 53914 |
25 | Finland | 5503297 | 5513130 | 50321 | 53722 |
26 | Sweden | 9995153 | 10120242 | 115416 | 91972 |
27 | United Kingdom | 65808573 | 66238007 | 755043 | 607172 |
We look at the dataframe as it is, and use the ‘head()’ command which will only show the first 5 lines of data:
df.head()
geo | pop17 | pop18 | births | deaths | |
---|---|---|---|---|---|
0 | Belgium | 11351727 | 11413058 | 119690 | 109666 |
1 | Bulgaria | 7101859 | 7050034 | 63955 | 109791 |
2 | Czechia | 10578820 | 10610055 | 114405 | 111443 |
3 | Denmark | 5748769 | 5781190 | 61397 | 53261 |
4 | Germany | 82521653 | 82850000 | 785000 | 933000 |
The meaning of the colums, we have to get from metada information. In this case, we have the following description of the data:
geo: the country in question
pop17: the population count of that country as of 1 January 2017
pop18: the population count of that country as of 1 January 2018
births: the number of (live) births in the country during the year 2017
deaths: the number of deaths in that country during the year 2017
The data is provided for all of the 28 European Union members (as of 2017).
We want to use the country as the country name as the index. We can achieve this either with
df2 = df.set_index('geo')
df2.head()
pop17 | pop18 | births | deaths | |
---|---|---|---|---|
geo | ||||
Belgium | 11351727 | 11413058 | 119690 | 109666 |
Bulgaria | 7101859 | 7050034 | 63955 | 109791 |
Czechia | 10578820 | 10610055 | 114405 | 111443 |
Denmark | 5748769 | 5781190 | 61397 | 53261 |
Germany | 82521653 | 82850000 | 785000 | 933000 |
Note that we cannot change the index in a given DataFrame, so the set_index()
method returns a new DataFrame. (This happens for many operations.)
An as alternative, we can also modify the import statement to already indicate which column we want to use as the index:
df = pd.read_csv('eu-pop-2017.csv', index_col="geo")
df.head()
pop17 | pop18 | births | deaths | |
---|---|---|---|---|
geo | ||||
Belgium | 11351727 | 11413058 | 119690 | 109666 |
Bulgaria | 7101859 | 7050034 | 63955 | 109791 |
Czechia | 10578820 | 10610055 | 114405 | 111443 |
Denmark | 5748769 | 5781190 | 61397 | 53261 |
Germany | 82521653 | 82850000 | 785000 | 933000 |
We explore the data by plotting some of it:
df.plot(kind='bar', y='pop17')
<Axes: xlabel='geo'>
The above shows the population as of 1 Jan 2017.
We’ll try to improve this in two ways:
we want to count population in millions. We can do this by dividing all the data by \(10^6\).
it would be interesting to sort the countries in order of size for this plot.
df_millions = df / 1e6
df_millions['pop17'].sort_values(ascending=False).plot(kind='bar')
<Axes: xlabel='geo'>
The example above selects one column from the data frame (['pop17'
) and that returns a Series
object. Then we sort this Series
object using sort_values()
according to the values (that’s the number of poeple in each country), then we plot this.
Alternatively, we could also create a plot for the whole data frame, but say that the pop17
is the column for sorting, and that we want to plot only the column with pop17
:
df_millions.sort_values(by='pop17').plot(kind='bar', y='pop17')
<Axes: xlabel='geo'>
We can also plot more than one column at the same time:
ax = df_millions.sort_values(by='pop17').plot(kind='bar', y=['pop17', 'pop18'])
We can also fine tune the plot with the usual matplotlib
commands:
ax = df_millions.sort_values(by='pop17').plot(kind='bar', y='pop17', figsize=(10, 4))
ax.set_ylabel("population 2017 [in millions]")
ax.grid()
ax.set_xlabel(None); # get rid of default label for x-axis ('geo')
Based on the number of births and deaths, we can compute change in population for each country for 2017. This is sometimes called the “natural-change”:
df['natural-change'] = df['births'] - df['deaths']
df['natural-change'].sort_values()
geo
Italy -190910
Germany -148000
Romania -71125
Bulgaria -45836
Hungary -37231
Greece -36007
Spain -31245
Portugal -23432
Croatia -16921
Lithuania -11446
Latvia -7929
Finland -3401
Estonia -1759
Poland -870
Slovenia -268
Malta 748
Luxembourg 1911
Czechia 2962
Cyprus 3232
Slovakia 4055
Austria 4363
Denmark 8136
Belgium 10024
Netherlands 19173
Sweden 23444
Ireland 31760
United Kingdom 147871
France 164550
Name: natural-change, dtype: int64
From this, we can see that the population change due to births and deaths in Italy and Germany is decreasing most in absolute terms.
To relate this to the overall size of the population, one often uses rates per year and per 1000 people in the country, such as the birth rate per 1000 inhabitants [1] (and death rate accordingly):
[1] https://en.wikipedia.org/wiki/Birth_rate
df['birth-rate'] = df['births'] / df['pop17'] * 1000
df['death-rate'] = df['deaths'] / df['pop17'] * 1000
df['natural-change-rate'] = df['natural-change'] / df['pop17'] * 1000
df.head()
pop17 | pop18 | births | deaths | natural-change | birth-rate | death-rate | natural-change-rate | |
---|---|---|---|---|---|---|---|---|
geo | ||||||||
Belgium | 11351727 | 11413058 | 119690 | 109666 | 10024 | 10.543770 | 9.660733 | 0.883037 |
Bulgaria | 7101859 | 7050034 | 63955 | 109791 | -45836 | 9.005389 | 15.459473 | -6.454085 |
Czechia | 10578820 | 10610055 | 114405 | 111443 | 2962 | 10.814533 | 10.534540 | 0.279993 |
Denmark | 5748769 | 5781190 | 61397 | 53261 | 8136 | 10.680026 | 9.264766 | 1.415260 |
Germany | 82521653 | 82850000 | 785000 | 933000 | -148000 | 9.512655 | 11.306123 | -1.793469 |
We can now look at the natural rate of change of population for each country, which is normalised by the population in that country.
ax = df.sort_values(by='natural-change-rate').plot(kind='bar', y='natural-change-rate', figsize=(10, 4))
ax.set_title("Natural change due to births and deaths per 1000 in 2017");
We can show the data together with the underlying birth and death rate data:
tmp = df.sort_values(by='natural-change-rate')
fig, axes = plt.subplots(2, 1, figsize=(12, 6))
tmp.plot(kind='bar', y=['natural-change-rate'], sharex=True, ax=axes[0])
axes[0].set_title("Population change per 1000 in 2017")
tmp.plot(kind='bar', y=['death-rate', 'birth-rate'], sharex=True, ax=axes[1])
<Axes: xlabel='geo'>
We haven’t used the information we have about the population on 1 January 2018 yet.
Let’s first look at the absolute changes in the population based on the (census?) data from 1 Jan 2017 and 1 Jan 2018:
df['change'] = df['pop18'] - df['pop17']
ax = df.sort_values(by='change').plot(y='change', kind='bar')
ax.set_title("Total change in population per country in 2017");
With that information, we can estimate migration. (It is important to note that this estimated number will also absorb all inaccuracies or changes of the data gathering method, in the original data described as “statistical adjustment”.)
df['migration'] = df['change'] - df['natural-change']
df.head()
pop17 | pop18 | births | deaths | natural-change | birth-rate | death-rate | natural-change-rate | change | migration | |
---|---|---|---|---|---|---|---|---|---|---|
geo | ||||||||||
Belgium | 11351727 | 11413058 | 119690 | 109666 | 10024 | 10.543770 | 9.660733 | 0.883037 | 61331 | 51307 |
Bulgaria | 7101859 | 7050034 | 63955 | 109791 | -45836 | 9.005389 | 15.459473 | -6.454085 | -51825 | -5989 |
Czechia | 10578820 | 10610055 | 114405 | 111443 | 2962 | 10.814533 | 10.534540 | 0.279993 | 31235 | 28273 |
Denmark | 5748769 | 5781190 | 61397 | 53261 | 8136 | 10.680026 | 9.264766 | 1.415260 | 32421 | 24285 |
Germany | 82521653 | 82850000 | 785000 | 933000 | -148000 | 9.512655 | 11.306123 | -1.793469 | 328347 | 476347 |
Let’s plot the total change of the population per country in the top subfigure, and the contribution from natural changes and migration in the lower subfigure:
tmp = df.sort_values(by='change')
fig, axes = plt.subplots(2, 1, figsize=(12, 6))
tmp.plot(kind='bar', y=['change'], sharex=True, ax=axes[0])
axes[0].set_title("Population changes in 2017")
axes[0].legend(['total change of population (migration + natural change due to deaths and births'])
tmp.plot(kind='bar', y=['migration', 'natural-change'], sharex=True, ax=axes[1])
axes[1].legend(['Migration', "natural change due to deaths and births"])
axes[1].set_xlabel(None);
17.9. Further reading#
There is a lot more to say about Pandas. The following resources may be useful but there are countless others available:
Further reading on
[]
,.loc[]
and.iloc[]
from Ted Petrou as a Jupyter Notebook and blog entry.Jake VanderPlas: Python Data Science Handbook online