We will continue our exploration of the amazing Pandas-datareader. In this tutorial we will further investigate data from the World Bank and correlate it with the S&P 500 stock index. We will do this both by visualizing the three graphs on two different y-axes and by computing the correlation.
Step 1: Get World Bank data
In this tutorial we will only look at data from the United States. If you are interested in other tutorials on World Bank data, you should read this one and this one.
To get the data from the World Bank you can use the Pandas-datareader, which has a function to download data if you have the indicator.
Problem! The date is a datetime object in the S&P 500 Series, while it is a string with a year in the World Bank DataFrame with the unemployment rate and interest rate.
To join them successfully, we need to convert them to the same format. The best approach is to convert both to datetime objects, which we can do with the pd.to_datetime() function.
import pandas_datareader as pdr
import pandas as pd
import datetime as dt
from pandas_datareader import wb
data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')
start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name='S&P 500'
data = sp500.to_frame().join(data, how='outer')
print(data)
Resulting in the following output.
S&P 500 US-int US-unempl
1990-01-01 NaN 6.039744 NaN
1990-01-02 359.690002 NaN NaN
1990-01-03 358.760010 NaN NaN
1990-01-04 355.670013 NaN NaN
1990-01-05 352.200012 NaN NaN
... ... ... ...
2019-12-24 3223.379883 NaN NaN
2019-12-26 3239.909912 NaN NaN
2019-12-27 3240.020020 NaN NaN
2019-12-30 3221.290039 NaN NaN
2019-12-31 3230.780029 NaN NaN
The problem you see here is that the US-int and US-unempl columns only have data on the first of January each year. To fix that, we can make a linear interpolation of the data by applying the following.
data = data.interpolate(method='linear')
Resulting in.
S&P 500 US-int US-unempl
1990-01-01 NaN 6.039744 NaN
1990-01-02 359.690002 6.035318 NaN
1990-01-03 358.760010 6.030891 NaN
1990-01-04 355.670013 6.026464 NaN
1990-01-05 352.200012 6.022037 NaN
... ... ... ...
2019-12-24 3223.379883 3.478200 3.682
2019-12-26 3239.909912 3.478200 3.682
2019-12-27 3240.020020 3.478200 3.682
2019-12-30 3221.290039 3.478200 3.682
2019-12-31 3230.780029 3.478200 3.682
Notice that since there is no unemployment data for the US in 1990, the values remain NaN until the first rate is given.
Step 3: Visualize all three graphs with two different y-axes
Here Pandas is quite strong: by default, you can create a secondary y-axis. The three datasets only need two y-axes, as the unemployment rate and the interest rate can share one.
import pandas_datareader as pdr
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from pandas_datareader import wb
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
data = wb.download(indicator=['SL.UEM.TOTL.ZS', 'FR.INR.RINR'], country=['US'], start=1990, end=2019)
uem_data = data.unstack().T.loc['SL.UEM.TOTL.ZS']
uem_data.columns = ['US-unempl']
int_data = data.unstack().T.loc['FR.INR.RINR']
int_data.columns = ['US-int']
data = int_data.join(uem_data)
data.index = pd.to_datetime(data.index, format='%Y')
start = dt.datetime(1990, 1, 1)
end = dt.datetime(2019, 12, 31)
sp500 = pdr.get_data_yahoo("^GSPC", start, end)['Adj Close']
sp500.name='S&P 500'
data = sp500.to_frame().join(data, how='outer')
data = data.interpolate(method='linear')
ax = data['S&P 500'].plot(legend=True)
ax = data[['US-int','US-unempl']].plot(ax=ax, secondary_y=True, legend=True)
print(data.corr())
plt.show()
We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial, and finally, the third part before continuing.
In this part we will investigate if we can see any correlation between the major of education and the 6 dimensions of the RIASEC personality types.
Step 1: Group by major of education
This is getting tricky, as the majors are typed in freely by the respondents, so we will miss some connections between them.
But let’s start by exploring them.
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
print(major.groupby('major').size().sort_values(ascending=False))
The output is given here.
major
psychology 6861
Psychology 5763
English 2342
Business 2290
Biology 1289
...
Sociology, Social work 1
Sociology, Psychology 1
Sociology, Math 1
Sociology, Linguistics 1
Nuerobiology 1
Length: 15955, dtype: int64
Here we identify one problem: some respondents write in lowercase and others in uppercase.
Step 2: Clean up a few ambiguities
The first step would be to lowercase everything.
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
Now we print the first 10 lines.
major
psychology 12766
business 3496
english 3042
nursing 2142
biology 1961
education 1800
engineering 1353
accounting 1186
computer science 1159
psychology 1098
dtype: int64
Here we notice that psychology appears both first and last. Inspecting it further, it seems the last one has a trailing space. Hence, we can try to strip the whitespace around all the majors.
import pandas as pd
import numpy as np
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
major = data.loc[:,['major']]
major['major'] = major['major'].str.lower()
major['major'] = major.apply(lambda row: row['major'].strip() if row['major'] is not np.nan else np.nan, axis=1)
print(major.groupby('major').size().sort_values(ascending=False).iloc[:10])
Now the output is as follows.
major
psychology 13878
business 3848
english 3240
nursing 2396
biology 2122
education 1954
engineering 1504
accounting 1292
computer science 1240
law 1111
dtype: int64
This introduces law at the bottom of the list.
The cleaning process could continue, but let’s keep the focus on these 10 most represented majors in the dataset. Obviously, more data points could be included by investigating further.
Step 3: See if education correlates to known words
First let’s explore the dataset a bit more. The respondents are asked if they know the definitions of the following words.
boat
incoherent
pallid
robot
audible
cuivocal
paucity
epistemology
florted
decide
pastiche
verdid
abysmal
lucid
betray
funny
They mark each word they know. Hence, we can count the number of words each respondent knows and calculate an average per major group.
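A minimal sketch of that count, assuming the 16 word columns are named VCL1 through VCL16 as in the codebook (1 meaning the word is marked as known):
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
# Count how many of the 16 words each respondent marked as known.
word_cols = [f'VCL{i}' for i in range(1, 17)]
data['words_known'] = data[word_cols].sum(axis=1)
# Clean the major column as in Step 2 and average the word count per major.
data['major'] = data['major'].str.lower().str.strip()
print(data.groupby('major')['words_known'].mean().sort_values(ascending=False).iloc[:10])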
Correlation between major and RIASEC personality traits
Biology has a high I (Investigative: people who prefer to work with data), while R (Realistic: people who like to work with things) is dominated by engineers and computer scientists.
Hmm… I should have noticed that many respondents have education as their major.
We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first as well as the second part of the tutorial.
In this part we are going to combine some data into the 6 personality dimensions of RIASEC and see if there is any correlation with the educational level.
Step 1: Understand the dataset better
The dataset is built by letting the respondents rate themselves on statements related to the 6 personality types in RIASEC. The personality types are given as follows (also see Wikipedia for a deeper description).
Realistic (R): People that like to work with things. They tend to be “assertive and competitive, and are interested in activities requiring motor coordination, skill and strength”. They approach problem solving “by doing something, rather than talking about it, or sitting and thinking about it”. They also prefer “concrete approaches to problem solving, rather than abstract theory”. Finally, their interests tend to focus on “scientific or mechanical rather than cultural and aesthetic areas”.
Investigative (I): People who prefer to work with “data.” They like to “think and observe rather than act, to organize and understand information rather than to persuade”. They also prefer “individual rather than people oriented activities”.
Artistic (A): People who like to work with “ideas and things”. They tend to be “creative, open, inventive, original, perceptive, sensitive, independent and emotional”. They rebel against “structure and rules”, but enjoy “tasks involving people or physical skills”. They tend to be more emotional than the other types.
Social (S): People who like to work with “people” and who “seem to satisfy their needs in teaching or helping situations”. They tend to be “drawn more to seek close relationships with other people and are less apt to want to be really intellectual or physical”.
Enterprising (E): People who like to work with “people and data”. They tend to be “good talkers, and use this skill to lead or persuade others”. They “also value reputation, power, money and status”.
Conventional (C): People who prefer to work with “data” and who “like rules and regulations and emphasize self-control … they like structure and order, and dislike unstructured or unclear work and interpersonal situations”. They also “place value on reputation, power, or status”.
In the test they have rated themselves from 1 to 5 (1=Dislike, 3=Neutral, 5=Enjoy) on statements related to these 6 personality types.
That way each respondent can be rated on these 6 dimensions.
Step 2: Prepare the dataset
We want to score each respondent according to how they rated themselves on the 8 statements for each of the 6 personality types.
In the view we create, we keep the education column together with the dimension scores we calculate, because we want to see if there is any correlation between education level and personality type.
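A minimal sketch of how that scoring could be done, assuming the statement columns follow the R1–R8, I1–I8, … naming from the codebook:
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
# Build a view with education and one summed score per dimension.
view = data.loc[:, ['education']].copy()
for dim in ['R', 'I', 'A', 'S', 'E', 'C']:
    view[dim] = data[[f'{dim}{i}' for i in range(1, 9)]].sum(axis=1)
print(view.mean())                       # average education level and dimension scores
print(view.groupby('education').mean())  # dimension averages per education level
print(view.corr())                       # correlation between education and the dimensions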
education 2.394318
R 16.651624
I 23.994637
A 22.887701
S 26.079349
E 20.490080
C 19.105188
dtype: float64
This tells us that the average educational level of the 145,000+ respondents was 2.394318. You can also see that, on average, respondents rated themselves highest on Social, then Investigative, while Realistic was rated lowest.
The output of grouping by education and taking the mean is given here.
R I A S E C
education
1 15.951952 23.103728 21.696007 23.170792 19.897772 17.315641
2 16.775297 23.873645 22.379625 25.936032 20.864591 19.551138
3 16.774487 24.302158 23.634034 27.317784 20.468160 19.606312
4 16.814534 24.769829 24.347250 27.382699 20.038501 18.762395
Here you can see that those with less than high school rate themselves lower in all dimensions, while the highest educated rate themselves highest on Realistic, Artistic, and Social.
Does that mean the higher the education, the more social, artistic, or realistic you are?
The output of the correlation is given here.
education R I A S E C
education 1.000000 0.029008 0.057466 0.105946 0.168640 -0.006115 0.044363
R 0.029008 1.000000 0.303895 0.206085 0.109370 0.340535 0.489504
I 0.057466 0.303895 1.000000 0.334159 0.232608 0.080878 0.126554
A 0.105946 0.206085 0.334159 1.000000 0.350631 0.322099 0.056576
S 0.168640 0.109370 0.232608 0.350631 1.000000 0.411564 0.213413
E -0.006115 0.340535 0.080878 0.322099 0.411564 1.000000 0.526813
C 0.044363 0.489504 0.126554 0.056576 0.213413 0.526813 1.000000
As you can see, you should not conclude that. Take Social: it is only 0.168640 correlated to education, which is a very low correlation. The same holds for Realistic and Artistic: very low correlation.
Step 4: Visualize our findings
A way to visualize the data is by using the DataFrame’s great integration with Matplotlib.
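A minimal sketch of such a plot, assuming the view DataFrame from the sketch in Step 2 above:
import matplotlib.pyplot as plt
# Plot the average dimension scores per education level as grouped bars.
view.groupby('education').mean().plot.bar(figsize=(10, 4))
plt.ylabel('Average score')
plt.show()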
We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.
In this tutorial we will find some data points that are not correct and a potential way to deal with it.
Step 1: Explore the family sizes from the respondents
In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.
From previous tutorial.
In this part we will explore the dataset further. The dataset is available here.
import pandas as pd
# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)
Browsing through the data, I got curious about how family sizes vary around the world. This dataset obviously does not represent conclusive data on that, but it could be interesting to see if there is any connection between where you are located in the world and family size.
Step 2: Explore the distribution of family sizes
What often happens in a dataset is that there is inaccurate data.
To get a feeling of the data in the column familysize, you can explore it by running this.
import pandas as pd
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())
Resulting in the following from the describe output.
count 1.458280e+05
mean 1.255801e+05
std 1.612271e+07
min 0.000000e+00
25% 2.000000e+00
50% 3.000000e+00
75% 3.000000e+00
max 2.147484e+09
Name: familysize, dtype: float64
The mean family size is 125,580. Well, maybe we don’t count family size the same way, but something is clearly wrong there.
Grouping the data into bins (by using the cut function combined with value_counts) shows how the values are distributed.
The integer 2147483647 is interesting as it is the maximum 32-bit positive integer. I think it is safe to say that most family sizes given above 100 are not realistic.
Step 3: Clean the data
You need to make a decision on these data points that seem to skew your data in a wrong way.
Say you just decide to visualize it without any adjustment; it would give a misrepresentative picture.
Iceland? What’s up?
It seems like Iceland has a tradition for big families.
Interestingly, there is only one row that does not seem correct.
74745 999999999
But as there are only a few respondents from Iceland, this single outlier makes its average the highest.
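A quick way to inspect that, assuming Iceland’s two-letter country code in the dataset is 'IS':
# Look at the largest family sizes reported from Iceland.
print(data[data['country'] == 'IS']['familysize'].sort_values(ascending=False).head())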
To clean the data fully, we can decide that family sizes above 10 are not correct. I know that might be set a bit low, and you can choose a different threshold.
Cleaning the data is simple.
data = data[data['familysize'] < 10]
Magic, right? You simply write a condition that is vectorized over the rows, keeping only those that fulfill it.
import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]
country_mean = data.groupby(['alpha3']).mean()
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('familysize', figsize=(12,4), legend=True)
plt.show()
Resulting in the following output.
Family sizes of the respondents
Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it is very narrow in this aspect.
We will explore a dataset from the Holland Code (RIASEC) Test, which is a test that predicts careers and vocational choices from rated questions.
In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.
Reading the codebook (the file that comes with the dataset), you see it contains ratings of questions for the 6 RIASEC categories. Then there are 3 elapsed times for the test.
There are also ratings for the Ten Item Personality Inventory, then a self-assessment of whether they know 16 words. Finally, there is a list of metadata, like where the respondent’s network was located (which in most cases is an indicator of where the respondent was located).
Other metadata can be seen explained here.
education "How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban "What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender "What is your gender?", 1=Male, 2=Female, 3=Other
engnat "Is English your native language?", 1=Yes, 2=No
age "How many years old are you?"
hand "What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion "What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation "What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race "What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
voted "Have you voted in a national election in the past year?", 1=Yes, 2=No
married "What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize "Including you, how many children did your mother have?"
major "If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"
These values were also calculated for technical information:
uniqueNetworkLocation 1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
country The country of the network the user connected from
source 1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined
Step 2: Loading the data into a DataFrame (Pandas)
First step would be to load the data into a DataFrame. If you are new to Pandas DataFrame, we can recommend this tutorial.
import pandas as pd
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 150)
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)
The pd.set_option calls are only there to get a richer output, compared to a very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).
Notice that we have renamed the csv file to riasec.csv. As it is a tab-separated csv, we need to pass the delimiter as an argument, since it is not using the default comma.
The output from the above code is.
R1 R2 R3 R4 R5 ... uniqueNetworkLocation country source major Unnamed: 93
0 3 4 3 1 1 ... 1 US 2 NaN NaN
1 1 1 2 4 1 ... 1 US 1 Nursing NaN
2 2 1 1 1 1 ... 1 US 1 NaN NaN
3 3 1 1 2 2 ... 1 CN 0 NaN NaN
4 4 1 1 2 1 ... 1 PH 0 education NaN
... .. .. .. .. .. ... ... ... ... ... ...
145823 2 1 1 1 1 ... 1 US 1 Communication NaN
145824 1 1 1 1 1 ... 1 US 1 Biology NaN
145825 1 1 1 1 1 ... 1 US 2 NaN NaN
145826 3 4 4 5 2 ... 2 US 0 yes NaN
145827 2 4 1 4 2 ... 1 US 1 Information systems NaN
Interestingly, the dataset contains an unnamed last column with no data. That is because each line ends with a tab (\t) before the newline (\n).
We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.
Step 3: Count the occurrences of each country
As said, in this first tutorial on the dataset we are only interested in getting an idea of where in the world the respondents come from.
The data is located in the ‘country’ column of the DataFrame data.
To group the data, you can use groupby(), which will return a DataFrameGroupBy object. If you apply size() on that object, it will return a Series with the size of each group.
print(data.groupby(['country']).size())
Where the first few lines are.
country
AD 2
AE 507
AF 8
AG 7
AL 116
AM 10
Hence, for each country we will have a count of how many respondents came from that country.
Step 4: Understand the map data we want to merge it with
To visualize the data, we need some way to have a map.
Here GeoPandas comes in handy. It contains a nice low-res map of the world you can use.
Let’s just explore that.
import geopandas
import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world.plot()
plt.show()
Which will make the following map.
World map using GeoPandas and Matplotlib
This is too easy to be true? No, not really. This is the reality of Python.
We want to merge the data from our world map above with the counts for each country.
We need to see how to merge it. To do that let us look at the data from world.
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
print(world)
Where the first few lines are.
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Oceania Fiji FJI 8374.00 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 53950935 Africa Tanzania TZA 150600.00 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253 Africa W. Sahara ESH 906.50 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 35623680 North America Canada CAN 1674000.00 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 326625791 North America United States of America USA 18560000.00 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
The first problem arises here. In the other dataset we have 2-letter country codes, while this one uses 3-letter country codes.
Let’s add the 3-letter country code to our first dataset by using a lambda function. New to lambda functions? We recommend you read this tutorial.
import pandas as pd
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
Basically, we add a new column to the dataset called ‘alpha3’ with the three-letter country code. We use apply with a lambda function, which calls our helper function, which in turn calls the library.
The reason to do so is that the pycountry.countries lookup sometimes raises an exception, and we want our program to be robust to that.
Now the data contains a column with the countries as 3-letter codes, like world.
We can now merge the data together. Remember that the data we want to merge needs to be counted per ‘alpha3’, and we also want to convert it to a DataFrame (as size() returns a Series).
import geopandas
import pandas as pd
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
print(map)
The first few lines are given below.
pop_est continent name iso_a3 gdp_md_est geometry count \
0 920938 Oceania Fiji FJI 8374.00 MULTIPOLYGON (((180.00000 -16.06713, 180.00000... 12.0
1 53950935 Africa Tanzania TZA 150600.00 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... 9.0
2 603253 Africa W. Sahara ESH 906.50 POLYGON ((-8.66559 27.65643, -8.66512 27.58948... NaN
3 35623680 North America Canada CAN 1674000.00 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... 7256.0
4 326625791 North America United States of America USA 18560000.00 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... 80579.0
5 18556698 Asia Kazakhstan KAZ 460700.00 POLYGON ((87.35997 49.21498, 86.59878 48.54918... 46.0
Notice that some countries do not have a count. Those are countries with no respondents.
Step 6: Ready to plot a world map
Now to the hard part, right?
Making a colorful map indicating the number of respondents in a given country.
import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('count', figsize=(10,3), legend=True)
plt.show()
It is easy: just call plot(…) with the first argument being the column to use. I also changed the default figsize; you can play around with that. Finally, I added the legend.
The output
Not really satisfying. The problem is that all countries but the USA have almost identical colors. Looking at the data, you will see that this is because there are so many respondents in the USA that the other countries end up at the bottom of the scale.
What to do? Use a log-scale.
You can actually do that directly in your DataFrame. Using the NumPy library, we can calculate a logarithmic scale.
See the magic.
import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
country_count['log_count'] = np.log(country_count['count'])
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('log_count', figsize=(10,3), legend=True)
plt.show()
The new magic is adding the log_count column using np.log(country_count['count']).
Also notice that the plot is now done on ‘log_count’.
The final output.
Now you see more variety among the countries’ respondent counts. Note that the “white” countries did not have any respondents.
A key process in Data Science is to merge data from various sources. This can be challenging and often needs clarity. Here we will take a simple example and explain the differences between the ways to merge data using the pandas library’s DataFrame merge function.
The key ways to merge are inner, outer, left, and right.
In this example we are going to explore what correlates most to GDP per capita: yearly meat consumption, yearly beer consumption, or long-term unemployment.
What is your educated guess? (no cheating, the result is down below)
Step 1: The data we want to merge
That means we need to gather the specified data.
The GDP per capita can be found on wikipedia.org. As we are going to run much of the same code again and again, let’s make a helper function to get the data, index the correct table, and drop the columns we do not use in our analysis.
This can be done like this.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
print(table)
Which results in this output (or the first few lines of it).
We can identify that this is the middle of the three GDP estimates, the one based on World Bank data.
Then we need data from the other sources. Here we get it for long-term unemployment (long-term unemployment is defined as being unemployed for 1 year or more).
Strange, you might think? There are only 25 countries (counting from 0). Also, let’s look at the actual correlation between the columns, which is the output of the second print statement.
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
We also see that inner merge is the default. So what does inner merge do?
It means it will only merge on keys that exist in both DataFrames. Translated to our tables, the only remaining rows in the final merged table are the ones that exist in all 4 tables.
You can check that: it is the 25 countries listed there.
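A small toy example (not the tutorial’s data) that shows the difference between inner and outer merge:
import pandas as pd
left = pd.DataFrame({'Country': ['Denmark', 'Germany', 'France'], 'GDP': [60, 46, 40]})
right = pd.DataFrame({'Country': ['Denmark', 'Germany', 'Spain'], 'Beer': [59, 99, 84]})
# inner: only keys present in both frames (Denmark, Germany)
print(pd.merge(left, right, how='inner'))
# outer: all keys from both frames, with NaN where a value is missing
print(pd.merge(left, right, how='outer'))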
Step 4: Understand what we should do
What we are doing in the end is correlating to the GDP per capita. Hence, it only makes sense to keep the rows that have a GDP value.
If we used an outer merge, we would keep all combinations, which would not add any value to the calculations we want to do.
But let’s just try it and investigate the output.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)
table = pd.merge(table, table_join, how='outer')
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join, how='outer')
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join, how='outer')
print(table)
# Calculate the correlation
table_corr = table.corr()
# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))
First of all, this keeps all the rows. I will not put the full output here, only a few lines.
Country US$ Long-term unemployment rate Kg meat/person Liter beer/person
0 Monaco (2018) 185741.0 NaN NaN NaN
1 Liechtenstein (2017) 173356.0 NaN NaN NaN
2 Luxembourg 114705.0 1.60 141.7 NaN
222 United States of America NaN NaN 124.8 NaN
223 United States Virgin Islands NaN NaN 6.6 NaN
224 Venezuela NaN NaN 56.6 NaN
225 Taiwan NaN NaN NaN 23.2
As the sample lines above show, we get a row if just one of the columns is not NaN. Before, when we used inner, we would only get rows where all columns were not NaN.
This gives different correlation values than the previous example. Surprised? Not really. Now we have more data to correlate.
Step 5: Do the correct thing
If we inspect the code, we can see that we start by having the GDP table on the left side. This growing table is always kept on the left side. Hence, we should be able to merge with left. Notice that this should not affect the final result.
Let’s try it.
import pandas as pd
# This is simply used to display all the data and not get a small window of it
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 1000)
# This is a helper function, read the URL, get the right table, drop some columns
def read_table(url, table_number, drop_columns):
    tables = pd.read_html(url)
    table = tables[table_number]
    table = table.drop(drop_columns, axis=1)
    return table
# GDP per capita
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'
table = read_table(url, 3, ['Rank'])
table.rename(columns={'Country/Territory': 'Country'}, inplace=True)
# Long-term unemployement
url = 'https://en.wikipedia.org/wiki/List_of_OECD_countries_by_long-term_unemployment_rate'
table_join = read_table(url, 0, ['Long-term unemployment rate (2012)[1]'])
table_join.rename(columns={'Country/Territory': 'Country', 'Long-term unemployment rate (2016)[1]': 'Long-term unemployment rate'}, inplace=True)
index = 'Long-term unemployment rate'
table_join[index] = table_join[index].str[:-1].astype(float)
table = pd.merge(table, table_join, how='left')
# Meat consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_meat_consumption'
table_join = read_table(url, 1, ['Kg/person (2009)[10]'])
table_join.rename(columns={'Kg/person (2002)[9][note 1]': 'Kg meat/person'}, inplace=True)
table = pd.merge(table, table_join, how='left')
# Beer consumption
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita'
table_join = read_table(url, 2, ['2018change(litres per year)', 'Total nationalconsumption[a](million litresper year)', 'Year', 'Sources'])
table_join.rename(columns={'Consumptionper capita[1](litres per year)': 'Liter beer/person'}, inplace=True)
table = pd.merge(table, table_join, how='left')
print(table)
# Calculate the correlation
table_corr = table.corr()
# Print the correlation to GDP per capita (stored in US$).
print(table_corr['US$'].sort_values(ascending=False))
The row with US$ shows the full correlation to GDP per capita, which obviously has 100% (1.00000) correlation to GDP per capita, as it is the number itself.
The second row tells us that eating a lot of meat is highly correlated to GDP per capita. Does that then mean that a country should encourage all citizens to eat more meat to become richer? No, you cannot conclude that. It is probably the other way around. The richer a country is, the more meat they eat.
The last line tells us that long-term unemployment is negatively correlated to GDP per capita. That is not surprising: the more long-term unemployed people, the lower the GDP per capita. But it is not highly correlated, only approximately -25%.
Surprisingly, drinking a lot of beer seems to have a bigger positive association with GDP per capita than long-term unemployment has a negative one.
Notice that we use pd.set_option calls to get the full view. If you are new to read_html from the pandas library, we recommend you read this tutorial.
The top of the output will be as follows.
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail $514,405 $6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas $414,650 $5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas $396,556 $23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas $392,976 $2,270 1382401 China [8]
4 5 State Grid Electricity $387,056 $8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas $355,905 $110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas $303,738 $9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas $290,212 $20,840 71000 United States [12]
8 9 Volkswagen Automotive $278,341 $14,332 664496 Germany [13]
And the last lines.
Rank int64
Name object
Industry object
Revenue(USD millions) object
Profit(USD millions) object
Employees int64
Country object
Ref object
dtype: object
Here we see interesting information about the data type of each column. Not surprisingly, the Revenue and Profit columns are of type object (which are strings in this case).
Hence, if we want to sum up the values, we need to transform them to floats. This is a bit tricky, as the output above shows. An example is $6,670, where there are two issues in transforming it to a float. First, there is a dollar sign ($) at the beginning. Second, there is a comma (,) in the number, which a simple cast to float does not handle.
Now let’s deal with them, using a different method for each column.
Method 1: Using pandas DataFrame/Series vectorized string functions
Vectorization with pandas data structures is the process of executing operations on the entire data structure at once. This is handy, as the alternative would be to write a loop.
Pandas has many string functions available for vectorization, as you can see in the documentation.
First, we can access the string methods through .str and then apply the string function. In our case, we will use string slicing with square brackets to remove the dollar sign.
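A sketch of that step, assuming the scraped table is already loaded into a DataFrame named table and the column header matches the output below:
index = 'Revenue(USD millions)'
# Drop the leading dollar sign with vectorized string slicing.
table[index] = table[index].str[1:]
print(table)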
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514,405 $6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414,650 $5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396,556 $23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392,976 $2,270 1382401 China [8]
4 5 State Grid Electricity 387,056 $8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355,905 $110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303,738 $9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290,212 $20,840 71000 United States [12]
8 9 Volkswagen Automotive 278,341 $14,332 664496 Germany [13]
Then we need to remove the comma (,). This can be done by using replace.
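Continuing the sketch on the same column:
# Remove the thousands separator with the vectorized replace.
table[index] = table[index].str.replace(',', '')
print(table)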
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514405 $6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414650 $5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396556 $23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392976 $2,270 1382401 China [8]
4 5 State Grid Electricity 387056 $8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355905 $110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303738 $9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290212 $20,840 71000 United States [12]
8 9 Volkswagen Automotive 278341 $14,332 664496 Germany [13]
Finally, we need to convert the string to a float.
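Again as a sketch on the same column:
# Cast the cleaned strings to floats so we can sum them later.
table[index] = table[index].astype(float)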
Which does not change the printed output, but the type of the column.
Nice and easy to prepare the data in one line. Notice that you could choose to do it in multiple lines; it is a matter of taste.
Method 2: Using pandas DataFrame lambda function
Another way to prepare data is by using a lambda function. If you are new to lambda functions, we recommend you read this tutorial.
Here you can do it row by row and apply your defined lambda function.
The next column has the same challenge as the first one. So let’s apply it on that.
In this case, we cannot use slicing with square brackets like above, as some figures are negative and contain a minus sign before the dollar sign. But using the replace call will do fine.
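A sketch with apply and a lambda, assuming the column header matches the output below:
index = 'Profit(USD millions)'
# Remove the dollar sign from each row with a lambda.
table[index] = table.apply(lambda row: row[index].replace('$', ''), axis=1)
print(table)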
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514405.0 6,670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414650.0 5,845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396556.0 23,352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392976.0 2,270 1382401 China [8]
4 5 State Grid Electricity 387056.0 8,174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355905.0 110,974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303738.0 9,383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290212.0 20,840 71000 United States [12]
8 9 Volkswagen Automotive 278341.0 14,332 664496 Germany [13]
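To get from the output above to the one below, the comma is removed in the same way; a sketch:
# Remove the thousands separator with the same lambda approach.
table[index] = table.apply(lambda row: row[index].replace(',', ''), axis=1)
print(table)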
Rank Name Industry Revenue(USD millions) Profit(USD millions) Employees Country Ref
0 1 Walmart Retail 514405.0 6670 2200000 United States [5]
1 2 Sinopec Group Oil and gas 414650.0 5845 619151 China [6]
2 3 Royal Dutch Shell Oil and gas 396556.0 23352 81000 Netherlands / United Kingdom [7]
3 4 China National Petroleum Oil and gas 392976.0 2270 1382401 China [8]
4 5 State Grid Electricity 387056.0 8174 917717 China [9]
5 6 Saudi Aramco Oil and gas 355905.0 110974 76418 Saudi Arabia [10]
6 7 BP Oil and gas 303738.0 9383 73000 United Kingdom [11]
7 8 ExxonMobil Oil and gas 290212.0 20840 71000 United States [12]
8 9 Volkswagen Automotive 278341.0 14332 664496 Germany [13]
Finally, we will do the same for casting it to a float.
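A sketch of that cast, still using a lambda:
# Convert the cleaned strings to floats.
table[index] = table.apply(lambda row: float(row[index]), axis=1)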
To be honest, it is a matter of taste in this case. When things can be achieved by the simple string manipulations available through the vectorized calls, there is nothing to gain from lambda functions.
The strength of lambda functions is flexibility: you can run any function inside them, which is a big strength. The vectorized functions are limited to simpler operations, which still cover a lot of use cases.
Putting it all together
Well, now that we have come this far, let’s put it all together and get some nice data: sum it up per country, print it sorted, and make a horizontal bar plot.
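A sketch of how that could look, assuming table holds the cleaned data with the column names from the outputs above:
import matplotlib.pyplot as plt
# Sum revenue and profit per country, sort, and plot a horizontal bar chart.
table_sum = table.groupby('Country')[['Revenue(USD millions)', 'Profit(USD millions)']].sum()
table_sum = table_sum.sort_values('Revenue(USD millions)')
table_sum.plot.barh(figsize=(10, 6))
plt.show()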
When you work with data it is nice to use a library made for it. Here the Pandas library comes in handy, which is a powerful data analysis and manipulation tool.
Using the Pandas library, the data can be read into a DataFrame, which is the main data structure of the library. read_html returns a list of DataFrames, one per table on the URL given in the argument. If you are new to read_html, we recommend you read this tutorial.
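A sketch of the loading step; the exact Wikipedia URL and table index are assumptions based on the output below:
import pandas as pd
# Read all tables on the page and pick the main population table (the index may differ).
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
tables = pd.read_html(url)
table = tables[0]
print(table)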
Rank Country (or dependent territory) Population % of worldpopulation Date Source
0 1 China[b] 1403554760 NaN 16 Jul 2020 National population clock[3]
1 2 India 1364764203 NaN 16 Jul 2020 National population clock[4]
2 3 United States[d] 329963086 NaN 16 Jul 2020 National population clock[5]
3 4 Indonesia 269603400 NaN 1 Jul 2020 National annual projection[6]
4 5 Pakistan[e] 220892331 NaN 1 Jul 2020 UN Projection[2]
5 6 Brazil 211800078 NaN 16 Jul 2020 National population clock[7]
6 7 Nigeria 206139587 NaN 1 Jul 2020 UN Projection[2]
7 8 Bangladesh 168962650 NaN 16 Jul 2020 National population clock[8]
Step 2: Remove unnecessary columns from your data
A good second step is to remove columns you do not need. This can be done by a call to drop. As we only need the country names and populations, we can remove the rest of the columns.
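A sketch of the drop, with the column names taken from the output above:
# Keep only the country names and the populations.
table = table.drop(['Rank', '% of worldpopulation', 'Date', 'Source'], axis=1)
print(table)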
Country (or dependent territory) Population
0 China[b] 1403554760
1 India 1364764203
2 United States[d] 329963086
3 Indonesia 269603400
4 Pakistan[e] 220892331
5 Brazil 211800078
6 Nigeria 206139587
7 Bangladesh 168962650
This makes it easier to understand the data.
Another thing you can do is to rename the column Country (or dependent territory) to Country. This makes your code easier to write when you need to access that column of data.
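A sketch of the rename:
# Shorten the long column header to 'Country'.
table = table.rename(columns={'Country (or dependent territory)': 'Country'})
print(table)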
Country Population
0 China[b] 1403554760
1 India 1364764203
2 United States[d] 329963086
3 Indonesia 269603400
4 Pakistan[e] 220892331
5 Brazil 211800078
6 Nigeria 206139587
7 Bangladesh 168962650
Step 3: Cleaning the data
We see that the Country column can have two types of added information in the field. See the examples here.
Country Population
0 China[b] 1403554760
195 Jersey (UK) 107800
Either it can have square brackets with a letter (for example [b]), or a space and parentheses with a country (for example (UK)).
This can be cleaned with a lambda function. If you are new to lambda functions, we recommend you read this tutorial.
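A sketch of that cleaning, splitting away both kinds of added information:
# Strip footnote brackets (e.g. "China[b]") and parenthesized parents (e.g. "Jersey (UK)").
table['Country'] = table.apply(lambda row: row['Country'].split('[')[0].split(' (')[0], axis=1)
print(table)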
Country Population
0 China 1403554760
1 India 1364764203
2 United States 329963086
3 Indonesia 269603400
4 Pakistan 220892331
5 Brazil 211800078
6 Nigeria 206139587
7 Bangladesh 168962650
Finally, if you investigate the last line of the output,
241 World 7799525000
you see it is the sum of all the populations. This row is not a country and should be removed.
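A sketch of removing it, assuming the Country column has been cleaned as above:
# Drop the aggregate 'World' row so it does not skew later calculations.
table = table[table['Country'] != 'World']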
geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.
web: https://geopy.readthedocs.io/en/stable/
It is easy to use, but… it is slow.
When you run the code several times while debugging, you want to avoid waiting for 200+ lookups each time. Hence, I have created a small persistence layer to reuse already looked-up locations.
import numpy as np
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
import pickle
import os
class Locator:
    def __init__(self):
        self.pickle_name = "location_store.pickle"
        self.geo_locator = Nominatim(user_agent="LearnPython")
        self.location_store = {}
        if os.path.isfile(self.pickle_name):
            f = open(self.pickle_name, "rb")
            self.location_store = pickle.load(f)
            f.close()

    def get_location(self, location_name):
        if location_name in self.location_store:
            return self.location_store[location_name]
        try:
            location = self.geo_locator.geocode(location_name, language='en')
            self.location_store[location_name] = location
            f = open(self.pickle_name, 'wb')
            pickle.dump(self.location_store, f)
            f.close()
        except GeocoderTimedOut:
            location = None
        return location

    def get_latitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.latitude
        else:
            return np.nan

    def get_longitude(self, location_name):
        location = self.get_location(location_name)
        if location:
            return location.longitude
        else:
            return np.nan
What we want to do with this class is look up latitudes and longitudes and add them to our data source. As we run the code several times (or I did, at least), I got tired of waiting several long seconds (probably more than a minute) each time. To save you and the planet from wasteful seconds, I share this code with you.
And now we can use it for adding data to our DataFrame.
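A sketch of that step, assuming table is the population DataFrame from the previous steps:
# Look up each country once (results are cached by the Locator class).
locator = Locator()
table['Lat'] = table.apply(lambda row: locator.get_latitude(row['Country']), axis=1)
table['Lon'] = table.apply(lambda row: locator.get_longitude(row['Country']), axis=1)
print(table)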
Country Population Lat Lon
0 China 1403554760 35.000074 104.999927
1 India 1364764203 22.351115 78.667743
2 United States 329963086 39.783730 -100.445882
3 Indonesia 269603400 -2.483383 117.890285
4 Pakistan 220892331 30.330840 71.247499
5 Brazil 211800078 -10.333333 -53.200000
6 Nigeria 206139587 9.600036 7.999972
7 Bangladesh 168962650 24.476878 90.293243
There is actually one location which GeoPy does not recognize.
231 Saint Helena, Ascensionand Tristan da Cunha 5633 NaN NaN
Instead of doing the right thing for these 5,633 people, which also count in the world population, I did the wrong thing.
table = table.dropna()
This call to dropna() does what you think. It removes the rows containing NaN, like the one above.
With my deepest respect to the people in Saint Helena, I apologize for my incorrect behavior.
Step 5: Create the plots
Now this is smart: Pandas can do all the histogram work for you. What you really want is an accumulated histogram, also called a weighted histogram.
A plain histogram only counts occurrences. What we want is to add the population values together to see at which latitudes (and longitudes) people live.
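A sketch of such a weighted histogram with Matplotlib; the number of bins is an arbitrary choice:
import matplotlib.pyplot as plt
# Weight each latitude by the population living there, instead of just counting rows.
plt.hist(table['Lat'], bins=36, weights=table['Population'])
plt.xlabel('Latitude')
plt.ylabel('Population')
plt.show()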
How to collect data from an HTML table into a Pandas DataFrame.
The cleaning process and how to convert the data into the correct types.
Also, dealing with some data points that are not in the correct representation.
Finally, how to sum up by country.
Step 1: Collect the data from the table
Pandas is an amazing library with a lot of useful data analysis functionality right out of the box. The first step in any data analysis is to collect the data. In this tutorial we will collect the data from Wikipedia’s page on the List of metro systems.
If you are new to the pandas library, we recommend you read this tutorial.
The objective will be to find the sums of Stations, System length, and Annual ridership per country.
From wikipedia.org
At first glance this looks simple, but looking further down we see that some countries have several rows.
From wikipedia.org
Also, some rows do not have all the values needed.
First challenge first: read the data from the table into a DataFrame, which is the main data structure of the pandas library. The read_html call from pandas will return a list of DataFrames.
If you are using read_html for the first time, we recommend you read this tutorial.
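The read itself is only a few lines (the URL and table index match the full code later in this tutorial):
import pandas as pd
# Read all tables on the page; the metro systems overview is the first one.
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
print(table)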
Which results in the following output (or the top of it).
City Country Name Yearopened Year of lastexpansion Stations System length Annual ridership(millions)
0 Algiers Algeria Algiers Metro 2011[13] 2018[14] 19[14] 18.5 km (11.5 mi)[15] 45.3 (2019)[R 1]
1 Buenos Aires Argentina Buenos Aires Underground 1926[Nb 1] 2019[16] 90[17] 56.7 km (35.2 mi)[17] 337.7 (2018)[R 2]
2 Yerevan Armenia Yerevan Metro 1981[18] 1996[19] 10[18] 13.4 km (8.3 mi)[18] 18.7 (2018)[R 3]
3 Sydney Australia Sydney Metro 2019[20] – 13[20] 36 km (22 mi)[20][21] 14.2 (2019) [R 4][R Nb 1]
4 Vienna Austria Vienna U-Bahn 1976[22][Nb 2] 2017[23] 98[24] 83.3 km (51.8 mi)[22] 463.1 (2018)[R 6]
5 Baku Azerbaijan Baku Metro 1967[25] 2016[25] 25[25] 36.6 km (22.7 mi)[25] 231.0 (2018)[R 3]
We now have the data in a DataFrame.
Step 2: Clean and convert the data
At first glance, we see that we do not need the columns City, Name, Yearopened, and Year of lastexpansion. To make it easier to work with the data, let’s remove them and inspect the data again.
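The drop, as it also appears in the full code below:
# Remove the columns we do not need for the per-country sums.
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
print(table)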
Country Stations System length Annual ridership(millions)
0 Algeria 19[14] 18.5 km (11.5 mi)[15] 45.3 (2019)[R 1]
1 Argentina 90[17] 56.7 km (35.2 mi)[17] 337.7 (2018)[R 2]
2 Armenia 10[18] 13.4 km (8.3 mi)[18] 18.7 (2018)[R 3]
3 Australia 13[20] 36 km (22 mi)[20][21] 14.2 (2019) [R 4][R Nb 1]
4 Austria 98[24] 83.3 km (51.8 mi)[22] 463.1 (2018)[R 6]
5 Azerbaijan 25[25] 36.6 km (22.7 mi)[25] 231.0 (2018)[R 3]
6 Belarus 29[27] 37.3 km (23.2 mi)[27] 283.4 (2018)[R 3]
7 Belgium 59[28][Nb 5] 39.9 km (24.8 mi)[29] 165.3 (2019)[R 7]
This makes it easier to see the next steps.
Let’s take them one by one. For Stations, we need to remove the data after the ‘[’ symbol and convert the number to an integer. This can be done by applying a lambda function to each row.
If you are new to lambda functions, we recommend you read this tutorial.
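The conversion of Stations, as it also appears in the full code below:
# Keep only the number before the '[' footnote marker and convert it to an integer.
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)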
The next thing we need to do is convert the System length to floats. The length will be in km (I live in Denmark, where we use km and not mi). This can also be done with a lambda function.
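The conversion of System length, as it also appears in the full code below:
# The km value is the first token of the string; convert it to a float.
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)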
Finally, and a bit more tricky, we need to convert the Annual ridership column. The challenge is that some lines have n/a, which is converted to np.nan, but there are also lines where the input is not easy to convert, as the images show.
From wikipedia.org
These lines can be dealt with by using a helper function.
def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
Adding this all together we get the following code.
import pandas as pd
import numpy as np
def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
print(table)
Which results in the following output (or the first few lines).
Country Stations System length Annual ridership(millions)
0 Algeria 19 18.50 45.30
1 Argentina 90 56.70 337.70
2 Armenia 10 13.40 18.70
3 Australia 13 36.00 14.20
4 Austria 98 83.30 463.10
5 Azerbaijan 25 36.60 231.00
6 Belarus 29 37.30 283.40
7 Belgium 59 39.90 165.30
8 Brazil 19 28.10 58.40
9 Brazil 25 42.40 42.80
10 Brazil 22 43.80 51.70
Step 3: Sum rows by country
Say, now we want to get the country with the most metro stations. This can be achieved by using the groupby and sum function from the pandas DataFrame data structure.
import pandas as pd
import numpy as np
def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
# Sum up
table_sum = table.groupby(['Country']).sum()
print(table_sum.sort_values(['Stations'], ascending=False))
Where the result will be China.
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
United States 1005 1325.90 2771.50
South Korea 714 839.90 4054.90
Japan[Nb 34] 669 791.20 6489.60
India 499 675.97 1377.00
France 483 350.90 2113.50
Spain 438 474.40 1197.90
If we want to sort by System length in km, we only need to change the last line to the following.
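That is, sorting on the System length column instead of Stations:
print(table_sum.sort_values(['System length'], ascending=False))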
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
United States 1005 1325.90 2771.50
South Korea 714 839.90 4054.90
Japan[Nb 34] 669 791.20 6489.60
India 499 675.97 1377.00
Russia 368 611.50 3507.60
United Kingdom 390 523.90 1555.30
Finally, if you want it sorted by Annual ridership, you need to change the last line once more.
Remember, we assigned that column name to index.
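That is, sorting on the column name stored in index:
print(table_sum.sort_values([index], ascending=False))
You should get the following output.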
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
Japan[Nb 34] 669 791.20 6489.60
South Korea 714 839.90 4054.90
Russia 368 611.50 3507.60
United States 1005 1325.90 2771.50
France 483 350.90 2113.50
Brazil 243 345.40 2106.20
import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
print(table)
The call to read_html will return all the tables on the page in a list. By inspecting the result, you will notice that we are interested in tables 9, 12, and 15, which we merge. The output of the above will be.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)
regr = LinearRegression()
regr.fit(X, Y)
Y_pred = regr.predict(X)
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
Which will result in the following plot.
Linear regression model applied on data from wikipedia.org
Which shows that the model approximates a line through the 30 years of data to estimate the growth of the country’s GDP.
Notice that we use the cumulative product (cumprod) of pct_change to be able to compare the data across countries. If we used the data directly, it would not be possible to compare it.
We will do that for all countries to get a view of the growth. We are using the coefficient of the line, which indicates the growth rate.
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
coef = []
countries = []
for index, row in table.iterrows():
    #print(row)
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)
    regr = LinearRegression()
    regr.fit(X, Y)
    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])
data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])
print(data)
Which results in the following output (or the first few lines).
Country Coef
0 Afghanistan 0.161847
1 Albania 0.243493
2 Algeria 0.103907
3 Angola 0.423919
4 Antigua and Barbuda 0.087863
5 Argentina 0.090837
6 Armenia 4.699598
Step 3: Merge the data to a leaflet map using folium
The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.
import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
coef = []
countries = []
for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)
    regr = LinearRegression()
    regr.fit(X, Y)
    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])
data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])
# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America to United States to fit the naming in the table
world = world.replace('United States of America', 'United States')
# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])
# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])
# We have 10 colors available resulting into 9 cuts.
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])
print(table)
# Create a map
my_map = folium.Map()
# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')
There is a twist in the way it is done. Instead of using the raw coefficient to represent the growth rate on the map, we chose to put the countries into categories. The reason is that otherwise most countries would be grouped into a small segment of the scale.
Here we have used qcut to split them into equally sized groups.
This should result in an interactive html page looking something like this.