Visualize Inflation for 2019 using Pandas-datareader and GeoPandas

What will we cover in this tutorial?

In this tutorial we will visualize inflation on a world map. We will fetch the inflation data directly from the World Bank using Pandas-datareader and join it with data from GeoPandas, which provides a world map we can use to create a choropleth map.

The end result

Step 1: Retrieve the inflation data from World Bank

The Pandas-datareader has an interface to get data from World Bank. To find interesting data from World Bank you should explore data.worldbank.org, which contains various interesting indicators.

When you find one you want to use, like Inflation, consumer prices (annual %), you will see that you can download it as CSV, XML, or Excel. But we are not old-fashioned; we will use the API directly to get fresh data every time we run our program.

To use the API, we need the indicator, which you can find in the URL. In this case:

https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG

Hence, the indicator is FP.CPI.TOTL.ZG.

Using the Pandas-datareader API you can get the data by running the following piece of code.

from pandas_datareader import wb
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
print(data)

If you inspect the output, you will see it is structured a bit inconveniently.

                                                         FP.CPI.TOTL.ZG
country                                            year                
Arab World                                         2019        1.336016
Caribbean small states                             2019             NaN
Central Europe and the Baltics                     2019        2.664561
Early-demographic dividend                         2019        3.030587
East Asia & Pacific                                2019        1.773102
East Asia & Pacific (excluding high income)        2019        2.779172
East Asia & Pacific (IDA & IBRD countries)         2019        2.779172

It has a two-level index: country and year.

We want to reset index level 1 (the year), which turns year into a column. Then, for convenience, we rename the columns.

from pandas_datareader import wb
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
print(data)

Resulting in the following.

                                                    year  inflation
country                                                            
Arab World                                          2019   1.336016
Caribbean small states                              2019        NaN
Central Europe and the Baltics                      2019   2.664561
Early-demographic dividend                          2019   3.030587
East Asia & Pacific                                 2019   1.773102
East Asia & Pacific (excluding high income)         2019   2.779172
East Asia & Pacific (IDA & IBRD countries)          2019   2.779172
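The reset_index pattern can be tried in isolation on a toy two-level frame (the countries and numbers below are made up for illustration):

```python
import pandas as pd

# Toy frame with the same two-level index shape as the World Bank data.
idx = pd.MultiIndex.from_tuples(
    [('Denmark', '2019'), ('Sweden', '2019')], names=['country', 'year'])
df = pd.DataFrame({'FP.CPI.TOTL.ZG': [0.76, 1.78]}, index=idx)

df = df.reset_index(1)            # move the 'year' level out of the index
df.columns = ['year', 'inflation']
print(df.index.tolist())          # ['Denmark', 'Sweden']
print(df.columns.tolist())        # ['year', 'inflation']
```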

Step 2: Retrieve the world map data

The world map data ships with GeoPandas. (Note that in GeoPandas 1.0+ the bundled naturalearth_lowres dataset was removed; these examples assume an older GeoPandas where geopandas.datasets is still available.) At first glance everything is easy.

import geopandas
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
print(map)

Where I excluded Antarctica for visual purposes. Inspecting some of the output.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...
6      29748859                     Asia                Uzbekistan    UZB    202300.00  POLYGON ((55.96819 41.30864, 55.92892 44.99586...

It seems to be a good match to join the data on the name column.

To make it easy, we can make the name column index.

import geopandas
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')

Step 3: Joining the datasets

This is the fun part of Data Science. Why? I am glad you asked. Well, that was irony. The challenge will become apparent in a moment. There are various ways to deal with it, but in this tutorial we will use a simplistic approach.

Let us do the join.

from pandas_datareader import wb
import geopandas
import pandas as pd
pd.set_option('display.width', 3000)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 500)
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
map = map.join(data, how='outer')
print(map)

Where I use an outer join to make all the “challenges” visible.

Russia                                              1.422575e+08                   Europe    RUS   3745000.00  MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...   NaN        NaN
Russian Federation                                           NaN                      NaN    NaN          NaN                                               None  2019   4.470367
...
United States                                                NaN                      NaN    NaN          NaN                                               None  2019   1.812210
United States of America                            3.266258e+08            North America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...   NaN        NaN

Where I only took two snippets. The key thing here is that the map data from GeoPandas and the inflation data from World Bank use different names for some countries, so those rows are not joined.

Hence, we need to join United States with United States of America, and Russia with Russian Federation.

One approach would be to use a lookup service that maps country names to country codes, converting each dataset's country names to codes (note that GeoPandas already has 3-letter country codes, but some are missing, like Norway and others). This approach can still leave missing pieces, as some country names are not known by the mapping.
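Whichever approach you pick, you first need a list of the names that did not match. A minimal sketch of that check, using toy stand-in frames instead of the real datasets (the names below are just two of the known mismatches):

```python
import pandas as pd

# Toy stand-ins: map_df plays the GeoPandas frame, wb_df the World Bank frame.
map_df = pd.DataFrame({'pop_est': [328, 146]},
                      index=['United States of America', 'Russia'])
wb_df = pd.DataFrame({'inflation': [1.81, 4.47]},
                     index=['United States', 'Russian Federation'])

joined = map_df.join(wb_df, how='outer')

# Rows missing pop_est exist only on the World Bank side;
# rows missing inflation exist only on the map side.
only_wb = sorted(joined[joined['pop_est'].isna()].index)
only_map = sorted(joined[joined['inflation'].isna()].index)
print(only_wb)   # ['Russian Federation', 'United States']
print(only_map)  # ['Russia', 'United States of America']
```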

Another approach is to find all the rows that were not matched and rename them in one of the datasets. This can take some time, but I did most of them in the following.

from pandas_datareader import wb
import geopandas
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
index_change = {
    'United States of America': 'United States',
    'Yemen': 'Yemen, Rep.',
    'Venezuela': 'Venezuela, RB',
    'Syria': 'Syrian Arab Republic',
    'Solomon Is.': 'Solomon Islands',
    'Russia': 'Russian Federation',
    'Iran': 'Iran, Islamic Rep.',
    'Gambia': 'Gambia, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Egypt': 'Egypt, Arab Rep.'
}
map = map.rename(index=index_change)
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
map = map.join(data, how='outer')

Step 4: Making a Choropleth map based on our dataset

The simple plot of the data will not be very insightful. But let’s try that first.

import matplotlib.pyplot as plt
map.plot('inflation')
plt.title("Inflation 2019")
plt.show()

Resulting in the following.

The default result.

A good way to get inspiration is to check out the documentation with examples.

From the GeoPandas documentation

Where you see a cool color map with scheme='quantiles' (note that classification schemes require the mapclassify package to be installed). Let's try that.

map.plot('inflation', cmap='OrRd', scheme='quantiles')
plt.title("Inflation 2019")
plt.show()

Resulting in the following.

Closer

Adding a grey tone to the countries that were not matched, adding a legend, and setting the figure size. Then we are done. The full source code is here.

from pandas_datareader import wb
import geopandas
import pandas as pd
import matplotlib.pyplot as plt
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
index_change = {
    'United States of America': 'United States',
    'Yemen': 'Yemen, Rep.',
    'Venezuela': 'Venezuela, RB',
    'Syria': 'Syrian Arab Republic',
    'Solomon Is.': 'Solomon Islands',
    'Russia': 'Russian Federation',
    'Iran': 'Iran, Islamic Rep.',
    'Gambia': 'Gambia, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Egypt': 'Egypt, Arab Rep.'
}
map = map.rename(index=index_change)
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
map = map.join(data, how='outer')
map.plot('inflation', cmap='OrRd', scheme='quantiles', missing_kwds={"color": "lightgrey"}, legend=True, figsize=(14,5))
plt.title("Inflation 2019")
plt.show()

Resulting in the following output.

Inflation data from World Bank mapped on a choropleth map using GeoPandas and Matplotlib.

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test – Part II

What will we cover in this tutorial?

We will continue our journey to explore a big dataset of 145,000+ respondents to a RIASEC test. If you want to explore the full journey, we recommend you read this tutorial first.

In this tutorial we will find some data points that are not correct and a potential way to deal with them.

Step 1: Explore the family sizes from the respondents

In the first tutorial we looked at how the respondents were distributed around the world. Surprisingly, most countries were represented.

From previous tutorial.

In this we will explore the dataset further. The dataset is available here.

import pandas as pd
# Only to get a broader summary
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)

Which will output the following.

        R1  R2  R3  R4  R5  R6  R7  R8  I1  I2  I3  I4  I5  I6  I7  ...  gender  engnat  age  hand  religion  orientation  race  voted  married  familysize  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1   4   1   3   5   5   4   3   4   5   4  ...       1       1   14     1         7            1     1      2        1           1                      1       US       2                  NaN          NaN
1        1   1   2   4   1   2   2   1   5   5   5   4   4   4   4  ...       1       1   29     1         7            3     4      1        2           3                      1       US       1              Nursing          NaN
2        2   1   1   1   1   1   1   1   4   1   1   1   1   1   1  ...       2       1   23     1         7            1     4      2        1           1                      1       US       1                  NaN          NaN
3        3   1   1   2   2   2   2   2   4   1   2   4   3   2   3  ...       2       2   17     1         0            1     1      2        1           1                      1       CN       0                  NaN          NaN
4        4   1   1   2   1   1   1   2   5   5   5   3   5   5   5  ...       2       2   18     1         4            3     1      2        1           4                      1       PH       0            education          NaN

Scrolling through the output, I got curious about how family sizes vary around the world. This dataset obviously does not provide conclusive data on the matter, but it could be interesting to see if there is any connection between where you are located in the world and family size.

Step 2: Explore the distribution of family sizes

What often happens in a dataset is that some data points are inaccurate.

To get a feeling of the data in the column familysize, you can explore it by running this.

import pandas as pd

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data['familysize'].describe())
print(pd.cut(data['familysize'], bins=[0,1,2,3,4,5,6,7,10,100, 1000000000]).value_counts())

Resulting in the following from the describe output.

count    1.458280e+05
mean     1.255801e+05
std      1.612271e+07
min      0.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      3.000000e+00
max      2.147484e+09
Name: familysize, dtype: float64

Where the mean value of family size is 125,580. Well, maybe we don’t count family size the same way, but something is wrong there.
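The mean is this fragile because a single extreme value dominates it, while the median is unaffected. A tiny illustration:

```python
import pandas as pd

# One 32-bit-max outlier explodes the mean; the median barely notices.
s = pd.Series([2, 3, 3, 2147483647])
print(s.median())  # 3.0
print(s.mean())    # roughly 537 million
```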

Grouping the data into bins (by using the cut function combined with value_counts) you get this output.

(1, 2]               51664
(2, 3]               38653
(3, 4]               18729
(0, 1]               15901
(4, 5]                8265
(5, 6]                3932
(6, 7]                1928
(7, 10]               1904
(10, 100]              520
(100, 1000000000]       23
Name: familysize, dtype: int64

Which indicates 23 families of size greater than 100 in the bins. Note that values above the top bin edge, like 2147483647, fall outside the bins entirely and are not counted here. Let's investigate everything above 100.

print(data[data['familysize'] > 100]['familysize'])

Giving us this output.

1212      2147483647
3114      2147483647
5770      2147483647
8524             104
9701             103
21255     2147483647
24003            999
26247     2147483647
27782     2147483647
31451           9999
39294           9045
39298          84579
49033            900
54592            232
58773     2147483647
74745      999999999
78643            123
92457            999
95916            908
102680           666
109429           989
111488       9234785
120489          5000
120505     123456789
122580          5000
137141           394
139226          3425
140377           934
142870    2147483647
145686           377
145706           666
Name: familysize, dtype: int64

The integer 2147483647 is interesting as it is the maximum 32-bit positive integer. I think it is safe to say that most family sizes given above 100 are not realistic.
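You can verify that this value is the 32-bit signed integer maximum:

```python
import numpy as np

# 2147483647 == 2**31 - 1, the largest value a signed 32-bit integer can hold.
print(np.iinfo(np.int32).max)  # 2147483647
```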

Step 3: Clean the data

You need to make a decision about these data points, which clearly skew your data.

Say you just decide to visualize it without any adjustment; it would give a misrepresentative picture.

Iceland? What’s up?

It seems like Iceland has a tradition for big families.

Let’s investigate that.

print(data[data['country'] == 'IS']['familysize'])

Interestingly, it gives only one line that does not seem correct.

74745     999999999

But as there are only a few respondents from Iceland, that single outlier makes its average the highest.

To clean the data fully, we can decide that family sizes of 10 or more are not correct. I know that threshold might be a bit low, and you can choose to do something different.

Cleaning the data is simple.

data = data[data['familysize'] < 10]

Magic, right? You simply write a condition that is evaluated element-wise (vectorized), keeping only the rows that fulfill it.
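The same boolean-mask pattern on a tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'familysize': [2, 3, 999, 4, 2147483647]})
cleaned = df[df['familysize'] < 10]    # keep only rows where the mask is True
print(cleaned['familysize'].tolist())  # [2, 3, 4]
```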

Step 4: Visualize the data

We will use geopandas, matplotlib, and pycountry to visualize it. The process is similar to the previous tutorial, where you can find more details.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)

data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
data = data[data['familysize'] < 10]
country_mean = data.groupby(['alpha3']).mean()
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_mean, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('familysize', figsize=(12,4), legend=True)
plt.show()

Resulting in the following output.

Family sizes of the respondents

Looks like there is a one-child policy in China? Again, do not draw any conclusions from this data, as it is very limited in this respect.

Read the next part here:

Pandas: Explore Datasets by Visualization – Exploring the Holland Code (RIASEC) Test

What will we cover in this tutorial?

We will explore a dataset with the Holland Code (RIASEC) Test, which is a test that should predict careers and vocational choices by rating questions.

In this part of the exploration, we first focus on loading the data and visualizing where the respondents come from. The dataset contains more than 145,000 responses.

You can download the dataset here.

Step 1: First glance at the data

Let us first try to see what the data contains.

Reading the codebook (the file accompanying the dataset), you see it contains ratings of questions in the 6 RIASEC categories. Then there are 3 elapsed times for the test.

There are ratings for the Ten Item Personality Inventory, then a self-assessment of whether the respondent knows 16 words. Finally, there is metadata, like where the respondent's network was located (which in most cases is an indicator of where the respondent was located).

Other metadata can be seen explained here.

education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender				"What is your gender?", 1=Male, 2=Female, 3=Other
engnat				"Is English your native language?", 1=Yes, 2=No
age					"How many years old are you?"
hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race				"What is your race?", 1=Asian, 2=Arab, 3=Black, 4=Indigenous Australian / Native American / White, 5=Other (There was a coding error in the survey, and three different options were given the same value)
voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize			"Including you, how many children did your mother have?"		
major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"

These values were also calculated for technical information:
uniqueNetworkLocation	1 if the record is the only one from its network location in the dataset, 2 if there are more than one record. There can be more than one record from the same network if for example that network is shared by a school etc, or it may be because of test retakes
country	The country of the network the user connected from
source	1=from Google, 2=from an internal link on the website, 0=from any other website or could not be determined

Step 2: Loading the data into a DataFrame (Pandas)

The first step is to load the data into a DataFrame. If you are new to the Pandas DataFrame, we recommend this tutorial.

import pandas as pd

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 150)
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
print(data)

The pd.set_option calls are only there to get a richer output, compared to a very small and narrow summary. The actual loading of the data is done by pd.read_csv(…).

Notice that we have renamed the csv file to riasec.csv. As it is a tab-separated csv, we need to pass the delimiter as an argument, since it is not the default comma.

The output from the above code is.

        R1  R2  R3  R4  R5  ...  uniqueNetworkLocation  country  source                major  Unnamed: 93
0        3   4   3   1   1  ...                      1       US       2                  NaN          NaN
1        1   1   2   4   1  ...                      1       US       1              Nursing          NaN
2        2   1   1   1   1  ...                      1       US       1                  NaN          NaN
3        3   1   1   2   2  ...                      1       CN       0                  NaN          NaN
4        4   1   1   2   1  ...                      1       PH       0            education          NaN
...     ..  ..  ..  ..  ..  ...                    ...      ...     ...                  ...          ...
145823   2   1   1   1   1  ...                      1       US       1        Communication          NaN
145824   1   1   1   1   1  ...                      1       US       1              Biology          NaN
145825   1   1   1   1   1  ...                      1       US       2                  NaN          NaN
145826   3   4   4   5   2  ...                      2       US       0                  yes          NaN
145827   2   4   1   4   2  ...                      1       US       1  Information systems          NaN

Interestingly, the dataset contains an unnamed last column with no data. That is because it ends each line with a tab (\t) before new line (\n).

We could clean that up, but as we are only interested in the country counts, we will ignore it in this tutorial.
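If you want to see the effect in isolation, here is a minimal reproduction with an in-memory tab-separated string (the column names are just for illustration):

```python
import io
import pandas as pd

# Each line ends with a tab before the newline, giving an empty last field.
raw = "R1\tcountry\t\n3\tUS\t\n1\tUS\t\n"
df = pd.read_csv(io.StringIO(raw), delimiter='\t')
print(df.columns.tolist())            # ['R1', 'country', 'Unnamed: 2']
print(df['Unnamed: 2'].isna().all())  # True
```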

Step 3: Count the occurrences of each country

As mentioned, in this first tutorial on the dataset we only want to get an idea of where in the world the respondents come from.

The data is located in the ‘country’ column of the DataFrame data.

To group the data, you can use groupby(), which returns a DataFrameGroupBy object. If you apply size() on that object, it returns a Series with the size of each group.

print(data.groupby(['country']).size())

Where the first few lines are.

country
AD          2
AE        507
AF          8
AG          7
AL        116
AM         10

Hence, for each country we will have a count of how many respondents came from that country.

Step 4: Understand the map data we want to merge it with

To visualize the data, we need some way to have a map.

Here GeoPandas comes in handy. It contains a nice low-res map of the world you can use.

Let’s just explore that.

import geopandas
import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
world.plot()
plt.show()

Which will make the following map.

World map using GeoPandas and Matplotlib

This is too easy to be true. No, not really. This is the reality of Python.

We want to merge the data from our world map above with the counts for each country.

We need to see how to merge them. To do that, let us look at the data in world.

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
print(world)

Where the first few lines are.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

The first problem arises here. The other dataset uses 2-letter country codes, while this one uses 3-letter country codes.

Step 5: Solving the merging problem

Luckily we can use a library called PyCountry.

Let’s add this 3-letter country code to our first dataset by using a lambda function. New to lambda functions? We recommend you read this tutorial.

import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)

Basically, we add a new column called ‘alpha3’ to the dataset, holding the three-letter country code. We use apply, which takes a lambda that calls our helper function, which in turn calls the library.

The reason for the helper is that pycountry.countries.lookup sometimes raises a LookupError. We want our program to be robust to that.
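The fallback pattern in isolation, with a toy lookup table standing in for pycountry (the table entries are illustrative):

```python
# Toy lookup table standing in for pycountry (entries are illustrative).
CODES = {'United States': 'USA', 'Denmark': 'DNK'}

def lookup_country_code(country):
    try:
        return CODES[country]  # raises KeyError (a LookupError) if unknown
    except LookupError:
        return country         # fall back to the original value

print(lookup_country_code('Denmark'))  # DNK
print(lookup_country_code('Narnia'))   # Narnia
```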

Now the data contains a column with the countries in 3 letters, like world.

We can now merge the data together. Remember that the data we want to merge needs to be counted per ‘alpha3’, and we also want to convert it to a DataFrame (as size() returns a Series).

import geopandas
import pandas as pd
import pycountry

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
print(map)

The first few lines are given below.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry    count  \
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...     12.0   
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...      9.0   
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...      NaN   
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...   7256.0   
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  80579.0   
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...     46.0   

Notice that some countries do not have a count. Those are countries with no respondents.
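This is exactly what how='left' does: every map row is kept, and countries without a count get NaN. A toy illustration (with made-up numbers):

```python
import pandas as pd

world = pd.DataFrame({'iso_a3': ['FJI', 'ESH'],
                      'name': ['Fiji', 'W. Sahara']})
counts = pd.DataFrame({'count': [12]},
                      index=pd.Index(['FJI'], name='alpha3'))

# how='left' keeps every row of world; unmatched countries get NaN.
m = world.merge(counts, how='left', left_on='iso_a3', right_on='alpha3')
print(m['count'].tolist())  # [12.0, nan]
```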

Step 6: Ready to plot a world map

Now to the hard part, right?

Making a colorful map indicating the number of respondents in a given country.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('count', figsize=(10,3), legend=True)
plt.show()

It is easy. Just call plot(…) with the first argument being the column to use. I also changed the default figsize; you can play around with that. Finally, I added the legend.

The output

Not really satisfying. The problem is that all countries but the USA have almost identical colors. Looking at the data, you will see that there are so many respondents from the USA that all the other countries end up at the bottom of the scale.

What to do? Use a log-scale.
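To see why a log scale helps, compare some illustrative raw counts with their log-transformed values:

```python
import numpy as np

# A small country, a mid-size one, and a USA-scale outlier (illustrative).
counts = np.array([10, 100, 80000])
print(np.log(counts).round(1))  # roughly [2.3, 4.6, 11.3]
```

The three-orders-of-magnitude spread collapses into a narrow range, so the color map can show differences between the smaller countries too.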

You can actually do that directly in your DataFrame. By using the NumPy library, we can calculate a logarithmic scale.

See the magic.

import geopandas
import pandas as pd
import matplotlib.pyplot as plt
import pycountry
import numpy as np

# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country

data = pd.read_csv('riasec.csv', delimiter='\t', low_memory=False)
data['alpha3'] = data.apply(lambda row: lookup_country_code(row['country']), axis=1)
country_count = data.groupby(['alpha3']).size().to_frame()
country_count.columns = ['count']
country_count['log_count'] = np.log(country_count['count'])
world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
map = world.merge(country_count, how='left', left_on=['iso_a3'], right_on=['alpha3'])
map.plot('log_count', figsize=(10,3), legend=True)
plt.show()

Where the new magic is adding the log_count column using np.log(country_count['count']).

Also notice that the plot is now done on ‘log_count’.

The final output.

Now you see more variety among the countries’ respondents. Note that the “white” countries did not have any respondents.

Read the next exploration of the dataset here.
