Video Mosaic on Live Webcam Stream with OpenCV and Numba

What will we cover in this tutorial?

We will investigate whether we can create a decent video mosaic effect on a live webcam stream using OpenCV, Numba and Python. First we will learn the simple way to create a video mosaic and measure its performance. Then we will extend it to a better-quality video mosaic and try to recover performance by trading away some of that quality.

Step 1: How does a simple photo mosaic work?

A photographic mosaic is a photo built from many other small images. A black and white example is given here.

The above is not a perfect example, as it was generated with a focus on speed so that it runs smoothly on a webcam stream. It is also done in gray scale to improve performance.

The idea is to reproduce the original image (photograph) as a mosaic of many smaller images. In the example above the original frame is 640×480 pixels and the mosaic is built from small images of 16×12 pixels.

The first thing we want to achieve is to create a simple mosaic. A simple mosaic is when the original image is scaled down and each pixel is then exchanged with one small image with the same average color. This is simple and efficient to do.

On a high level this is the process.

  1. Have a collection C of small images used to create the photographic mosaic.
  2. Scale down the photo P you want to create a mosaic of.
  3. For each pixel in photo P, find the image I from C whose average color is closest to the pixel. Insert image I to represent that pixel.
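
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the collection C is a toy stack of three flat 2×2 tiles, the photo P is a tiny array that is already scaled down, and the names simple_mosaic, tiles and tile_means are made up for this sketch.

```python
import numpy as np


def simple_mosaic(photo, tiles, tile_means):
    """Replace every pixel of a scaled-down gray photo by the closest tile.

    photo: 2-D uint8 array (already scaled down).
    tiles: array of shape (n, tile_h, tile_w), dtype uint8.
    tile_means: 1-D float array with the mean gray value of each tile.
    """
    n, tile_h, tile_w = tiles.shape
    h, w = photo.shape
    out = np.empty((h * tile_h, w * tile_w), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            # Index of the tile whose mean is closest to this pixel
            k = int(np.argmin(np.abs(tile_means - float(photo[i, j]))))
            out[i * tile_h:(i + 1) * tile_h, j * tile_w:(j + 1) * tile_w] = tiles[k]
    return out


# Toy collection C: a dark, a mid-gray and a bright tile (hypothetical data)
tiles = np.stack([np.full((2, 2), v, dtype=np.uint8) for v in (0, 128, 255)])
tile_means = tiles.reshape(len(tiles), -1).mean(axis=1)

# Toy photo P, already scaled down to 2x2 pixels
photo = np.array([[10, 120], [250, 200]], dtype=np.uint8)
mosaic = simple_mosaic(photo, tiles, tile_means)
print(mosaic.shape)
```

Each pixel of the small photo becomes one tile, so the output is tile_h times taller and tile_w times wider than the scaled-down input.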

This explains the simple way of doing it. The next question is: will it be efficient enough to process a live webcam stream?

Step 2: Create a collection of small images

To optimize performance we have chosen to work in gray scale. The first step is to collect the images you want to use. These can be any pictures.

We have used photos from Pexels, which are all free for use without copyright.

What we need is to convert them all to gray scale and resize to fit our purpose.

import cv2
import glob
import os
import numpy as np

output = "small-pics-16x12"
path = "pics"
files = glob.glob(os.path.join(path, "*"))
for file_name in files:
    print(file_name)
    img = cv2.imread(file_name)
    img = cv2.resize(img, (16, 12))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    mean = np.mean(img)
    # Zero-pad the mean so that sorting the file names also sorts by brightness
    output_file_name = "image-" + "{:010.6f}".format(mean).replace('.', '-') + ".jpg"
    output_file_name = os.path.join(output, output_file_name)
    print(output_file_name)
    cv2.imwrite(output_file_name, img)

The script assumes that the images we want to convert to gray scale and resize are located in the local folder pics. Further, we assume that the output images (the processed images) will be put in an already existing folder small-pics-16x12.

Step 3: Get a live stream from the webcam

On a high level a live stream from a webcam is given in the following diagram.

This process framework is given in the code below.

import cv2
import numpy as np


def process(frame):
    return frame


def main():
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Update the frame
        updated_frame = process(gray)

        # Show the frame in a window
        cv2.imshow('WebCam', updated_frame)

        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break

    # When everything done, release the capture
    cap.release()
    cv2.destroyAllWindows()


main()

The above code is just an empty shell, where the call to process is where all the processing will happen. This code will just generate a window that shows a gray scale image.
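
One detail the shell glosses over is the first value returned by cap.read(), which is a success flag. If the camera delivers no frame, the frame is None and the following calls would fail. Below is a minimal sketch of a more defensive read loop. The helper read_frames and the FakeCapture class are hypothetical names for this illustration; FakeCapture only stands in for cv2.VideoCapture so the sketch runs without a camera.

```python
def read_frames(cap, max_failures=5):
    """Yield frames from any object with a VideoCapture-style read() method.

    Stops after max_failures consecutive failed reads.
    """
    failures = 0
    while failures < max_failures:
        ok, frame = cap.read()
        if not ok or frame is None:
            failures += 1
            continue
        failures = 0
        yield frame


class FakeCapture:
    """Stand-in for cv2.VideoCapture, used only to demonstrate the helper."""

    def __init__(self, results):
        self._results = iter(results)

    def read(self):
        # Once the scripted results run out, behave like a dead camera
        return next(self._results, (False, None))


cap = FakeCapture([(True, "frame-1"), (False, None), (True, "frame-2")])
frames = list(read_frames(cap, max_failures=3))
print(frames)
```

In the real loop you would pass the cv2.VideoCapture object instead and run the flip, resize and imshow calls on each yielded frame.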

Step 4: The simple video mosaic

We need to introduce two main things to create this simple video mosaic.

  1. Loading all the images we need to use (the 16×12 gray scale images).
  2. Fill out the processing of each frame, which replaces each 16×12 box of the frame with the best matching image.

The first step is preprocessing and should be done before we enter the main loop of the webcam capturing. The second part is done in each iteration inside the process function.

import cv2
import numpy as np
import glob
import os


def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)


def process(frame, images, box_height=12, box_width=16):
    height, width = frame.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame[i:i + box_height, j:j + box_width]
            mean = np.mean(roi[:, :])
            roi[:, :] = images[int((len(images)-1)*mean/256)]
    return frame


def main(images):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Update the frame
        mosaic_frame = process(gray, images)

        # Show the frame in a window
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)

        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break

    # When everything done, release the capture
    cap.release()
    cv2.destroyAllWindows()



images = preprocess()
main(images)

The preprocessing function reads all the images, converts them to gray scale (to have only 1 channel per pixel), and returns them as a NumPy array to have optimized code.

The process function breaks the image down into blocks of 16×12 pixels, computes the average gray scale value of each block, and picks the estimated best match. Notice that the average (mean) value is a float, hence we can make use of more than 256 gray scale images.
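
The lookup in process only works if the image stack is ordered from dark to bright, because the mean is turned directly into a position in that stack. A string sort of file names follows numeric order only when the numbers are zero-padded, so one way to make the assumption explicit is to sort in memory by the actual mean. The sketch below shows this with hypothetical helper names (sort_by_brightness, tile_index) and toy tiles; it is not part of the tutorial's code.

```python
import numpy as np


def sort_by_brightness(images):
    """Return the image stack ordered by mean gray value (dark to bright)."""
    means = images.reshape(images.shape[0], -1).mean(axis=1)
    return images[np.argsort(means, kind="stable")]


def tile_index(mean, n_images):
    """Same index formula as in process: map a mean in [0, 255] to a position."""
    return int((n_images - 1) * mean / 256)


# Four flat toy tiles in arbitrary order (hypothetical data)
tiles = np.stack([np.full((12, 16), v, dtype=np.uint8) for v in (200, 9, 103, 99)])
ordered = sort_by_brightness(tiles)
ordered_means = [float(t.mean()) for t in ordered]
print(ordered_means)
print(tile_index(128, 1885))
```

Note that the formula divides by 256 while the largest possible mean is 255, so the last few positions of a large stack are never selected.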

In this example we used 1,885 images.

A result can be seen here.

The result is decent but not good.

Step 5: Testing the performance and improve it by using Numba

The performance seems quite good, but let us measure it.

We do that by using the time library.

First you need to import the time library.

import time

Then measure the time the process call takes. The new code is inserted in the main while loop.

        # Update the frame
        start = time.time()
        mosaic_frame = process(gray, images)
        print("Process time", time.time()- start, "seconds")

This will result in the following output.

Process time 0.02651691436767578 seconds
Process time 0.026834964752197266 seconds
Process time 0.025418996810913086 seconds
Process time 0.02562689781188965 seconds
Process time 0.025369882583618164 seconds
Process time 0.025450944900512695 seconds

These are just a few lines of the output: about 0.025-0.027 seconds per frame.
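
Single measurements with time.time() can be noisy. As a hedged alternative, the sketch below averages several calls using time.perf_counter() and allows a warm-up call that is excluded from the measurement, which matters whenever the first call does extra one-time work. The helper name time_call is an assumption made for this illustration.

```python
import time


def time_call(fn, *args, repeats=20, warmup=1):
    """Return (mean seconds per call, last result) for fn(*args).

    repeats must be at least 1. Warm-up calls are not timed.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / repeats, result


# Demonstration on a cheap built-in call instead of the webcam pipeline
mean_seconds, value = time_call(sum, range(1000), repeats=50)
print("Mean time per call", mean_seconds, "seconds")
```

Keep in mind that process writes into the frame it receives, so repeated calls on the same array would operate on an already processed frame.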

Let’s bring Numba into the equation. Numba is a just-in-time compiler that works well with NumPy code. That means it compiles Python code to machine code for speed. If you are new to Numba we recommend you read this tutorial.

import cv2
import numpy as np
import glob
import os
import time
from numba import jit


def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)


@jit(nopython=True)
def process(frame, images, box_height=12, box_width=16):
    height, width = frame.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame[i:i + box_height, j:j + box_width]
            mean = np.mean(roi[:, :])
            roi[:, :] = images[int((len(images)-1)*mean/256)]
    return frame


def main(images):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Update the frame
        start = time.time()
        mosaic_frame = process(gray, images)
        print("Process time", time.time()- start, "seconds")

        # Show the frame in a window
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)

        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break

    # When everything done, release the capture
    cap.release()
    cv2.destroyAllWindows()



images = preprocess()
main(images)

This gives the following performance.

Process time 0.0014820098876953125 seconds
Process time 0.0013887882232666016 seconds
Process time 0.0015859603881835938 seconds
Process time 0.0016350746154785156 seconds
Process time 0.0018379688262939453 seconds
Process time 0.0016241073608398438 seconds

That is roughly a factor of 15-20 in speed improvement.

Good enough for live streaming. But the result is still not good.

Step 6: A more advanced video mosaic approach

The more advanced video mosaic approximates each box of pixels by the replacement image that matches it best, pixel by pixel.

import cv2
import numpy as np
import glob
import os
import time
from numba import jit


def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)


@jit(nopython=True)
def process(frame, images, box_height=12, box_width=16):
    height, width = frame.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame[i:i + box_height, j:j + box_width]
            best_match = np.inf
            best_match_index = 0
            # Compare against every image, including the one at index 0
            for k in range(images.shape[0]):
                total_sum = np.sum(np.where(roi > images[k], roi - images[k], images[k] - roi))
                if total_sum < best_match:
                    best_match = total_sum
                    best_match_index = k
            roi[:,:] = images[best_match_index]
    return frame


def main(images):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Update the frame
        start = time.time()
        mosaic_frame = process(gray, images)
        print("Process time", time.time()- start, "seconds")

        # Show the frame in a window
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)

        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break

    # When everything done, release the capture
    cap.release()
    cv2.destroyAllWindows()


images = preprocess()
main(images)

There is one line to notice specifically.

total_sum = np.sum(np.where(roi > images[k], roi - images[k], images[k] - roi))

This is needed because we work with unsigned 8-bit integers, where a plain subtraction would wrap around instead of going negative. The line computes the absolute difference between each pixel in the region of interest (roi) and the corresponding pixel in images[k], and sums these differences. This is a very expensive calculation, as we will see.
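
The wrap-around can be seen on a tiny example. The sketch below uses plain NumPy (without Numba) and made-up pixel values. It also shows an alternative way of getting the same sum by casting to a signed type first, which is a swapped-in technique for comparison and not what the tutorial code does.

```python
import numpy as np

# Made-up 2x2 region of interest and tile, both unsigned 8-bit
roi = np.array([[1, 200], [50, 0]], dtype=np.uint8)
tile = np.array([[2, 100], [50, 255]], dtype=np.uint8)

# Plain subtraction wraps modulo 256 for uint8 arrays
wrapped = roi - tile

# The tutorial's approach: always subtract the smaller from the larger value
sad_where = int(np.sum(np.where(roi > tile, roi - tile, tile - roi)))

# Alternative: cast to a signed type, subtract, take the absolute value
sad_int16 = int(np.sum(np.abs(roi.astype(np.int16) - tile.astype(np.int16))))

print(wrapped)
print(sad_where, sad_int16)
```

Both variants give the same sum of absolute differences, while the plain subtraction turns 1 - 2 into 255.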

Performance shows the following.

Process time 7.030380010604858 seconds
Process time 7.034134149551392 seconds
Process time 7.105709075927734 seconds
Process time 7.138839960098267 seconds

Over 7 seconds for each frame. The result is what can be expected with this number of images, but the performance is too slow for a smooth live webcam stream.

The result can be seen here.

Step 7: Compromise options

There are various options to compromise for speed and we will not investigate all. Here are some.

  • Use fewer images in our collection (fewer than 1,885). Notice that using half the images, say 900, only halves the processing time.
  • Use bigger image sizes, for example 32×24 images. There are then fewer boxes, but each box has more pixels to compare, so the amount of per-pixel work stays about the same and the speedup is likely to be small.
  • Make a compromised version of the difference calculation (total_sum). This has great potential, but might have undesired effects.
  • Scale down the pixel estimation to do fewer calculations.

We will try the last two.
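
A rough operation count helps to compare these options before trying them. The sketch below simply multiplies the number of boxes, the number of images and the number of pixels compared per box. The function name pixel_comparisons is an assumption for this illustration, and the count ignores all overhead, so it only indicates a trend and not actual run times.

```python
def pixel_comparisons(frame_w, frame_h, box_w, box_h, n_images, pixels_per_box=None):
    """Count pixel comparisons per frame for the brute-force matching."""
    boxes = (frame_w // box_w) * (frame_h // box_h)
    if pixels_per_box is None:
        pixels_per_box = box_w * box_h
    return boxes * n_images * pixels_per_box


full = pixel_comparisons(640, 480, 16, 12, 1885)
half_images = pixel_comparisons(640, 480, 16, 12, 942)
bigger_boxes = pixel_comparisons(640, 480, 32, 24, 1885)
four_pixels = pixel_comparisons(640, 480, 16, 12, 1885, pixels_per_box=4)

print(full, half_images, bigger_boxes, four_pixels)
```

In this simple model, halving the collection halves the count, bigger boxes leave it unchanged, and comparing only 4 pixels per box reduces it by a factor of 48.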

First, let’s try to exchange the calculation of total_sum, which is our distance function that measures how close our image is. Say, we use this.

                total_sum = np.sum(np.subtract(roi, images[k]))

This wraps around whenever a difference is negative: with unsigned 8-bit integers, 1 - 2 gives 255, which is undesired. On the other hand, it can be expected to happen in about 50% of the cases, and maybe it skews the calculation evenly for all images.

Let’s try.

Process time 1.857623815536499 seconds
Process time 1.7193729877471924 seconds
Process time 1.7445549964904785 seconds
Process time 1.707035779953003 seconds
Process time 1.6778359413146973 seconds

Wow. That is a speedup of roughly a factor of 4 per frame. The quality is still fine, but you will notice a poorly mapped image from time to time. Still, the result is close to the advanced video mosaic and far from the first simple video mosaic.
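
The occasional poor match can be reproduced on a toy case. The sketch below, in plain NumPy with made-up tiles, ranks three candidate tiles once with the wrapped subtraction and once with the true sum of absolute differences. It only illustrates the effect and says nothing about how often it happens on real frames.

```python
import numpy as np

# A flat region with gray value 10 and three flat candidate tiles
roi = np.full((2, 2), 10, dtype=np.uint8)
tiles = np.stack([
    np.full((2, 2), 10, dtype=np.uint8),  # identical to the region
    np.full((2, 2), 11, dtype=np.uint8),  # one gray level brighter
    np.full((2, 2), 0, dtype=np.uint8),   # ten gray levels darker
])

# Compromise distance: uint8 subtraction wraps, 10 - 11 becomes 255
wrapped = [int(np.sum(np.subtract(roi, t))) for t in tiles]

# Reference distance: sum of absolute differences on signed integers
true_sad = [int(np.sum(np.abs(roi.astype(np.int16) - t.astype(np.int16))))
            for t in tiles]

print("wrapped ", wrapped)
print("true SAD", true_sad)
```

The tile that is only one gray level brighter gets a far larger wrapped distance than the much darker tile, so slightly brighter candidates are penalized heavily.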

Another addition we could make is to estimate each box by only 4 pixels. This should still be better than the simple video mosaic approach. I have given the full code below.

import cv2
import numpy as np
import glob
import os
import time
from numba import jit


def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)


def preprocess2(images, scale_width=8, scale_height=6):
    scaled = []
    _, height, width = images.shape
    print("Dimensions", width, height)
    width //= scale_width
    height //= scale_height
    print("Scaled Dimensions", width, height)
    for i in range(images.shape[0]):
        scaled.append(cv2.resize(images[i], (width, height)))
    return np.stack(scaled)


@jit(nopython=True)
def process3(frame, frame_scaled, images, scaled, box_height=12, box_width=16, scale_width=8, scale_height=6):
    height, width = frame.shape
    width //= scale_width
    height //= scale_height
    box_width //= scale_width
    box_height //= scale_height
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame_scaled[i:i + box_height, j:j + box_width]
            best_match = np.inf
            best_match_index = 0
            # Compare against every image, including the one at index 0
            for k in range(scaled.shape[0]):
                total_sum = np.sum(roi - scaled[k])
                if total_sum < best_match:
                    best_match = total_sum
                    best_match_index = k
            frame[i*scale_height:(i + box_height)*scale_height, j*scale_width:(j + box_width)*scale_width] = images[best_match_index]
    return frame


def main(images, scaled):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Update the frame
        start = time.time()
        gray_scaled = cv2.resize(gray, (640//8, 480//6))
        mosaic_frame = process3(gray, gray_scaled, images, scaled)
        print("Process time", time.time()- start, "seconds")

        # Show the frame in a window
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)

        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break

    # When everything done, release the capture
    cap.release()
    cv2.destroyAllWindows()


images = preprocess()
scaled = preprocess2(images)
main(images, scaled)

Here a preprocessing step (preprocess2) has been added. The process time is now as follows.

Process time 0.5559628009796143 seconds
Process time 0.5979928970336914 seconds
Process time 0.5543379783630371 seconds
Process time 0.5621011257171631 seconds

Which is okay, but still less than 2 frames per second.

The result can be seen here.

It is not all bad. It is still better than the simple video mosaic approach.

The result is not perfect. If you want to use it on a live webcam stream with 25-30 frames per second, you need to find further optimizations or live with the simple video mosaic approach.
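
One direction for such further optimization, offered here only as an untested idea and not benchmarked, is to compare all boxes against all scaled-down tiles in a single NumPy broadcasting step instead of three nested loops. The function names match_boxes and assemble are hypothetical, and the sketch uses the true absolute difference rather than the wrapped subtraction.

```python
import numpy as np


def match_boxes(frame_scaled, scaled_tiles, box=2):
    """Return, for every box x box block, the index of the closest tile."""
    h, w = frame_scaled.shape
    # Rearrange the frame into one row of box*box pixels per block
    boxes = (frame_scaled.reshape(h // box, box, w // box, box)
             .transpose(0, 2, 1, 3)
             .reshape(-1, box * box)
             .astype(np.int16))
    tiles = scaled_tiles.reshape(scaled_tiles.shape[0], -1).astype(np.int16)
    # Distance of every block to every tile: shape (n_blocks, n_tiles)
    dist = np.abs(boxes[:, None, :] - tiles[None, :, :]).sum(axis=2)
    return dist.argmin(axis=1).reshape(h // box, w // box)


def assemble(index_map, images):
    """Build the mosaic by placing images[index] at every block position."""
    rows, cols = index_map.shape
    _, tile_h, tile_w = images.shape
    picked = images[index_map]  # shape (rows, cols, tile_h, tile_w)
    return picked.transpose(0, 2, 1, 3).reshape(rows * tile_h, cols * tile_w)


# Toy data: three flat 2x2 tiles and a 4x4 frame with four flat quadrants
tiles = np.stack([np.full((2, 2), v, dtype=np.uint8) for v in (0, 100, 200)])
frame_scaled = np.array([[10, 10, 90, 90],
                         [10, 10, 90, 90],
                         [210, 210, 160, 160],
                         [210, 210, 160, 160]], dtype=np.uint8)
index_map = match_boxes(frame_scaled, tiles)
mosaic = assemble(index_map, tiles)
print(index_map)
```

For 1,600 blocks and 1,885 tiles the intermediate array holds about 12 million 16-bit values, so memory use is a trade-off to keep an eye on.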

Performance comparison of Numba vs Vectorization vs Lambda function with NumPy

What will we cover in this tutorial?

We will continue our investigation of Numba from this tutorial.

Numba is a just-in-time compiler for Python that works amazingly with NumPy. As we saw in the last tutorial, the built-in vectorization can, depending on the case and the size of the instance, be faster than Numba.

Here we will explore that further and also see how Numba compares with lambda functions. Lambda functions have the advantage that they can be passed as an argument down to a library that can optimize the performance and not depend on slow Python code.

Step 1: Example of Vectorization slower than Numba

In the previous tutorial we only investigated an example where vectorization was faster than Numba. Here we will see that this is not always the case.

import numpy as np
from numba import jit
import time

size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000


@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c


def add_vectorized(a, b):
    return a + b


# We call the function once, to precompile the code
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

Varying the size of the NumPy array, we can see the performance between the two in the graph below.

It is clear that the vectorized approach is slower here.

Step 2: Try some more complex example comparing vectorized and Numba

An if-then-else can be expressed in vectorized form using the NumPy where function.

import numpy as np
from numba import jit
import time


size = 1000
x = np.random.rand(size, size)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c


def vectorized(a):
    return np.where(a < 0.5, 1, 0)


# We call the numba function to precompile it before we measure it
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

This results in the following comparison.

That is close, but the vectorized approach is a bit faster.
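
Before comparing timings it is worth checking that the two functions really compute the same thing. The sketch below does that with a plain Python loop (no Numba, so it runs anywhere) and hypothetical function names. It also shows a small difference: the loop version returns floats, while np.where with the integer constants 1 and 0 returns integers.

```python
import numpy as np


def threshold_loop(a):
    """Loop version: 1 where a < 0.5, else 0, stored in a float array."""
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c


def threshold_where(a):
    """Vectorized version using np.where."""
    return np.where(a < 0.5, 1, 0)


rng = np.random.default_rng(0)
a = rng.random((50, 50))
loop_result = threshold_loop(a)
where_result = threshold_where(a)
same_values = np.array_equal(loop_result, where_result)
print(same_values, loop_result.dtype, where_result.dtype)
```

If identical dtypes matter for a fair comparison, np.where(a < 0.5, 1.0, 0.0) returns floats as well.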

Step 3: Compare Numba with lambda functions

I am very curious about this. Lambda functions are controversial in Python, and many are not happy about them, as their syntax does not feel aligned with the rest of Python. On the other hand, lambda functions have the advantage that you can pass them down into a library that can optimize the for-loops.

import numpy as np
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
iterations = 1000


@jit(nopython=True)
def numba(a):
    c = np.zeros((size, size))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c


def lambda_run(a):
    # NumPy arrays have no .apply method, so the lambda is called on the
    # whole array and NumPy performs the element-wise loop internally.
    add_one = lambda x: x + 1
    return add_one(a)


# Call the numba function to precompile it before time measurement
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = lambda_run(x)
end = time.time()
print("Elapsed (lambda) = %s" % (end - start))

Resulting in the following performance comparison.

This is again tight, but the lambda approach is a bit faster.

Remember, this is a simple lambda function and we cannot conclude that lambda functions in general are faster than using Numba.
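
How a lambda is applied makes a big difference. Called on the whole array, the arithmetic inside it runs in NumPy's compiled loops. Wrapped in np.vectorize, it is called once per element from Python, which the NumPy documentation describes as a convenience and not a performance feature. The sketch below only checks that both give the same values; the variable names are made up for this illustration.

```python
import numpy as np

add_one = lambda v: v + 1

x = np.arange(6, dtype=np.float64).reshape(2, 3)

# The lambda receives the whole array; NumPy does the element-wise work
whole_array = add_one(x)

# The lambda is called once per element from Python
per_element = np.vectorize(add_one)(x)

print(whole_array)
print(np.array_equal(whole_array, per_element))
```

When timing lambda functions, it is worth stating which of the two styles is being measured.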

Conclusion

What we have learned since the last tutorial is that there are examples where simple vectorization is slower than Numba. This still leads to the conclusion that performance highly depends on the task. Further, the lambda function seems to give promising performance. Again, this should be compared to the slow approach of a Python for-loop without Numba's just-in-time compiled machine code.

When to use Numba with Python NumPy: Vectorization vs Numba

What will we cover in this tutorial?

You just want your code to run fast, right? Numba is a just-in-time compiler for Python that works amazingly with NumPy. Does that mean we should always use Numba?

Well, let’s try some examples out and learn. If you know about NumPy, you know you should use vectorization to get speed. Does Numba beat that?

Step 1: Let’s learn how Numba works

Numba compiles the Python code into machine code and runs it. What about the just-in-time part? It means that the first time you call the function you want turned into machine code, Numba compiles it and then runs it. On the next call, or any later call, it just runs it, as it is already compiled.

Let’s try that.

import numpy as np
from numba import jit
import time


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

Where you get.

Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125

Where you see a difference in runtime.

Did you get what happened in the code? If you put @jit(nopython=True) in front of a function, Numba will try to compile it and run it as machine code.

As you see above, the first call has an overhead in run time, because Numba first compiles the function and then runs it. The second time, it is already compiled and can run immediately.

Step 2: Compare Numba just-in-time code to native Python code

So let us compare how much you gain by using Numba just-in-time (@jit) in our code.

import numpy as np
from numba import jit
import time


def full_sum(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

Here we added a native Python function without the @jit in front and compare it with the one that has it.

Elapsed (No Numba) = 38.08543515205383
Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125

That is some difference. Also, we have plotted a few more runs in the graph below.

It seems pretty evident.

Step 3: Comparing it with Vectorization

If you don’t know what vectorization is, we can recommend this tutorial. The reason to have vectorization is to move the expensive for-loops into the function call to have optimized code run it.

That sounds a lot like what Numba can do. It can change the expensive for-loops into fast machine code.

But which one is faster?

Well, I think there are two parameters to try out. First, the size of the problem. Second, whether the number of iterations matters.

import numpy as np
from numba import jit
import time


@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum


def full_sum_vectorized(a):
    return a.sum()


iterations = 1000
size = 10000
x = np.random.rand(size, size)

start = time.time()
full_sum_vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

# First Numba call: includes the compilation time
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

# Second Numba call: runs the already compiled code
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))

As a function of the size.

It is interesting that Numba is faster for small problem sizes, while it seems like the vectorized approach outperforms Numba for bigger sizes.

And not surprisingly, the number of iterations only makes the difference bigger.

This is not surprising, as the code in a vectorized call can be more specifically optimized than the more general purpose Numba approach.

Conclusion

Does that mean that Numba does not pay off?

No, not at all. First of all, we have only tried it for one vectorized approach, which was obviously very easy to optimize. Secondly, not all loops can be turned into vectorized code. In general it is difficult to have a state in a vectorized approach. Hence, if you need to keep track of some internal state in a loop it can be difficult to find a vectorized approach.
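
As a small illustration of such a stateful loop, consider a running sum that is reset to zero whenever it exceeds a cap. Each output depends on whether a reset happened earlier, so a plain cumulative sum does not express it. The function name capped_running_sum and the rule itself are made up for this sketch; loops of this kind are the natural candidates for @jit.

```python
import numpy as np


def capped_running_sum(values, cap):
    """Running sum that resets to zero each time it exceeds cap."""
    out = np.empty(len(values))
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if total > cap:
            # The reset makes every later value depend on this decision
            total = 0.0
        out[i] = total
    return out


result = capped_running_sum([1, 2, 3, 4, 5], cap=5)
print(result)
```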

Master Markowitz Portfolio Optimization (Efficient Frontier) in Python using Pandas

What is Markowitz Portfolio Optimization (Efficient Frontier)?

The Efficient Frontier takes a portfolio of investments and optimizes the expected return with regard to the risk. That is, it finds the highest expected return for a given level of risk.

According to Investopedia, the return is based on the expected Compound Annual Growth Rate (CAGR) and the risk metric is the standard deviation of the return.

But what does all that mean? We will learn that in this tutorial.

Step 1: Get the time series of your stock portfolio

We will use the following portfolio of 4 stocks of Apple (AAPL), Microsoft (MSFT), IBM (IBM) and Nvidia (NVDA).

To get the time series we will use the Yahoo! Finance API through the Pandas-datareader.

We will look 5 years back.

import pandas_datareader as pdr
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta

years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)
close_price = pd.DataFrame()
tickers = ['AAPL','MSFT','IBM','NVDA']
for ticker in tickers:
  tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
  close_price[ticker] = tmp['Close']

print(close_price)

Resulting in the following output (or the first few lines).

                  AAPL        MSFT         IBM        NVDA
Date                                                      
2015-08-25  103.739998   40.470001  140.960007   20.280001
2015-08-26  109.690002   42.709999  146.699997   21.809999
2015-08-27  112.919998   43.900002  148.539993   22.629999
2015-08-28  113.290001   43.930000  147.979996   22.730000
2015-08-31  112.760002   43.520000  147.889999   22.480000

It will contain the full daily time series for the last 5 years from the current date.

Step 2: Calculate the CAGR, returns, and covariance

To calculate the expected return, we use the Compound Annual Growth Rate (CAGR) based on the last 5 years. The CAGR is used as Investopedia suggests. An alternative that is also used is the mean of the returns. The key thing is to have some common measure of the return.

The CAGR is calculated as follows.

CAGR = (end-price/start-price)^(1/years) - 1
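
As a worked example of the formula, the sketch below computes the CAGR for a price that doubles over 5 years. The prices are hypothetical round numbers chosen for illustration, and the function name cagr is an assumption of this sketch.

```python
def cagr(start_price, end_price, years):
    """Compound Annual Growth Rate: (end/start)^(1/years) - 1."""
    return (end_price / start_price) ** (1 / years) - 1


# Hypothetical prices: 100 at the start, 200 five years later
growth = cagr(100.0, 200.0, 5)
print(round(growth, 4))
```

Doubling in 5 years corresponds to a yearly growth of a bit under 15%, noticeably less than the 20% a simple division of 100% by 5 would suggest.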

We will also calculate the covariance, as we will use it to calculate the variance of a weighted portfolio. Remember that the standard deviation is given by the following.

sigma = sqrt(variance)

A portfolio is a vector w with the balances of each stock. For example, w = [0.2, 0.3, 0.4, 0.1] says that we have 20% in the first stock, 30% in the second, 40% in the third, and 10% in the final stock. It all sums up to 100%.

Given a weight w of the portfolio, you can calculate the variance of the portfolio by using the covariance matrix.

variance = w^T Cov w

Where Cov is the covariance matrix.

This results in the following pre-computations.

returns = close_price/close_price.shift(1)
cagr = (close_price.iloc[-1]/close_price.iloc[0])**(1/years) - 1
cov = returns.cov()

print(cagr)
print(cov)

Where you can see the output here.

# CAGR:
AAPL    0.371509
MSFT    0.394859
IBM    -0.022686
NVDA    0.905011
dtype: float64

# Covariance
          AAPL      MSFT       IBM      NVDA
AAPL  0.000340  0.000227  0.000152  0.000297
MSFT  0.000227  0.000303  0.000164  0.000306
IBM   0.000152  0.000164  0.000260  0.000210
NVDA  0.000297  0.000306  0.000210  0.000879
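
With the covariance matrix printed above, the formula variance = w^T Cov w can be evaluated directly for the example weights w = [0.2, 0.3, 0.4, 0.1]. The sketch below is a minimal illustration. Since the covariance is based on daily price ratios, the resulting sigma is a daily figure; scaling by the square root of 252 trading days is a common convention that is assumed here and is not used in the tutorial.

```python
import numpy as np

# Covariance matrix copied from the output above (AAPL, MSFT, IBM, NVDA)
cov = np.array([
    [0.000340, 0.000227, 0.000152, 0.000297],
    [0.000227, 0.000303, 0.000164, 0.000306],
    [0.000152, 0.000164, 0.000260, 0.000210],
    [0.000297, 0.000306, 0.000210, 0.000879],
])
w = np.array([0.2, 0.3, 0.4, 0.1])

# variance = w^T Cov w
variance = w @ cov @ w
sigma_daily = np.sqrt(variance)

# Assumption: 252 trading days per year and independent daily returns
sigma_annual = sigma_daily * np.sqrt(252)

print(variance, sigma_daily, sigma_annual)
```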

Step 3: Plot the return and risk

This is where the power of computing comes into the picture. The idea is to just try a random portfolio and see how it rates with regards to expected return and risk.

It is that simple. Make a random weighted distribution of your portfolio and plot the point of expected return (based on our CAGR) and the risk based on the standard deviation calculated by the covariance.

import matplotlib.pyplot as plt
import numpy as np

def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
  w = random_weights(len(tickers))
  exp_return.append(np.dot(w, cagr.T))
  sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))

plt.plot(sigma, exp_return, 'ro', alpha=0.1) 
plt.show()

We introduce a helper function random_weights, which returns a weighted portfolio. That is, it returns a vector with entries that sum up to one. This will give a way to distribute our portfolio of stocks.
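
Normalizing uniform random numbers works, but it tends to produce weights that cluster around an even split, so portfolios dominated by one stock are rare. A possible alternative, not used in the tutorial, is to draw the weights from a Dirichlet distribution with all parameters equal to 1, which spreads them evenly over all valid weight combinations. The function name random_weights_dirichlet is an assumption of this sketch.

```python
import numpy as np


def random_weights_dirichlet(n, rng):
    """Draw n non-negative weights that sum to one."""
    return rng.dirichlet(np.ones(n))


rng = np.random.default_rng(42)
w = random_weights_dirichlet(4, rng)
print(w, w.sum())
```

This may fill the corners of the return-risk plot better with the same number of samples.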

Then we iterate 20.000 times (could be any value, just want to have enough to plot our graph), where we make a random weight w, then calculate the expected return by the dot-product of w and cagr-transposed. This is done by using NumPy’s dot-product function.

What a dot-product of np.dot(w, cagr.T) does is to take elements pairwise from w and cagr and multiply them and sum up. The transpose is only about the orientation of it to make it work.

The standard deviation (assigned to sigma) is calculated similarly, as the square root of the formula given in the last step: variance = w^T Cov w (which has dot-products between).

This results in the following graph.

Returns vs risks

This shows a graph which outlines a parabola. The optimal values lie along the upper half of the parabola line. Hence, given a risk, the optimal portfolio is the one on the upper border of the filled parabola.
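One possible way to pick candidates near that upper border from the random samples is sketched below. It is only an approximation under my own assumptions: the risk axis is split into 20 equally wide bins (n_bins is an arbitrary choice), and the sampled portfolio with the highest expected return is kept in each bin. The CAGR and covariance values are copied from the output in Step 2, and the fixed seed is only there to make the run reproducible.

```python
import numpy as np

# Values copied from the printed output in Step 2
cagr = np.array([0.371509, 0.394859, -0.022686, 0.905011])
cov = np.array([
    [0.000340, 0.000227, 0.000152, 0.000297],
    [0.000227, 0.000303, 0.000164, 0.000306],
    [0.000152, 0.000164, 0.000260, 0.000210],
    [0.000297, 0.000306, 0.000210, 0.000879],
])

rng = np.random.default_rng(0)                # fixed seed, only for reproducibility
k = rng.random((20000, len(cagr)))
weights = k / k.sum(axis=1, keepdims=True)    # each row sums to 1

exp_return = weights @ cagr
# w^T Cov w for every row of weights at once
sigma = np.sqrt(np.einsum('ij,jk,ik->i', weights, cov, weights))

# Assumption of this sketch: 20 equally wide risk bins
n_bins = 20
edges = np.linspace(sigma.min(), sigma.max(), n_bins + 1)
bin_index = np.digitize(sigma, edges[1:-1])   # bin number 0..n_bins-1 per portfolio

frontier = []
for b in range(n_bins):
    members = np.flatnonzero(bin_index == b)
    if members.size > 0:
        best = members[np.argmax(exp_return[members])]
        frontier.append((sigma[best], exp_return[best], weights[best]))

for risk, ret, w in frontier[:3]:
    print(round(risk, 4), round(ret, 4), np.round(w, 2))
```

Each entry in frontier holds the risk, the expected return and the weights of the best sampled portfolio in its risk bin. With more samples or narrower bins the picked points follow the upper border more closely.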

Considerations

The Efficient Frontier gives you a way to balance your portfolio. The above code can by trial and error find such a portfolio, but it still leaves out some considerations.

How often should you re-balance? It has a cost to do that.

The theory behind it has some assumptions that may not hold in reality. As Investopedia points out, it assumes that asset returns follow a normal distribution, but in reality returns can be more than 3 standard deviations away from the mean. Also, the theory builds on the assumption that investors are rational in their investments, which most consider flawed, as more factors play into investment decisions.

The full source code

Below here you find the full source code from the tutorial.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd
from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt
import numpy as np


years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)
close_price = pd.DataFrame()
tickers = ['AAPL', 'MSFT', 'IBM', 'NVDA']
for ticker in tickers:
    tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
    close_price[ticker] = tmp['Close']

returns = close_price / close_price.shift(1)
cagr = (close_price.iloc[-1] / close_price.iloc[0]) ** (1 / years) - 1
cov = returns.cov()

def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    exp_return.append(np.dot(w, cagr.T))
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))

plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.show()

NumPy: Compute Mandelbrot set by Vectorization

What will we cover in this tutorial?

  • Understand what the Mandelbrot set is and why it is so fascinating.
  • Master how to make images in multiple colors of the Mandelbrot set.
  • How to implement it using NumPy vectorization.

Step 1: What is Mandelbrot?

The Mandelbrot set is the set of complex numbers c for which the function f(z) = z^2 + c does not diverge when iterated from z=0 (from wikipedia).

Take a complex number, c, then you calculate the sequence for N iterations:

z_(n+1) = z_n^2 + c for n = 0, 1, …, N-1, starting from z_0 = 0

If absolute(z_N) < 2, then it is said not to diverge and is part of the Mandelbrot set.
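A minimal sketch of this membership test for a single complex number, assuming the iteration of f(z) = z^2 + c from z = 0 described above. The function name in_mandelbrot and the default of 100 iterations are choices made for this illustration. With a finite number of iterations the answer is only an approximation for points close to the border of the set.

```python
def in_mandelbrot(c, max_iterations=100):
    """Return True if c has not escaped after max_iterations steps."""
    z = 0j
    for _ in range(max_iterations):
        z = z * z + c
        if abs(z) > 2:
            # Once the absolute value exceeds 2 the sequence diverges
            return False
    return True


print(in_mandelbrot(0))    # stays at 0 forever
print(in_mandelbrot(-1))   # alternates between -1 and 0
print(in_mandelbrot(1))    # 1, 2, 5, ... grows without bound
```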

The Mandelbrot set is a subset of the complex plane. An image of it is made by coloring each point according to whether it is part of the Mandelbrot set or not.

Mandelbrot set.

This only gives a black and white colored image of the complex plane, hence the images are often made more colorful by coloring each point by the iteration number in which it diverged. That is, if z_4 diverged for a point in the complex plane, then it will be given the color 4. That is how you end up with colorful maps like this.

Mandelbrot set (made by program from this tutorial).

Step 2: Understand the code of the non-vectorized approach to compute the Mandelbrot set

To better understand the images from the Mandelbrot set, think of the complex numbers as a diagram, where the real part of the complex number is x-axis and the imaginary part is y-axis (also called the Argand diagram).

Argand diagram

Then each point is a complex number c. That complex number will be given a color depending on which iteration it diverges (if it is not part of the Mandelbrot set).

Now the pseudocode for that should be easy to digest.

for x in [-2, 2] do:
  for y in [-1.5, 1.5] do:
    c = x + i*y
    z = 0
    N = 0
    while absolute(z) < 2 and N < MAX_ITERATIONS:
      z = z^2 + c
      N = N + 1
    set color for x,y to N

Simple enough to understand. That is some of the beauty of it. The simplicity.
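For reference, the pseudocode could be turned into plain Python along these lines. This is a sketch, not the version used for the images in this tutorial: the function name mandelbrot_loops, the default ranges and the small 7×7 grid are assumptions chosen to keep the example quick to run, and it assumes height and width are at least 2.

```python
def mandelbrot_loops(height, width, x_from=-2.0, x_to=1.0,
                     y_from=-1.5, y_to=1.5, max_iterations=50):
    """Non-vectorized Mandelbrot: one iteration count per grid point."""
    image = []
    for row in range(height):
        # Map the row number to the imaginary part
        y = y_from + (y_to - y_from) * row / (height - 1)
        line = []
        for col in range(width):
            # Map the column number to the real part
            x = x_from + (x_to - x_from) * col / (width - 1)
            c = complex(x, y)
            z = 0j
            n = 0
            while abs(z) < 2 and n < max_iterations:
                z = z * z + c
                n += 1
            line.append(n)
        image.append(line)
    return image


image = mandelbrot_loops(7, 7)
for line in image:
    print(line)
```

Points that never leave the circle of radius 2 end up with the value max_iterations, all other points get the number of the iteration in which they left it.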

Step 3: Make a vectorized version of the computations

Now that we understand the concepts behind it, we should translate that into a vectorized version. If you are new to vectorization we can recommend you read this tutorial first.

What do we achieve with vectorization? That we compute all the complex numbers simultaneously. To understand that inspect the initialization of all the points here.

import numpy as np

def mandelbrot(height, width, x_from=-2, x_to=1, y_from=-1.5, y_to=1.5, max_iterations=100):
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y

You see that we initialize all the x-coordinates at once using linspace. It will create an array with width evenly spaced numbers from x_from to x_to. The reshape is to fit the plane.

The same happens for y.

Then all the complex numbers are created in c = x + 1j*y, where 1j is the imaginary unit.
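To see why the two reshapes matter, here is a small sketch with a tiny 2×3 grid (the sizes are picked only for readability). Adding a row vector of shape (1, width) to a column vector of shape (height, 1) makes NumPy broadcast them into a full (height, width) grid of complex numbers.

```python
import numpy as np

width, height = 3, 2
x = np.linspace(-2, 1, width).reshape((1, width))         # shape (1, 3)
y = np.linspace(-1.5, 1.5, height).reshape((height, 1))   # shape (2, 1)

# Broadcasting combines every x with every y
c = x + 1j * y

print(x.shape, y.shape, c.shape)
print(c)
```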

This leads us to the full implementation.

There are two things we need to keep track of in order to make a colorful Mandelbrot set. First, in which iteration each point diverged. Second, to achieve that, we need to remember which points have already diverged.

import numpy as np
import matplotlib.pyplot as plt


def mandelbrot(height, width, x=-0.5, y=0, zoom=1, max_iterations=100):
    # To make navigation easier we calculate these values
    x_width = 1.5
    y_height = 1.5*height/width
    x_from = x - x_width/zoom
    x_to = x + x_width/zoom
    y_from = y - y_height/zoom
    y_to = y + y_height/zoom

    # Here the actual algorithm starts
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y

    # Initialize z to all zero
    z = np.zeros(c.shape, dtype=np.complex128)
    # To keep track in which iteration the point diverged
    div_time = np.zeros(z.shape, dtype=int)
    # To keep track of which points have not diverged so far
    m = np.full(c.shape, True, dtype=bool)

    for i in range(max_iterations):
        z[m] = z[m]**2 + c[m]

        diverged = np.greater(np.abs(z), 2, out=np.full(c.shape, False), where=m) # Find diverging

        div_time[diverged] = i      # set the value of the diverged iteration number
        m[np.abs(z) > 2] = False    # to remember which have diverged
    return div_time


# Default image of Mandelbrot set
plt.imshow(mandelbrot(800, 1000), cmap='magma')
# The image below of Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -0.75, 0.0, 2, 200), cmap='magma')
# The second image below of Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -1, 0.3, 20, 500), cmap='magma')
plt.show()

Notice that z[m] = z[m]**2 + c[m] only computes updates on values that are still not diverged.
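The effect of the mask is easier to follow on a handful of points. The sketch below runs the same kind of masked update on only three hand-picked values of c (0, 1 and -1) for 10 iterations. It uses the slightly simpler expression m & (np.abs(z) > 2) to find newly diverged points, which is my simplification and not the exact line from the code above.

```python
import numpy as np

c = np.array([0, 1, -1], dtype=np.complex128)
z = np.zeros_like(c)
div_time = np.zeros(c.shape, dtype=int)
m = np.ones(c.shape, dtype=bool)        # True = still being iterated

for i in range(10):
    z[m] = z[m]**2 + c[m]               # update only the active points
    diverged = m & (np.abs(z) > 2)      # active points that just escaped
    div_time[diverged] = i
    m[diverged] = False                 # stop updating them

print(div_time)
print(m)
```

The point c = 1 escapes in the iteration with index 2 and is frozen from then on, while c = 0 and c = -1 stay active through all iterations and keep the value 0 in div_time.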

I have added the two images from the commented-out calls above (the image from the call that is not commented out is shown in the previous step).

Mandelbrot set from the program above.
Mandelbrot set from the code above.
Also check out the tutorial on Julia sets.

NumPy: How does Sexual Compulsivity Scale Correlate with Men, Women, or Age?

Background

According to wikipedia, the Sexual Compulsivity Scale (SCS) is a psychometric measure of high libido, hypersexuality, and sexual addiction. While it does not say anything about the score itself, it is based on people rating 10 questions from 1 to 4.

The questions are the following.

Q1. My sexual appetite has gotten in the way of my relationships.
Q2. My sexual thoughts and behaviors are causing problems in my life.
Q3. My desires to have sex have disrupted my daily life.
Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.
Q5. I sometimes get so horny I could lose control.
Q6. I find myself thinking about sex while at work.
Q7. I feel that sexual thoughts and feelings are stronger than I am.
Q8. I have to struggle to control my sexual thoughts and behavior.
Q9. I think about sex more than I would like to.
Q10. It has been difficult for me to find sex partners who desire having sex as much as I want to.

The questions are rated as follows (1=Not at all like me, 2=Slightly like me, 3=Mainly like me, 4=Very much like me).

A dataset of more than 3,300 responses can be found here, which includes the individual rating of each question, the total score (the sum of ratings), age and gender.

Step 1: First inspection of the data.

Inspection of the data (CSV file)

The first question that pops into my mind is how men and women rate themselves differently. How can we efficiently figure that out?

Welcome to NumPy. It has a built-in csv reader that does all the hard work in the genfromtxt function.

import numpy as np

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')

# Skip first row as it has description
data = data[1:]

men = data[data[:,11] == 1]
women = data[data[:,11] == 2]

print("Men average", men.mean(axis=0))
print("Women average", women.mean(axis=0))

Dividing into men and women is easy with NumPy, as you can make a vectorized conditional inside the dataset. Men are coded with 1 and women with 2 in column 11 (the 12th column). Finally, a call to mean will do the rest.

Men average [ 2.30544662  2.2453159   2.23485839  1.92636166  2.17124183  3.06448802
  2.19346405  2.28496732  2.43660131  2.54204793 23.40479303  1.
 32.54074074]
Women average [ 2.30959164  2.18993352  2.19088319  1.95916429  2.38746439  3.13010446
  2.18518519  2.2991453   2.4985755   2.43969611 23.58974359  2.
 27.52611586]

Interestingly, according to this dataset (whose accuracy should be kept in mind, as 21% of the answers were not used), women score slightly higher on the SCS than men.

Men rate highest on the following question:

Q6. I find myself thinking about sex while at work.

While women rate highest on this question.

Q6. I find myself thinking about sex while at work.

The same. Also the lowest is the same for both genders.

Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.
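The boolean-mask split used in this step can be seen in isolation on a toy array. The numbers below are made up for the illustration and are not from the survey. The toy array has only two columns (score and gender), and ~is_man only selects the women because the toy data contains no other gender codes.

```python
import numpy as np

# Toy data, NOT from the survey: columns are [score, gender]
toy = np.array([
    [20, 1],
    [30, 2],
    [24, 1],
    [26, 2],
    [22, 1],
])

is_man = toy[:, 1] == 1     # boolean mask with one entry per row
men = toy[is_man]           # keeps the rows where the mask is True
women = toy[~is_man]        # keeps the rows where the mask is False

print(is_man)
print("Men average score:", men[:, 0].mean())
print("Women average score:", women[:, 0].mean())
```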

Step 2: Visualize age vs score

I would guess that the SCS score decreases with age. Let’s see if that is the case.

Again, NumPy can do the magic easily, that is, prepare the data. To visualize it we use matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python.

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')

# Skip first row as it has description
data = data[1:]

score = data[:,10]
age = data[:,12]
age[age > 100] = 0

plt.scatter(age, score, alpha=0.05)
plt.show()

Resulting in this plot.

Age vs SCS score.

It actually does not look like there is any correlation. Remember, there are more young people responding to the survey.

Let’s ask NumPy what it thinks about the correlation here. Luckily we can do that by calling the corrcoef function, which calculates the Pearson product-moment correlation coefficients.

print("Correlation of age and SCS score:", np.corrcoef(age, score))

Resulting in this output.

Correlation of age and SCS score:
[[1.         0.01046882]
 [0.01046882 1.        ]]

That suggests no correlation: a value between 0.0 and 0.3 is considered a small correlation, and 0.01046882 is close to none. Does that mean that the SCS score does not correlate with age? That our SCS score is static through life?

I do not think we can conclude that based on this small dataset.
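For readers who wonder what corrcoef computes, the sketch below calculates the Pearson coefficient by hand and compares it with NumPy's result. The ages and scores are hypothetical values invented for the example, not rows from the dataset.

```python
import numpy as np

x = np.array([18.0, 25.0, 31.0, 40.0, 52.0])   # hypothetical ages
y = np.array([25.0, 21.0, 24.0, 19.0, 22.0])   # hypothetical scores

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson: sum of products divided by the product of the deviation norms
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The diagonal of the matrix returned by corrcoef is always 1, since every variable correlates perfectly with itself, which is why only the off-diagonal value is of interest.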

Step 3: Bar plot the distribution of scores

In the graph we plotted, it also looked like there was a close to even distribution of scores.

Let’s try to see that. Here we need to count participants by score. Grouping is not NumPy’s strongest side, so let’s keep the good mood and use a plain old Python loop with lists.

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')

# Skip first row as it has description
data = data[1:]

scores = []
numbers = []
for i in range(10, 41):
    numbers.append(i)
    scores.append(data[data[:, 10] == i].shape[0])

plt.bar(numbers, scores)
plt.show()

Resulting in this bar plot.

Count participants by score.

We knew that the average score was around 23, which could indicate an even distribution. But the counts seem to be a little lower at the far high end of the SCS score.
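As a side note, the per-score counting in this step could also be done without the explicit loop, using np.unique or np.bincount. The sketch below uses a made-up list of scores rather than the survey data, and the variable names are chosen for the illustration.

```python
import numpy as np

# Toy scores, NOT from the survey
toy_scores = np.array([10, 12, 12, 23, 23, 23, 40])

# Distinct scores and how often each occurs (scores with no respondents are left out)
values, counts = np.unique(toy_scores, return_counts=True)
print(values)
print(counts)

# Counts for every score from 10 to 40, including zeros
counts_all = np.bincount(toy_scores, minlength=41)[10:]
print(counts_all)
```

The second variant gives one count for each of the 31 possible scores, which matches what the loop over range(10, 41) collects.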

For another great tutorial on NumPy check this one out, or learn some differences between NumPy and Pandas.

NumPy: Analyse Narcissistic Personality Indicator Numerical Dataset

What is Narcissistic Personality Indicator and how does it connect to NumPy?

NumPy is an amazing library that makes analyzing data easy, especially numerical data.

In this tutorial we are going to analyze a survey with 11,000+ respondents from an interactive Narcissistic Personality Indicator (NPI) test.

Narcissism in personality trait generally conceived of as excessive self love. In Greek mythology Narcissus was a man who fell in love with his reflection in a pool of water.

https://openpsychometrics.org/tests/NPI/

The only connection between NPI and NumPy is that we want to analyze the 11,000+ answers.

The dataset can be downloaded here, which consists of a comma separated file, or CSV file for short and a description.

Step 1: Import the dataset and explore it

NumPy has thought of it for us: loading the dataset (from the link above) is as simple as magic.

import numpy as np

# This magic line loads the 11,000+ lines of data to a ndarray
data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
print(data)

And we print a summary out.

[[ 18   2   2 ... 211   1  50]
 [  6   2   2 ... 149   1  40]
 [ 27   1   2 ... 168   1  28]
 ...
 [  6   1   2 ... 447   2  33]
 [ 12   2   2 ... 167   1  24]
 [ 18   1   2 ... 291   1  36]]

It is a good idea to investigate it in a spreadsheet as well.

Spreadsheet

And the far end.

Spreadsheet

Oh, that end.

Then investigate the description from the dataset. (Here we have some of it).

For questions 1=40 which choice they chose was recorded per the following key.
... [The questions Q1 ... Q40]
...
gender. Chosen from a drop down list (1=male, 2=female, 3=other; 0=none was chosen).
age. Entered as a free response. Ages below 14 have been ommited from the dataset.

-- CALCULATED VALUES --
elapse. (time submitted)-(time loaded) of the questions page in seconds.
score. = ((int) $_POST['Q1'] == 1)
... [How it is calculated]

That means we have the score, the answers to the questions, the elapsed time to answer, gender and age.

Reading a bit more, it says that a high score is an indicator of narcissistic traits, but one should not conclude from it that a person is a narcissist.

Step 2: Men or Women highest NPI?

I’m glad you asked.

import numpy as np

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]

print("Average score", npi_score.mean())
print("Men average", npi_score[data[:,42] == 1].mean())
print("Women average", npi_score[data[:,42] == 2].mean())
print("None average", npi_score[data[:,42] == 0].mean())
print("Other average", npi_score[data[:,42] == 3].mean())

Before looking at the result, see how nicely the first column of the data is sliced out to the view in npi_score. Then notice how easily you can calculate the mean based on conditional rules that narrow the view.

Average score 13.29965311749533
Men average 14.195953307392996
Women average 12.081829626521191
None average 11.916666666666666
Other average 14.85

I guess you guessed it. Men score higher.

Step 3: Is there a correlation between age and NPI score?

I wonder about that too.

How can we figure that out? Wait, let’s ask our new friend NumPy.

import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]
age = data[:,43]
# Some age values are not real, so we adjust them to 0
age[age>100] = 0

# Scatter plot them all with alpha=0.05
plt.scatter(age, npi_score, color='r', alpha=0.05)
plt.show()

Resulting in.

Plotting age vs NPI

That looks promising. But can we just conclude that younger people score higher NPI?

What if most respondents are young? Then that would make the picture more dense in the younger end (15-30). The danger with your eyes is making fast conclusions.

Luckily, NumPy can help us there as well.

print("Correlation of NPI score and age:")
print(np.corrcoef(npi_score, age))

Resulting in.

Correlation of NPI score and age:
[[ 1.         -0.23414633]
 [-0.23414633  1.        ]]

What does that mean? Well, looking at the documentation of np.corrcoef():

Return Pearson product-moment correlation coefficients.

https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html

It has a negative correlation, which means that the younger the higher NPI score. Values between 0.0 and -0.3 are considered low.
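One detail worth keeping in mind: the rows where the age was set to 0 still take part in the correlation. An alternative, sketched below on made-up numbers (not the survey data), is to leave those rows out with a boolean mask. The value 999 stands in for an unrealistic age entry, and whether this changes the result much on the real dataset is not something this sketch shows.

```python
import numpy as np

# Toy data, NOT from the survey; 999 stands in for an unrealistic age entry
age = np.array([20, 30, 40, 50, 999])
npi_score = np.array([20, 15, 10, 5, 12])

# Variant 1: overwrite unrealistic ages with 0 (as in the code above)
age_zeroed = age.copy()
age_zeroed[age_zeroed > 100] = 0
corr_zeroed = np.corrcoef(npi_score, age_zeroed)[0, 1]

# Variant 2: drop those rows from both arrays
valid = age <= 100
corr_valid = np.corrcoef(npi_score[valid], age[valid])[0, 1]

print("With ages set to 0:", corr_zeroed)
print("With rows removed: ", corr_valid)
```

In this toy example the four valid rows lie on a straight line, so removing the bad row gives a correlation of -1, while keeping it as age 0 weakens the measured correlation.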

Is the Pearson product-moment correlation the correct one to use?
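One commonly used alternative is the Spearman rank correlation, which is the Pearson correlation applied to the ranks of the values instead of the values themselves. It measures whether the relationship is monotonic rather than linear. The sketch below is a simplified version that assumes there are no tied values (real survey data has many ties, which need average ranks), and the helper name ranks is chosen for this illustration.

```python
import numpy as np

def ranks(values):
    # Rank of each element (0 = smallest); assumes no tied values
    return np.argsort(np.argsort(values))

# Toy data: y grows with x, but not along a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print("Pearson: ", pearson)
print("Spearman:", spearman)
```

Here Spearman reports a perfect monotonic relationship, while Pearson is lower because the relationship is not linear.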

Step 4: (Optional) Let’s try to see if there is a correlation between NPI score and time elapsed

Same code, different column.

import numpy as np
import matplotlib.pyplot as plt


data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]
elapse = data[:,41]
elapse[elapse > 2000] = 2000

# Scatter plot them all with alpha=0.05
plt.scatter(elapse, npi_score, color='r', alpha=0.05)
plt.show()

Resulting in.

Time elapsed in seconds and NPI score

Again, it is tempting to conclude something here. We need to remember that the mean value is around 13, hence, most data will be around there.

If we use the same calculation.

print("Correlation of NPI score and time elapse:")
print(np.corrcoef(npi_score, elapse))

Output.

Correlation of NPI score and time elapse:
[[1.        0.0147711]
 [0.0147711 1.       ]]

Hence, here there is close to no correlation.

Conclusion

Use the scientific tools to conclude. Do not rely on your eyes to determine whether there is a correlation.

The above gives an idea on how easy it is to work with numerical data in NumPy.

Pandas: How to Sum Groups from HTML Tables

What will we cover in this tutorial?

  • How to collect data from an HTML table into a Pandas DataFrame.
  • The cleaning process and how to convert the data into the correct type.
  • Also, dealing with some data points that are not in correct representation.
  • Finally, how to sum up by countries.

Step 1: Collect the data from the table

Pandas is an amazing library with a lot of useful data analysis functionality right out of the box. First step in any data analysis is to collect the data. In this tutorial we will collect the data from wikipedia’s page on List of metro systems.

If you are new to the pandas library we recommend you read this tutorial.

The objective will be to find the sums of Stations, Systems length, and Annual ridership per each country.

From wikipedia.org

At first glance this looks simple, but looking further down we see that some countries have several rows.

From wikipedia.org

Also, some rows do not have all the values needed.

First challenge first. Read the data from the table into a DataFrame, which is the main data structure of the pandas library. The read_html call from pandas will return a list of DataFrames.

If you use read_html for the first time, we recommend you read this tutorial.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
print(table)

Which results in the following output (or the top of it).

                 City               Country                                Name        Yearopened Year of lastexpansion             Stations                       System length             Annual ridership(millions)
0             Algiers               Algeria                       Algiers Metro          2011[13]              2018[14]               19[14]               18.5 km (11.5 mi)[15]                       45.3 (2019)[R 1]
1        Buenos Aires             Argentina            Buenos Aires Underground        1926[Nb 1]              2019[16]               90[17]               56.7 km (35.2 mi)[17]                      337.7 (2018)[R 2]
2             Yerevan               Armenia                       Yerevan Metro          1981[18]              1996[19]               10[18]                13.4 km (8.3 mi)[18]                       18.7 (2018)[R 3]
3              Sydney             Australia                        Sydney Metro          2019[20]                     –               13[20]               36 km (22 mi)[20][21]              14.2 (2019) [R 4][R Nb 1]
4              Vienna               Austria                       Vienna U-Bahn    1976[22][Nb 2]              2017[23]               98[24]               83.3 km (51.8 mi)[22]                      463.1 (2018)[R 6]
5                Baku            Azerbaijan                          Baku Metro          1967[25]              2016[25]               25[25]               36.6 km (22.7 mi)[25]                      231.0 (2018)[R 3]

We now have the data in a DataFrame.

Step 2: Clean and convert the data

At first glance, we see that we do not need the columns City, Name, Yearopened, Year of last expansion. To make it easier to work with the data, let’s remove them and inspect the data again.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
print(table)

Which results in the following output.

                  Country             Stations                       System length             Annual ridership(millions)
0                 Algeria               19[14]               18.5 km (11.5 mi)[15]                       45.3 (2019)[R 1]
1               Argentina               90[17]               56.7 km (35.2 mi)[17]                      337.7 (2018)[R 2]
2                 Armenia               10[18]                13.4 km (8.3 mi)[18]                       18.7 (2018)[R 3]
3               Australia               13[20]               36 km (22 mi)[20][21]              14.2 (2019) [R 4][R Nb 1]
4                 Austria               98[24]               83.3 km (51.8 mi)[22]                      463.1 (2018)[R 6]
5              Azerbaijan               25[25]               36.6 km (22.7 mi)[25]                      231.0 (2018)[R 3]
6                 Belarus               29[27]               37.3 km (23.2 mi)[27]                      283.4 (2018)[R 3]
7                 Belgium         59[28][Nb 5]               39.9 km (24.8 mi)[29]                      165.3 (2019)[R 7]

This makes it easier to see the next steps.

Let’s take them one by one. For Stations we need to remove the data after the '[' symbol and convert the number to an integer. This can be done by applying a lambda function to each row.

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)

If you are new to lambda functions we recommend you read this tutorial.

The next thing we need to do is to convert the System length to floats. The length will be in km (I live in Denmark, where we use km and not mi). This can also be done by using a lambda function.

table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
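To see what the two lambda functions do to a single cell, here is a small sketch in plain Python. The cell values are copied from the table output above, and the variable names are chosen for the illustration.

```python
# Cell values copied from the table output above
stations_cell = '19[14]'
length_cell = '18.5 km (11.5 mi)[15]'
belgium_stations_cell = '59[28][Nb 5]'

# Keep everything before the first '[' and convert to int
stations = int(stations_cell.split('[')[0])
belgium_stations = int(belgium_stations_cell.split('[')[0])

# Split on whitespace and keep the first piece, which is the length in km
length_km = float(length_cell.split()[0])

print(stations, belgium_stations, length_km)
```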

Finally, and a bit more tricky, we need to convert the column of Annual ridership. The challenge is that some lines have n/a, which is converted to np.nan, and there are also some lines where the input is not easy to convert, as the images show.

From wikipedia.org
From wikipedia.org

These lines can be dealt with by using a helper function.

def to_float(obj):
    try:
        return float(obj)
    except (TypeError, ValueError):
        return np.nan

index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
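The behavior of such a helper is easy to check on a few strings. The sketch below is a standalone variant that uses math.nan from the standard library instead of np.nan and catches only the two exceptions float() raises for bad input. The first sample cell is copied from the table output, while 'n/a' is a hypothetical stand-in for a cell that cannot be converted.

```python
import math

def to_float(obj):
    """Convert to float, or return NaN if the value cannot be converted."""
    try:
        return float(obj)
    except (TypeError, ValueError):
        return math.nan

# '45.3 (2019)[R 1]' is copied from the table; 'n/a' is a hypothetical bad cell
good = to_float('45.3 (2019)[R 1]'.split()[0])
bad = to_float('n/a'.split()[0])
missing = to_float(None)

print(good, bad, missing)
```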

Adding this all together we get the following code.

import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except (TypeError, ValueError):
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

print(table)

Which results in the following output (or the first few lines).

                  Country  Stations  System length  Annual ridership(millions)
0                 Algeria        19          18.50                       45.30
1               Argentina        90          56.70                      337.70
2                 Armenia        10          13.40                       18.70
3               Australia        13          36.00                       14.20
4                 Austria        98          83.30                      463.10
5              Azerbaijan        25          36.60                      231.00
6                 Belarus        29          37.30                      283.40
7                 Belgium        59          39.90                      165.30
8                  Brazil        19          28.10                       58.40
9                  Brazil        25          42.40                       42.80
10                 Brazil        22          43.80                       51.70

Step 3: Sum rows by country

Say, now we want to get the country with the most metro stations. This can be achieved by using the groupby and sum function from the pandas DataFrame data structure.

import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except (TypeError, ValueError):
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)

# Sum up
table_sum = table.groupby(['Country']).sum()

print(table_sum.sort_values(['Stations'], ascending=False))

Where the result will be China.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
United States             1005        1325.90                     2771.50
South Korea                714         839.90                     4054.90
Japan[Nb 34]               669         791.20                     6489.60
India                      499         675.97                     1377.00
France                     483         350.90                     2113.50
Spain                      438         474.40                     1197.90

If you want to sort by km of System length, you will only need to change the last line to the following.

print(table_sum.sort_values(['System length'], ascending=False))

Resulting in the following.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
United States             1005        1325.90                     2771.50
South Korea                714         839.90                     4054.90
Japan[Nb 34]               669         791.20                     6489.60
India                      499         675.97                     1377.00
Russia                     368         611.50                     3507.60
United Kingdom             390         523.90                     1555.30

Finally, if you want it by Annual ridership, you will need to change the last line to the following.

print(table_sum.sort_values([index], ascending=False))

Remember, we assigned that to index. You should get the following output.

                      Stations  System length  Annual ridership(millions)
Country                                                                  
China                     3738        6312.16                    25519.23
Japan[Nb 34]               669         791.20                     6489.60
South Korea                714         839.90                     4054.90
Russia                     368         611.50                     3507.60
United States             1005        1325.90                     2771.50
France                     483         350.90                     2113.50
Brazil                     243         345.40                     2106.20
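To see what groupby and sum do on a small scale, here is a sketch with a hand-made DataFrame. It uses the Algeria row and the three Brazil rows shown in the cleaned output of Step 2 (not the full table), and the names toy and toy_sum are chosen for the illustration.

```python
import pandas as pd

# A few rows copied from the cleaned output in Step 2
toy = pd.DataFrame({
    'Country': ['Algeria', 'Brazil', 'Brazil', 'Brazil'],
    'Stations': [19, 19, 25, 22],
    'System length': [18.5, 28.1, 42.4, 43.8],
})

# Rows with the same Country are collapsed into one row holding the sums
toy_sum = toy.groupby('Country').sum()

print(toy_sum)
print(toy_sum.sort_values('Stations', ascending=False))
```

After the groupby the country becomes the index of the result, which is why Country appears on its own line in the outputs above.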

Pandas and Folium: Categorize GDP Growth by Country and Visualize on Map in 3 Easy Steps

What will we cover in this tutorial?

  • We will gather data from wikipedia.org List of countries by past and projected GDP using pandas.
  • The first step will be to get the data and merge the correct tables together.
  • The next step is to use Machine Learning with a linear regression model to estimate the growth of each country’s GDP.
  • The final step is to visualize the growth rates on a leaflet map using folium.

Step 1: Get the data and merge it

The data is available on wikipedia on List of countries by past and projected GDP. We will focus on data from 1990 to 2019.

At first glance on the page you notice that the data is not gathered in one table.

From wikipedia.org

The first task will be to merge the three tables with the data from 1990-1999, 2000-2009, and 2010-2019.

The data can be collected by pandas read_html function. If you are new to this you can read this tutorial.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

print(table)

The call to read_html will return all the tables in a list. By inspecting the results you will notice that we are interested in tables 9, 12 and 15, which we merge. The output of the above will be.

     Country (or dependent territory)       1990       1991       1992       1993       1994       1995       1996       1997       1998       1999        2000        2001        2002        2003        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016        2017        2018        2019
0                         Afghanistan        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN         NaN      4367.0      4514.0      5146.0      6167.0      6925.0      8556.0     10297.0     12066.0     15325.0     17890.0     20296.0     20170.0     20352.0     19687.0     19454.0     20235.0     19585.0     19990.0
1                             Albania     2221.0     1333.0      843.0     1461.0     2361.0     2882.0     3200.0     2259.0     2560.0     3209.0      3483.0      3928.0      4348.0      5611.0      7185.0      8052.0      8905.0     10675.0     12901.0     12093.0     11938.0     12896.0     12323.0     12784.0     13238.0     11393.0     11865.0     13055.0     15202.0     15960.0
2                             Algeria    61892.0    46670.0    49217.0    50963.0    42426.0    42066.0    46941.0    48178.0    48188.0    48845.0     54749.0     54745.0     56761.0     67864.0     85327.0    103198.0    117027.0    134977.0    171001.0    137054.0    161207.0    199394.0    209005.0    209703.0    213518.0    164779.0    159049.0    167555.0    180441.0    183687.0
3                              Angola    11236.0    10891.0     8398.0     6095.0     4438.0     5539.0     6535.0     7675.0     6506.0     6153.0      9130.0      8936.0     12497.0     14189.0     19641.0     28234.0     41789.0     60449.0     84178.0     75492.0     82471.0    104116.0    115342.0    124912.0    126777.0    102962.0     95337.0    122124.0    107316.0     92191.0
4                 Antigua and Barbuda      459.0      482.0      499.0      535.0      589.0      577.0      634.0      681.0      728.0      766.0       825.0       796.0       810.0       850.0       912.0      1013.0      1147.0      1299.0      1358.0      1216.0      1146.0      1140.0      1214.0      1194.0      1273.0      1353.0      1460.0      1516.0      1626.0      1717.0
5                           Argentina   153205.0   205515.0   247987.0   256365.0   279150.0   280080.0   295120.0   317549.0   324242.0   307673.0    308491.0    291738.0    108731.0    138151.0    164922.0    199273.0    232892.0    287920.0    363545.0    334633.0    424728.0    527644.0    579666.0    611471.0    563614.0    631621.0    554107.0    642928.0    518092.0    477743.0
6                             Armenia        NaN        NaN      108.0      835.0      648.0     1287.0     1597.0     1639.0     1892.0     1845.0      1912.0      2118.0      2376.0      2807.0      3577.0      4900.0      6384.0      9206.0     11662.0      8648.0      9260.0     10142.0     10619.0     11121.0     11610.0     10529.0     10572.0     11537.0     12411.0     13105.0
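
To see what the two merge calls do without fetching the page, here is a minimal sketch on hand-made tables. The three small frames and their numbers are made up for illustration; only the name of the merge column is taken from the code above. The sketch uses on=, which is shorthand for left_on and right_on when the key has the same name in both tables.

```python
import pandas as pd

merge_index = 'Country (or dependent territory)'

# Three tiny stand-ins for tables[9], tables[12] and tables[15] (made-up numbers)
t1 = pd.DataFrame({merge_index: ['A', 'B'], '1999': [1.0, 2.0]})
t2 = pd.DataFrame({merge_index: ['A', 'B'], '2009': [3.0, 4.0]})
t3 = pd.DataFrame({merge_index: ['A'], '2019': [5.0]})

# A left merge keeps every row of the left table; missing matches become NaN
merged = t1.merge(t2, how='left', on=merge_index)
merged = merged.merge(t3, how='left', on=merge_index)

print(merged)
```

Country B has no row in the last table, so its 2019 value ends up as NaN, just like the early years for Afghanistan in the output above.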

Step 2: Use linear regression to estimate the growth over the last 30 years

In this section we will use Linear regression from the scikit-learn library, which is a simple prediction tool.

If you are new to Machine Learning we recommend you read this tutorial on Linear regression.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X, Y)

Y_pred = regr.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

Which will result in the following plot.

Linear regression model applied on data from wikipedia.org

The plot shows that the model fits a straight line through the 30 years of data to estimate the growth of the country’s GDP.

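Notice that we use the cumulative product (cumprod) of 1 + pct_change to make the data comparable. This expresses each year's GDP relative to the first year with data, so every country starts at 1.0. If we used the raw figures directly, it would not be possible to compare countries of different sizes.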
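
As a small sketch of that normalization, the series below uses made-up GDP figures (not taken from the table) and shows that the cumulative product of 1 + pct_change equals each value divided by the first value.

```python
import pandas as pd

# Made-up GDP figures for three years, growing 10% per year
gdp = pd.Series([100.0, 110.0, 121.0], index=['1990', '1991', '1992'])

# Same steps as in the tutorial: yearly change, then cumulative product
growth = (1 + gdp.pct_change()).cumprod().fillna(1.0)

# Direct way to get the same numbers: divide by the first value
relative = gdp / gdp.iloc[0]

print(growth)
print(relative)
```

Both series start at 1.0 and express later years as a multiple of the first year, which is what makes a small and a large economy comparable.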

We will do the same for all countries to get an overview of the growth. We use the coefficient (slope) of the fitted line as an indicator of the growth rate.

import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

print(data)

Which results in the following output (only the first few lines are shown).

                              Country      Coef
0                         Afghanistan  0.161847
1                             Albania  0.243493
2                             Algeria  0.103907
3                              Angola  0.423919
4                 Antigua and Barbuda  0.087863
5                           Argentina  0.090837
6                             Armenia  4.699598

Step 3: Merge the data to a leaflet map using folium

The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.

import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []

for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)

    regr = LinearRegression()
    regr.fit(X, Y)

    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])

data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

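# Read the geopandas dataset
# (geopandas.datasets was removed in geopandas 1.0; with newer versions, download the
# Natural Earth low resolution countries file and pass its path to read_file instead)
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America with United States to fit the naming in the table
world = world.replace('United States of America', 'United States')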

# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])


# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])

# The threshold scale below has 10 boundaries, resulting in 9 categories (cuts).
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

print(table)

# Create a map
my_map = folium.Map()

# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')

There is a twist in the way it is done. Instead of mapping the growth rate to colors on a linear scale, we chose to place the countries in categories. The reason is that otherwise most countries would end up in a small segment of the scale.

Here we have used qcut to place them in groups of equal size.
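
The difference between equal-sized groups and an equally spaced scale can be sketched on a few made-up coefficients with one large outlier. The numbers are invented for illustration, and pd.cut is used here only as a stand-in for a linear scale.

```python
import pandas as pd

# Made-up coefficients: most are small, one is a large outlier
coef = pd.Series([0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.20, 0.50, 4.70])

# qcut: bin edges are quantiles, so each category gets the same number of values
by_quantile = pd.qcut(coef, 3, labels=[0, 1, 2]).astype(int)
# cut: bin edges are equally spaced, so the outlier pushes almost everything into bin 0
by_width = pd.cut(coef, 3, labels=[0, 1, 2]).astype(int)

qcut_counts = [int((by_quantile == k).sum()) for k in range(3)]
cut_counts = [int((by_width == k).sum()) for k in range(3)]

print(qcut_counts)
print(cut_counts)
```

With qcut every category holds three values, while the equally spaced bins put eight of the nine values in the first category. On the map that would mean nearly all countries sharing one color.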

Running the script above should result in an interactive HTML page looking something like this.

End result.

From HTML Table Through Pandas to Leaflet Map in 5 Steps

What will we cover

  • You want to map data from an HTML table to an interactive map.
  • The data is not clean and it is difficult to map data to countries, as they are often named differently.

Step 1: Using Pandas to read the data

We will look at a table of data from wikipedia.org. In this example we will look at the average human height by country.

From wikipedia.org

Inspecting the first few rows you see a few issues already. There is some data missing and some countries are represented more than once.

To simplify our exercise we will only look at Average male height.

Let’s use pandas to read the content and inspect it. If you are new to pandas I can recommend this post.

To read the content you can use the read_html(url) call from the pandas library. You need to install lxml as well; see this post for details.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]

print(table[:20])

Which will result in the following output.

            Country/Region        Average male height  ...       Year    Source
0                  Albania   174.0 cm (5 ft 8 1⁄2 in)  ...  2008–2009  [11][12]
1                Argentina                        NaN  ...  2004–2005      [13]
2                Argentina  174.46 cm (5 ft 8 1⁄2 in)  ...  1998–2001      [14]
3                  Armenia                        NaN  ...       2005      [15]
4                Australia       175.6 cm (5 ft 9 in)  ...  2011–2012      [16]
5                  Austria    179 cm (5 ft 10 1⁄2 in)  ...       2006      [17]
6               Azerbaijan   171.8 cm (5 ft 7 1⁄2 in)  ...       2005      [18]
7                  Bahrain       165.1 cm (5 ft 5 in)  ...       2002      [19]
8                  Bahrain   171.0 cm (5 ft 7 1⁄2 in)  ...       2009  [20][21]
9               Bangladesh                        NaN  ...       2007      [15]
10          Country/Region        Average male height  ...       Year    Source
11                 Belgium  178.6 cm (5 ft 10 1⁄2 in)  ...       2001      [22]
12                   Benin                        NaN  ...       2006      [15]
13                 Bolivia                        NaN  ...       2003      [15]
14                 Bolivia       160.0 cm (5 ft 3 in)  ...       1970      [23]
15  Bosnia and Herzegovina       183.9 cm (6 ft 0 in)  ...       2014      [24]
16                  Brazil       170.7 cm (5 ft 7 in)  ...       2009  [25][26]
17          Brazil – Urban   173.5 cm (5 ft 8 1⁄2 in)  ...       2009      [25]
18          Brazil – Rural   170.9 cm (5 ft 7 1⁄2 in)  ...       2009      [25]
19                Bulgaria       175.2 cm (5 ft 9 in)  ...       2010      [27]

By inspecting line 10 you see a line of input that needs to be cleaned.

Step 2: Some basic cleaning of the data

By inspecting the data you see that roughly every 10 lines a line repeats the column names.

From wikipedia.org

While this is practical if you inspect the data as a user, it is annoying for us when we want to use the raw data.

Luckily this is easy to clean up using pandas.

import pandas as pd

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]

# To avoid writing it all the time
AVG_MH = 'Average male height'
# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()

print(table[:20])

Where you can see that the repeated header rows have been removed.

            Country/Region        Average male height  ...       Year    Source
0                  Albania   174.0 cm (5 ft 8 1⁄2 in)  ...  2008–2009  [11][12]
1                Argentina                        NaN  ...  2004–2005      [13]
2                Argentina  174.46 cm (5 ft 8 1⁄2 in)  ...  1998–2001      [14]
3                  Armenia                        NaN  ...       2005      [15]
4                Australia       175.6 cm (5 ft 9 in)  ...  2011–2012      [16]
5                  Austria    179 cm (5 ft 10 1⁄2 in)  ...       2006      [17]
6               Azerbaijan   171.8 cm (5 ft 7 1⁄2 in)  ...       2005      [18]
7                  Bahrain       165.1 cm (5 ft 5 in)  ...       2002      [19]
8                  Bahrain   171.0 cm (5 ft 7 1⁄2 in)  ...       2009  [20][21]
9               Bangladesh                        NaN  ...       2007      [15]
11                 Belgium  178.6 cm (5 ft 10 1⁄2 in)  ...       2001      [22]
12                   Benin                        NaN  ...       2006      [15]
13                 Bolivia                        NaN  ...       2003      [15]
14                 Bolivia       160.0 cm (5 ft 3 in)  ...       1970      [23]
15  Bosnia and Herzegovina       183.9 cm (6 ft 0 in)  ...       2014      [24]
16                  Brazil       170.7 cm (5 ft 7 in)  ...       2009  [25][26]
17          Brazil – Urban   173.5 cm (5 ft 8 1⁄2 in)  ...       2009      [25]
18          Brazil – Rural   170.9 cm (5 ft 7 1⁄2 in)  ...       2009      [25]
19                Bulgaria       175.2 cm (5 ft 9 in)  ...       2010      [27]
20            Burkina Faso                        NaN  ...       2003      [15]
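
The filter can also be tried on a small hand-made table without fetching the page. The rows below are a made-up stand-in for the scraped data, with one row that repeats the column names.

```python
import pandas as pd

AVG_MH = 'Average male height'

# Hand-made stand-in for the scraped table; row 2 repeats the column names
table = pd.DataFrame({
    'Country/Region': ['Albania', 'Argentina', 'Country/Region', 'Belgium'],
    AVG_MH: ['174.0 cm (5 ft 8 1⁄2 in)', None, AVG_MH, '178.6 cm (5 ft 10 1⁄2 in)'],
})

# Keep only rows whose value differs from the column name itself
cleaned = table.loc[table[AVG_MH] != AVG_MH].copy()

print(cleaned)
```

Rows with missing heights are kept, since a missing value is not equal to the column name. The index keeps its gap (0, 1, 3), the same way index 10 is missing in the output above.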

Step 3: Convert data to floats

Inspecting the data that we need (Average male height), it is represented as a string with both the cm and the ft/in figure. I live in Denmark, where we use the metric system, and I have never really understood any benefit of the US customary units (feel free to enlighten me).

Hence, we want to convert the strings in the column Average male height to a float representing the height in cm.

Notice that some values are NaN, while the rest have the height in cm as the first number.

We can exploit that and convert it with a lambda function. If you are new to lambda functions you can see this tutorial.

import pandas as pd
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]

# To avoid writing it all the time
AVG_MH = 'Average male height'
AMH_F = 'Aveage male height (float)'

# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()

# Clean up data to have height in cm
# (pd.notna is a more robust check for missing values than comparing with 'is not np.nan')
table[AMH_F] = table.apply(lambda row: float(row[AVG_MH].split(' ')[0]) if pd.notna(row[AVG_MH]) else np.nan,
                           axis=1)
print(table[:20])

Resulting in the following.

            Country/Region  ... Aveage male height (float)
0                  Albania  ...                     174.00
1                Argentina  ...                        NaN
2                Argentina  ...                     174.46
3                  Armenia  ...                        NaN
4                Australia  ...                     175.60
5                  Austria  ...                     179.00
6               Azerbaijan  ...                     171.80
7                  Bahrain  ...                     165.10
8                  Bahrain  ...                     171.00
9               Bangladesh  ...                        NaN
11                 Belgium  ...                     178.60
12                   Benin  ...                        NaN
13                 Bolivia  ...                        NaN
14                 Bolivia  ...                     160.00
15  Bosnia and Herzegovina  ...                     183.90
16                  Brazil  ...                     170.70
17          Brazil – Urban  ...                     173.50
18          Brazil – Rural  ...                     170.90
19                Bulgaria  ...                     175.20
20            Burkina Faso  ...                        NaN

Notice that np.nan is also a float, and hence all values in the new column are floats.
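
The same conversion can be written as a small named function, which is easier to try out on a few values. This is a sketch under the assumption stated above, namely that the first space-separated token is the height in cm; the name height_in_cm is made up for this example.

```python
import numpy as np
import pandas as pd


def height_in_cm(text):
    """Return the leading number of a string like '175.6 cm (5 ft 9 in)' as a float."""
    if pd.isna(text):
        # Missing values stay missing
        return np.nan
    return float(text.split(' ')[0])


heights = pd.Series(['174.46 cm (5 ft 8 1⁄2 in)', np.nan, '179 cm (5 ft 10 1⁄2 in)'])
as_float = heights.apply(height_in_cm)

print(as_float)
```

The result has dtype float64, with NaN in the position where the height was missing.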

Step 4: Merge two sets of data with different representations of countries

To make the map in the end we will use the geopandas library, which has a nice low resolution dataset used to color countries. While the geopandas data is represented as a DataFrame, it is difficult to merge it with the DataFrame we created from the read_html call, as the country names vary between the two.

An example is United States in the one we created and United States of America in the geopandas dataset. Hence, we need some means to map them to the same representation.
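
The problem can be sketched with two made-up frames. The hand-written dictionary to_code is a hypothetical stand-in for a proper lookup and only covers the names used in this example.

```python
import pandas as pd

# Made-up stand-ins for the scraped table and the geopandas dataset
heights = pd.DataFrame({'Country': ['Canada', 'United States'],
                        'height': [175.1, 175.3]})
world = pd.DataFrame({'name': ['Canada', 'United States of America']})

# Merging on the raw names misses the United States
by_name = world.merge(heights, how='left', left_on='name', right_on='Country')

# Hypothetical hand-written mapping to a shared code
to_code = {'Canada': 'CAN', 'United States': 'USA', 'United States of America': 'USA'}
heights['alpha3'] = heights['Country'].map(to_code)
world['alpha3'] = world['name'].map(to_code)

# Merging on the shared code matches both countries
by_code = world.merge(heights, how='left', on='alpha3')

print(by_name)
print(by_code)
```

In the first merge the height for United States of America is NaN, while the merge on the shared code finds a height for both countries.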

For this purpose we can use the library pycountry.

Hence, by applying it to both DataFrames we can merge them.

import pandas as pd
import numpy as np
import geopandas
import pycountry


# Helper function to map country names to alpha_3 representation - though some are not known by library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]

# To avoid writing it all the time
AVG_MH = 'Average male height'
CR = 'Country/Region'
COUNTRY = 'Country'
AMH_F = 'Aveage male height (float)'
A3 = 'alpha3'

# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()

# Clean up data to have height in cm
# (pd.notna is a more robust check for missing values than comparing with 'is not np.nan')
table[AMH_F] = table.apply(lambda row: float(row[AVG_MH].split(' ')[0]) if pd.notna(row[AVG_MH]) else np.nan,
                           axis=1)

# Keep only the country name where a dash adds a subgroup (e.g. 'Brazil – Urban')
table[COUNTRY] = table.apply(
    lambda row: row[CR].split(' – ')[0] if ' – ' in row[CR] else row[CR],
    axis=1)
# Map the country name to the alpha3 representation
table[A3] = table.apply(lambda row: lookup_country_code(row[COUNTRY]), axis=1)

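# Read the geopandas dataset
# (geopandas.datasets was removed in geopandas 1.0; with newer versions, download the
# Natural Earth low resolution countries file and pass its path to read_file instead)
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))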
# Do the same mapping to alpha3
world[A3] = world.apply(lambda row: lookup_country_code(row['name']), axis=1)

# Merge the data
table = world.merge(table, how="left", left_on=[A3], right_on=[A3])

# Remove countries with no data
table = table.dropna(subset=[AMH_F])

# These lines are just used to print the full data
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
print(table)

Which will result in the following.

        pop_est      continent                      name iso_a3  gdp_md_est                                           geometry       alpha3                                 Country/Region         Average male height      Average female height Stature ratio(male to female)                      Sample population / age range Share ofpop. over 18covered[9][10]    Methodology       Year           Source  Aveage male height (float)               Country
3      35623680  North America                    Canada    CAN   1674000.0  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...          CAN                                         Canada        175.1 cm (5 ft 9 in)       162.3 cm (5 ft 4 in)                          1.08                                              18–79                                 94.7%       Measured  2007–2009             [29]                      175.10                Canada
4     326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...          USA                                  United States        175.3 cm (5 ft 9 in)   161.5 cm (5 ft 3 1⁄2 in)                          1.09  All Americans, 20+ (N= m:5,232 f:5,547, Median...                                   69%       Measured  2011–2014            [132]                      175.30         United States
5     326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...          USA              United States – African Americans        175.5 cm (5 ft 9 in)       162.6 cm (5 ft 4 in)                          1.08  African Americans, 20–39 (N= m:532 f:612, Medi...                             3.4%[133]       Measured  2015-2016            [134]                      175.50         United States
6     326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...          USA  United States – Hispanic and Latino Americans    169.5 cm (5 ft 6 1⁄2 in)   156.7 cm (5 ft 1 1⁄2 in)                          1.08  Hispanic/Latin-Americans, 20–39 (N= m:745 f:91...                             4.4%[133]       Measured  2015–2016            [134]                      169.50         United States
7     326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...          USA              United States – Mexican Americans    168.8 cm (5 ft 6 1⁄2 in)   156.1 cm (5 ft 1 1⁄2 in)                          1.09  Mexican Americans, 20–39 (N= m:429 f:511, Medi...                             2.8%[133]       Measured  2015–2016            [134]                      168.80         United States
8     326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...          USA                United States – Asian Americans        169.7 cm (5 ft 7 in)   156.2 cm (5 ft 1 1⁄2 in)                          1.09  Non-Hispanic Asians, 20–39 (N= m:323 f:326, Me...                             1.3%[133]       Measured  2015–2016            [134]                      169.70         United States
9     326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...          USA            United States – Non-Hispanic whites    177.0 cm (5 ft 9 1⁄2 in)   163.3 cm (5 ft 4 1⁄2 in)                          1.08  Non-Hispanic White Americans, 20–39 (N= m:892 ...                            17.1%[133]       Measured  2015–2016            [134]                      177.00         United States
13    260580739           Asia                 Indonesia    IDN   3028000.0  MULTIPOLYGON (((141.00021 -2.60015, 141.01706 ...          IDN                                      Indonesia          158 cm (5 ft 2 in)        147 cm (4 ft 10 in)                          1.07  50+ (N= m:2,041 f:2,396, Median= m:158 cm (5 f...                                 22.5%  Self-reported       1997             [59]                      158.00             Indonesia
15     44293293  South America                 Argentina    ARG    879400.0  MULTIPOLYGON (((-68.63401 -52.63637, -68.25000...          ARG                                      Argentina   174.46 cm (5 ft 8 1⁄2 in)  161.01 cm (5 ft 3 1⁄2 in)                          1.08  Healthy, 18 (N= m:90 f:97, SD= m:7.43 cm (3 in...                                  2.9%       Measured  1998–2001             [14]                      174.46             Argentina
16     17789267  South America                     Chile    CHL    436100.0  MULTIPOLYGON (((-68.63401 -52.63637, -68.63335...          CHL                                          Chile        169.6 cm (5 ft 7 in)   156.1 cm (5 ft 1 1⁄2 in)                          1.09                                                15+                                107.2%       Measured  2009–2010             [30]                      169.60                 Chile
19     47615739         Africa                     Kenya    KEN    152700.0  POLYGON ((39.20222 -4.67677, 37.76690 -3.67712...          KEN                                          Kenya        169.6 cm (5 ft 7 in)                        NaN                           NaN        25–49 (N= f:1,600, SD= f:6.3 cm (2 1⁄2 in))                                 53.7%        Summary       2016             [69]                      169.60                 Kenya
20     47615739         Africa                     Kenya    KEN    152700.0  POLYGON ((39.20222 -4.67677, 37.76690 -3.67712...          KEN                                          Kenya        169.6 cm (5 ft 7 in)   158.2 cm (5 ft 2 1⁄2 in)                           NaN            25–49 (N= f:4,856, SD= f:7.3 cm (3 in))                                 52.5%         Survey       2016         [15][69]                      169.60                 Kenya
25    142257519         Europe                    Russia    RUS   3745000.0  MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...       Russia                                         Russia    171.1 cm (5 ft 7 1⁄2 in)   158.2 cm (5 ft 2 1⁄2 in)                          1.08                         44-69 (N= m: 3892 f: 4643)                                 38.5%       Measured       2007             [93]                      171.10                Russia
26    142257519         Europe                    Russia    RUS   3745000.0  MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...       Russia                                         Russia       177.2 cm (5 ft 10 in)   164.1 cm (5 ft 4 1⁄2 in)                          1.08                                                 24                                  1.9%       Measured       2004         [21][98]                      177.20                Russia
29      5320045         Europe                    Norway    -99    364700.0  MULTIPOLYGON (((15.14282 79.67431, 15.52255 80...          NOR                                         Norway   179.7 cm (5 ft 10 1⁄2 in)       167.1 cm (5 ft 6 in)                          1.09           Conscripts, 18–44 (N= m:30,884 f:28,796)                                 35.3%       Measured       2012             [88]                      179.70                Norway
30      5320045         Europe                    Norway    -99    364700.0  MULTIPOLYGON (((15.14282 79.67431, 15.52255 80...          NOR                                         Norway   179.7 cm (5 ft 10 1⁄2 in)     167 cm (5 ft 5 1⁄2 in)                          1.08                           20–85 (N= m:1534 f:1743)                                 93.6%  Self-reported  2008–2009      [9][26][89]                      179.70                Norway
34     54841552         Africa              South Africa    ZAF    739100.0  POLYGON ((16.34498 -28.57671, 16.82402 -28.082...          ZAF                                   South Africa          168 cm (5 ft 6 in)     159 cm (5 ft 2 1⁄2 in)                          1.06                                19 (N= m:121 f:118)                                  3.6%       Measured       2003            [110]                      168.00          South Africa
36    124574795  North America                    Mexico    MEX   2307000.0  POLYGON ((-117.12776 32.53534, -115.99135 32.6...          MEX                                         Mexico      172 cm (5 ft 7 1⁄2 in)     159 cm (5 ft 2 1⁄2 in)                          1.08                                              20–65                                 62.0%       Measured       2014             [83]                      172.00                Mexico
37      3360148  South America                   Uruguay    URY     73250.0  POLYGON ((-57.62513 -30.21629, -56.97603 -30.1...          URY                                        Uruguay          170 cm (5 ft 7 in)         158 cm (5 ft 2 in)                          1.08                        Adults (N= m:2,249 f:2,114)                                   NaN       Measured       1990            [135]                      170.00               Uruguay
38    207353391  South America                    Brazil    BRA   3081000.0  POLYGON ((-53.37366 -33.76838, -53.65054 -33.2...          BRA                                         Brazil        170.7 cm (5 ft 7 in)   158.8 cm (5 ft 2 1⁄2 in)                          1.07                         18+ (N= m:62,037 f:65,696)                                100.0%       Measured       2009         [25][26]                      170.70                Brazil
39    207353391  South America                    Brazil    BRA   3081000.0  POLYGON ((-53.37366 -33.76838, -53.65054 -33.2...          BRA                                 Brazil – Urban    173.5 cm (5 ft 8 1⁄2 in)   161.6 cm (5 ft 3 1⁄2 in)                          1.07                         20–24 (N= m:6,360 f:6,305)                                 10.9%       Measured       2009             [25]                      173.50                Brazil
40    207353391  South America                    Brazil    BRA   3081000.0  POLYGON ((-53.37366 -33.76838, -53.65054 -33.2...          BRA                                 Brazil – Rural    170.9 cm (5 ft 7 1⁄2 in)   158.9 cm (5 ft 2 1⁄2 in)                          1.07                         20–24 (N= m:1,939 f:1,633)                                  2.1%       Measured       2009             [25]                      170.90                Brazil
42     11138234  South America                   Bolivia    BOL     78350.0  POLYGON ((-69.52968 -10.95173, -68.78616 -11.0...          BOL                                        Bolivia        160.0 cm (5 ft 3 in)       142.2 cm (4 ft 8 in)                          1.13                                      Aymara, 20–29                                   NaN       Measured       1970             [23]                      160.00               Bolivia
43     31036656  South America                      Peru    PER    410400.0  POLYGON ((-69.89364 -4.29819, -70.79477 -4.251...          PER                                           Peru      164 cm (5 ft 4 1⁄2 in)    151 cm (4 ft 11 1⁄2 in)                          1.09                                                20+                             0.011509%       Measured       2005             [90]                      164.00                  Peru
44     47698524  South America                  Colombia    COL    688000.0  POLYGON ((-66.87633 1.25336, -67.06505 1.13011...          COL                                       Colombia        170.6 cm (5 ft 7 in)   158.7 cm (5 ft 2 1⁄2 in)                          1.07                 18–22 (N= m:1,528,875 f:1,468,110)                                 14.1%       Measured       2002             [33]                      170.60              Colombia
56     67106161         Europe                    France    -99   2699000.0  MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3...          FRA                                         France        175.6 cm (5 ft 9 in)       162.5 cm (5 ft 4 in)                          1.08                              18–70 (N= m/f:11,562)                                 85.9%       Measured  2003–2004         [45][46]                      175.60                France
57     67106161         Europe                    France    -99   2699000.0  MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3...          FRA                                         France    174.1 cm (5 ft 8 1⁄2 in)   161.9 cm (5 ft 3 1⁄2 in)                          1.08                                                20+                                 96.6%       Measured       2001              [7]                      174.10                France
58     16290913  South America                   Ecuador    ECU    182400.0  POLYGON ((-75.37322 -0.15203, -75.23372 -0.911...          ECU                                        Ecuador        167.1 cm (5 ft 6 in)     154.2 cm (5 ft 1⁄2 in)                          1.08                                                NaN                                   NaN       Measured       2014             [40]                      167.10               Ecuador
60      2990561  North America                   Jamaica    JAM     25390.0  POLYGON ((-77.56960 18.49053, -76.89662 18.400...          JAM                                        Jamaica    171.8 cm (5 ft 7 1⁄2 in)   160.8 cm (5 ft 3 1⁄2 in)                          1.07                                              25–74                                 71.4%       Measured  1994–1996             [66]                      171.80               Jamaica
61     11147407  North America                      Cuba    CUB    132900.0  POLYGON ((-82.26815 23.18861, -81.40446 23.117...          CUB                                   Cuba – Urban          168 cm (5 ft 6 in)     156 cm (5 ft 1 1⁄2 in)                          1.08                                                15+                                 79.2%       Measured       1999             [35]                      168.00                  Cuba
66     17885245         Africa                      Mali    MLI     38090.0  POLYGON ((-11.51394 12.44299, -11.46790 12.754...          MLI                           Mali – Southern Mali    171.3 cm (5 ft 7 1⁄2 in)       160.4 cm (5 ft 3 in)                          1.07  Rural adults (N= m:121 f:320, SD= m:6.6 cm (2 ...                                   NaN       Measured       1992             [81]                      171.30                  Mali
70    190632261         Africa                   Nigeria    NGA   1089000.0  POLYGON ((2.69170 6.25882, 2.74906 7.87073, 2....          NGA                                        Nigeria    163.8 cm (5 ft 4 1⁄2 in)       157.8 cm (5 ft 2 in)                          1.04                                              18–74                                 98.6%       Measured  1994–1996             [66]                      163.80               Nigeria
71    190632261         Africa                   Nigeria    NGA   1089000.0  POLYGON ((2.69170 6.25882, 2.74906 7.87073, 2....          NGA                                        Nigeria        167.2 cm (5 ft 6 in)       160.3 cm (5 ft 3 in)                          1.04  20–29 (N= m:139 f:76, SD= m:6.5 cm (2 1⁄2 in) ...                                 33.2%       Measured       2011             [87]                      167.20               Nigeria
72     24994885         Africa                  Cameroon    CMR     77240.0  POLYGON ((14.49579 12.85940, 14.89336 12.21905...          CMR                               Cameroon – Urban        170.6 cm (5 ft 7 in)   161.3 cm (5 ft 3 1⁄2 in)                          1.06                           15+ (N= m:3,746 f:5,078)                                 53.6%       Measured       2003             [28]                      170.60              Cameroon
75     27499924         Africa                     Ghana    GHA    120800.0  POLYGON ((0.02380 11.01868, -0.04978 10.70692,...          GHA                                          Ghana    169.5 cm (5 ft 6 1⁄2 in)   158.5 cm (5 ft 2 1⁄2 in)                          1.07                                              25–29                                 14.7%       Measured  1987–1989             [49]                      169.50                 Ghana
87     19196246         Africa                    Malawi    MWI     21200.0  POLYGON ((32.75938 -9.23060, 33.73972 -9.41715...          MWI                                 Malawi – Urban      166 cm (5 ft 5 1⁄2 in)         155 cm (5 ft 1 in)                          1.07  16–60 (N= m:583 f:315, SD= m:6.0 cm (2 1⁄2 in)...                                101.1%       Measured       2000             [78]                      166.00                Malawi
92      8299706           Asia                    Israel    ISR    297000.0  POLYGON ((35.71992 32.70919, 35.54567 32.39399...          ISR                                         Israel      177 cm (5 ft 9 1⁄2 in)     166 cm (5 ft 5 1⁄2 in)                          1.07                                              18–21                                  9.7%       Measured       2010             [64]                      177.00                Israel
96      2051363         Africa                    Gambia    GMB      3387.0  POLYGON ((-16.71373 13.59496, -15.62460 13.623...          GMB                                 Gambia – Rural        168.0 cm (5 ft 6 in)       157.8 cm (5 ft 2 in)                          1.06  21–49 (N= m:9,559 f:13,160, SD= m:6.7 cm (2 1⁄...                                   NaN       Measured  1950–1974             [47]                      168.00                Gambia
100     6072475           Asia      United Arab Emirates    ARE    667200.0  POLYGON ((51.57952 24.24550, 51.75744 24.29407...          ARE                           United Arab Emirates    173.4 cm (5 ft 8 1⁄2 in)   156.4 cm (5 ft 1 1⁄2 in)                          1.11                                                NaN                                   NaN            NaN        NaN            [128]                      173.40  United Arab Emirates
101     2314307           Asia                     Qatar    QAT    334500.0  POLYGON ((50.81011 24.75474, 50.74391 25.48242...          QAT                                          Qatar        170.8 cm (5 ft 7 in)   161.1 cm (5 ft 3 1⁄2 in)                          1.06                                                 18                                  1.9%       Measured       2005         [21][96]                      170.80                 Qatar
103    39192111           Asia                      Iraq    IRQ    596700.0  POLYGON ((39.19547 32.16101, 38.79234 33.37869...          IRQ                                 Iraq – Baghdad        165.4 cm (5 ft 5 in)   155.8 cm (5 ft 1 1⁄2 in)                          1.06  18–44 (N= m:700 f:800, SD= m:5.6 cm (2 in) f:1...                                 76.3%       Measured  1999–2000             [61]                      165.40                  Iraq
107    68414135           Asia                  Thailand    THA   1161000.0  POLYGON ((105.21878 14.27321, 104.28142 14.416...          THA                                       Thailand        170.3 cm (5 ft 7 in)     159 cm (5 ft 2 1⁄2 in)                          1.07  STOU students, 15–19 (N= m:839 f:1,636, SD= m:...                             0.2%[122]  Self-reported       2005            [123]                      170.30              Thailand
110    96160163           Asia                   Vietnam    VNM    594900.0  POLYGON ((104.33433 10.48654, 105.19991 10.889...          VNM                                        Vietnam        162.1 cm (5 ft 4 in)       152.2 cm (5 ft 0 in)                          1.07      25–29 (SD= m:5.39 cm (2 in) f:5.39 cm (2 in))                                 15.9%       Measured  1992–1993             [49]                      162.10               Vietnam
111    96160163           Asia                   Vietnam    VNM    594900.0  POLYGON ((104.33433 10.48654, 105.19991 10.889...          VNM                                        Vietnam        165.7 cm (5 ft 5 in)       155.2 cm (5 ft 1 in)                          1.07  Students, 20–25 (N= m:1,000 f:1,000, SD= m:6.5...                             2.0%[136]       Measured  2006–2007            [137]                      165.70               Vietnam
112    25248140           Asia               North Korea    PRK     40000.0  MULTIPOLYGON (((130.78000 42.22001, 130.78000 ...  North Korea                                    North Korea        165.6 cm (5 ft 5 in)       154.9 cm (5 ft 1 in)                          1.07                    Defectors, 20–39 (N= m/f:1,075)                                 46.4%       Measured       2005             [70]                      165.60           North Korea
113    51181299           Asia               South Korea    KOR   1929000.0  POLYGON ((126.17476 37.74969, 126.23734 37.840...  South Korea                                    South Korea        170.7 cm (5 ft 7 in)       157.4 cm (5 ft 2 in)                          1.08  20+ (N= m:2,750 f:2,445, Median= m:170.7 cm (5...                                 96.5%       Measured       2010             [71]                      170.70           South Korea
114    51181299           Asia               South Korea    KOR   1929000.0  POLYGON ((126.17476 37.74969, 126.23734 37.840...  South Korea                                    South Korea    173.5 cm (5 ft 8 1⁄2 in)                        NaN                           NaN                   Conscripts, 18–19 (N= m:323,800)                                  3.8%       Measured       2017             [72]                      173.50           South Korea
116     3068243           Asia                  Mongolia    MNG     37000.0  POLYGON ((87.75126 49.29720, 88.80557 49.47052...          MNG                                       Mongolia    168.4 cm (5 ft 6 1⁄2 in)       157.7 cm (5 ft 2 in)                          1.07                             25–34 (N= m:158 f:181)                                 27.6%       Measured       2006             [84]                      168.40              Mongolia
117  1281935911           Asia                     India    IND   8721000.0  POLYGON ((97.32711 28.26158, 97.40256 27.88254...          IND                                  India – Urban    174.3 cm (5 ft 8 1⁄2 in)   158.5 cm (5 ft 2 1⁄2 in)                          1.10  Private school students, 18 (N= m:34,411 f:30,...                                   NaN       Measured       2011             [55]                      174.30                 India
118  1281935911           Asia                     India    IND   8721000.0  POLYGON ((97.32711 28.26158, 97.40256 27.88254...          IND                                  India – Rural    161.5 cm (5 ft 3 1⁄2 in)       152.5 cm (5 ft 0 in)                          1.06       17 (SD= m:7.0 cm (3 in) f:6.3 cm (2 1⁄2 in))                                   NaN       Measured       2002             [56]                      161.50                 India
119  1281935911           Asia                     India    IND   8721000.0  POLYGON ((97.32711 28.26158, 97.40256 27.88254...          IND                                          India        164.7 cm (5 ft 5 in)       152.6 cm (5 ft 0 in)                          1.08                      20–49 (N= m:69,245 f:118,796)                                 44.3%       Measured  2005-2006             [57]                      164.70                 India
120  1281935911           Asia                     India    IND   8721000.0  POLYGON ((97.32711 28.26158, 97.40256 27.88254...          IND                        India – Patiala, Punjab       177.3 cm (5 ft 10 in)                        NaN                           NaN  Students, Punjabi, 18-25 (N: 149, SD = 7.88 cm...                                 22.4%       Measured       2013             [58]                      177.30                 India
123    29384297           Asia                     Nepal    NPL     71520.0  POLYGON ((88.12044 27.87654, 88.04313 27.44582...          NPL                                          Nepal        163.0 cm (5 ft 4 in)  150.8 cm (4 ft 11 1⁄2 in)                           NaN            25–49 (N= f:6,280, SD= f:5.5 cm (2 in))                                 52.9%  Self-reported       2006             [15]                      163.00                 Nepal
129    82021564           Asia                      Iran    IRN   1459000.0  POLYGON ((48.56797 29.92678, 48.01457 30.45246...         Iran                                           Iran        170.3 cm (5 ft 7 in)       157.2 cm (5 ft 2 in)                          1.08  21+ (N= m/f:89,532, SD= m:8.05 cm (3 in) f:7.2...                                 88.1%       Measured       2005             [60]                      170.30                  Iran
132     9960487         Europe                    Sweden    SWE    498100.0  POLYGON ((11.02737 58.85615, 11.46827 59.43239...          SWE                                         Sweden   181.5 cm (5 ft 11 1⁄2 in)   166.8 cm (5 ft 5 1⁄2 in)                          1.09                                              20–29                                 15.6%       Measured       2008            [116]                      181.50                Sweden
133     9960487         Europe                    Sweden    SWE    498100.0  POLYGON ((11.02737 58.85615, 11.46827 59.43239...          SWE                                         Sweden       177.9 cm (5 ft 10 in)       164.6 cm (5 ft 5 in)                          1.08                                              20–74                                 86.3%  Self-reported  1987–1994            [117]                      177.90                Sweden
136    38476269         Europe                    Poland    POL   1052000.0  POLYGON ((23.48413 53.91250, 23.52754 53.47012...          POL                                         Poland        172.2 cm (5 ft 8 in)       159.4 cm (5 ft 3 in)                          1.07                          44-69 (N= m:4336 f: 4559)                                 39.4%       Measured       2007             [93]                      172.20                Poland
137    38476269         Europe                    Poland    POL   1052000.0  POLYGON ((23.48413 53.91250, 23.52754 53.47012...          POL                                         Poland   178.7 cm (5 ft 10 1⁄2 in)       165.1 cm (5 ft 5 in)                          1.08                              18 (N= m:846 f:1,126)                                  1.6%       Measured       2010             [94]                      178.70                Poland
138     8754413         Europe                   Austria    AUT    416600.0  POLYGON ((16.97967 48.12350, 16.90375 47.71487...          AUT                                        Austria     179 cm (5 ft 10 1⁄2 in)     166 cm (5 ft 5 1⁄2 in)                          1.08                                              20–49                                 54.3%       Measured       2006             [17]                      179.00               Austria
139     9850845         Europe                   Hungary    HUN    267600.0  POLYGON ((22.08561 48.42226, 22.64082 48.15024...          HUN                                        Hungary      176 cm (5 ft 9 1⁄2 in)     164 cm (5 ft 4 1⁄2 in)                          1.07                                             Adults                                   NaN       Measured      2000s             [53]                      176.00               Hungary
140     9850845         Europe                   Hungary    HUN    267600.0  POLYGON ((22.08561 48.42226, 22.64082 48.15024...          HUN                                        Hungary       177.3 cm (5 ft 10 in)                        NaN                           NaN          18 (N= m:1,080, SD= m:5.99 cm (2 1⁄2 in))                                  1.7%       Measured       2005             [54]                      177.30               Hungary
142    21529967         Europe                   Romania    ROU    441000.0  POLYGON ((28.23355 45.48828, 28.67978 45.30403...          ROU                                        Romania      172 cm (5 ft 7 1⁄2 in)         157 cm (5 ft 2 in)                          1.10                                                NaN                                   NaN       Measured       2007             [97]                      172.00               Romania
143     2823859         Europe                 Lithuania    LTU     85620.0  POLYGON ((26.49433 55.61511, 26.58828 55.16718...          LTU                              Lithuania – Urban       178.4 cm (5 ft 10 in)                        NaN                           NaN  Conscripts, 19–25 (N= m:91 SD= m:6.7 cm (2 1⁄2...                                  9.9%       Measured   2005[75]             [76]                      178.40             Lithuania
144     2823859         Europe                 Lithuania    LTU     85620.0  POLYGON ((26.49433 55.61511, 26.58828 55.16718...          LTU                              Lithuania – Rural    176.2 cm (5 ft 9 1⁄2 in)                        NaN                           NaN  Conscripts, 19–25 (N= m:106 SD= m:5.9 cm (2 1⁄...                                  4.9%       Measured   2005[75]             [76]                      176.20             Lithuania
145     2823859         Europe                 Lithuania    LTU     85620.0  POLYGON ((26.49433 55.61511, 26.58828 55.16718...          LTU                                      Lithuania   181.3 cm (5 ft 11 1⁄2 in)       167.5 cm (5 ft 6 in)                          1.08                                                 18                                  2.1%       Measured       2001             [77]                      181.30             Lithuania
147     1251581         Europe                   Estonia    EST     38700.0  POLYGON ((27.98113 59.47537, 27.98112 59.47537...          EST                                        Estonia   179.1 cm (5 ft 10 1⁄2 in)                        NaN                           NaN                                                 17                                  2.3%       Measured       2003             [42]                      179.10               Estonia
148    80594017         Europe                   Germany    DEU   3979000.0  POLYGON ((14.11969 53.75703, 14.35332 53.24817...          DEU                                        Germany        175.4 cm (5 ft 9 in)       162.8 cm (5 ft 4 in)                          1.08                              18–79 (N= m/f:19,768)                                 94.3%       Measured       2007              [6]                      175.40               Germany
149    80594017         Europe                   Germany    DEU   3979000.0  POLYGON ((14.11969 53.75703, 14.35332 53.24817...          DEU                                        Germany         178 cm (5 ft 10 in)         165 cm (5 ft 5 in)                          1.08                         18+ (N= m:25,112 f:25,560)                                100.0%  Self-reported       2009             [48]                      178.00               Germany
150     7101510         Europe                  Bulgaria    BGR    143100.0  POLYGON ((22.65715 44.23492, 22.94483 43.82379...          BGR                                       Bulgaria        175.2 cm (5 ft 9 in)   163.2 cm (5 ft 4 1⁄2 in)                          1.07                                                NaN                                   NaN            NaN       2010             [27]                      175.20              Bulgaria
151    10768477         Europe                    Greece    GRC    290500.0  MULTIPOLYGON (((26.29000 35.29999, 26.16500 35...          GRC                                         Greece      177 cm (5 ft 9 1⁄2 in)         165 cm (5 ft 5 in)                          1.07                                              18–49                                 56.3%       Measured       2003             [17]                      177.00                Greece
152    80845215           Asia                    Turkey    TUR   1670000.0  MULTIPOLYGON (((44.77268 37.17044, 44.29345 37...          TUR                                         Turkey    173.6 cm (5 ft 8 1⁄2 in)   161.9 cm (5 ft 3 1⁄2 in)                          1.07                             20-22 (N= m:322 f:247)                                  8.3%       Measured       2007    [11][21][125]                      173.60                Turkey
153    80845215           Asia                    Turkey    TUR   1670000.0  MULTIPOLYGON (((44.77268 37.17044, 44.29345 37...          TUR                                Turkey – Ankara    174.1 cm (5 ft 8 1⁄2 in)   158.9 cm (5 ft 2 1⁄2 in)                          1.10  18–59 (N= m:703 f:512, Median= m:169.7 cm (5 f...                             5.1%[126]       Measured  2004–2006            [127]                      174.10                Turkey
155     3047987         Europe                   Albania    ALB     33900.0  POLYGON ((21.02004 40.84273, 20.99999 40.58000...          ALB                                        Albania    174.0 cm (5 ft 8 1⁄2 in)   161.8 cm (5 ft 3 1⁄2 in)                          1.08                           20–29 (N= m:649 f:1,806)                                 23.5%       Measured  2008–2009         [11][12]                      174.00               Albania
156     4292095         Europe                   Croatia    HRV     94240.0  POLYGON ((16.56481 46.50375, 16.88252 46.38063...          HRV                                        Croatia       180.4 cm (5 ft 11 in)  166.49 cm (5 ft 5 1⁄2 in)                          1.09  18 (N= m:358 f:360, SD= m:6.8 cm (2 1⁄2 in) f:...                                  1.6%       Measured  2006–2008             [34]                      180.40               Croatia
157     8236303         Europe               Switzerland    CHE    496300.0  POLYGON ((9.59423 47.52506, 9.63293 47.34760, ...          CHE                                    Switzerland       178.2 cm (5 ft 10 in)                        NaN                           NaN  Conscripts, 19 (N= m:12,447, Median= m:178.0 c...                                  1.5%       Measured       2009            [118]                      178.20           Switzerland
158     8236303         Europe               Switzerland    CHE    496300.0  POLYGON ((9.59423 47.52506, 9.63293 47.34760, ...          CHE                                    Switzerland        175.4 cm (5 ft 9 in)     164 cm (5 ft 4 1⁄2 in)                          1.07                                              20–74                                 88.8%  Self-reported  1987–1994            [117]                      175.40           Switzerland
160    11491346         Europe                   Belgium    BEL    508600.0  POLYGON ((6.15666 50.80372, 6.04307 50.12805, ...          BEL                                        Belgium   178.6 cm (5 ft 10 1⁄2 in)       168.1 cm (5 ft 6 in)                          1.06  21 (N= m:20–49 f:20–49, SD= m:6.6 cm (2 1⁄2 in...                                  1.7%  Self-reported       2001             [22]                      178.60               Belgium
161    17084719         Europe               Netherlands    NLD    870800.0  POLYGON ((6.90514 53.48216, 7.09205 53.14404, ...          NLD                                    Netherlands       180.8 cm (5 ft 11 in)       167.5 cm (5 ft 6 in)                          1.08                                                20+                                 96.8%  Self-reported       2013      [9][26][86]                      180.80           Netherlands
162    10839514         Europe                  Portugal    PRT    297100.0  POLYGON ((-9.03482 41.88057, -8.67195 42.13469...          PRT                                       Portugal    173.9 cm (5 ft 8 1⁄2 in)                        NaN                           NaN                                      18 (N= m:696)                                  1.5%       Measured       2008         [11][95]                      173.90              Portugal
163    10839514         Europe                  Portugal    PRT    297100.0  POLYGON ((-9.03482 41.88057, -8.67195 42.13469...          PRT                                       Portugal      171 cm (5 ft 7 1⁄2 in)     161 cm (5 ft 3 1⁄2 in)                          1.06                                              20–50                                 56.7%  Self-reported       2001             [17]                      171.00              Portugal
164    10839514         Europe                  Portugal    PRT    297100.0  POLYGON ((-9.03482 41.88057, -8.67195 42.13469...          PRT                                       Portugal    173.7 cm (5 ft 8 1⁄2 in)   163.7 cm (5 ft 4 1⁄2 in)                          1.06  21 (N= m:87 f:106, SD= m:8.2 cm (3 in) f:5.3 c...                                  1.9%  Self-reported       2001             [22]                      173.70              Portugal
165    48958159         Europe                     Spain    ESP   1690000.0  POLYGON ((-7.45373 37.09779, -7.53711 37.42890...          ESP                                          Spain        173.1 cm (5 ft 8 in)                        NaN                           NaN                       18–70 (N= m:1,298 [s][112] )                                 88.2%       Measured  2013–2014       [113][114]                      173.10                 Spain
167    48958159         Europe                     Spain    ESP   1690000.0  POLYGON ((-7.45373 37.09779, -7.53711 37.42890...          ESP                                          Spain      174 cm (5 ft 8 1⁄2 in)         163 cm (5 ft 4 in)                          1.07                                              20–49                                 57.0%  Self-reported       2007             [17]                      174.00                 Spain
168     5011102         Europe                   Ireland    IRL    322000.0  POLYGON ((-6.19788 53.86757, -6.03299 53.15316...          IRL                                        Ireland      177 cm (5 ft 9 1⁄2 in)         163 cm (5 ft 4 in)                          1.09                                              20–49                                 61.8%       Measured       2007             [17]                      177.00               Ireland
169     5011102         Europe                   Ireland    IRL    322000.0  POLYGON ((-6.19788 53.86757, -6.03299 53.15316...          IRL                                        Ireland     179 cm (5 ft 10 1⁄2 in)         165 cm (5 ft 5 in)                          1.08                                                 18                                     -       Measured       2014         [62][63]                      179.00               Ireland
172     4510327        Oceania               New Zealand    NZL    174800.0  MULTIPOLYGON (((176.88582 -40.06598, 176.50802...          NZL                                    New Zealand      177 cm (5 ft 9 1⁄2 in)     164 cm (5 ft 4 1⁄2 in)                          1.08                                              20–49                                 56.9%       Measured       2007             [17]                      177.00           New Zealand
173    23232413        Oceania                 Australia    AUS   1189000.0  MULTIPOLYGON (((147.68926 -40.80826, 148.28907...          AUS                                      Australia        175.6 cm (5 ft 9 in)   161.8 cm (5 ft 3 1⁄2 in)                          1.09                                                18+                                100.0%       Measured  2011–2012             [16]                      175.60             Australia
174    22409381           Asia                 Sri Lanka    LKA    236700.0  POLYGON ((81.78796 7.52306, 81.63732 6.48178, ...          LKA                                      Sri Lanka    163.6 cm (5 ft 4 1⁄2 in)  151.4 cm (4 ft 11 1⁄2 in)                          1.08  18+ (N= m:1,768 f:2,709, SD= m:6.9 cm (2 1⁄2 i...                                100.0%       Measured  2005–2006            [111]                      163.60             Sri Lanka
175  1379302771           Asia                     China    CHN  21140000.0  MULTIPOLYGON (((109.47521 18.19770, 108.65521 ...          CHN                                          China    169.5 cm (5 ft 6 1⁄2 in)       158.0 cm (5 ft 2 in)                          1.07                                  18-69 (N=172,422)                                 76.8%       Measured       2014             [31]                      169.50                 China
176  1379302771           Asia                     China    CHN  21140000.0  MULTIPOLYGON (((109.47521 18.19770, 108.65521 ...          CHN                        China – Beijing – Urban        175.2 cm (5 ft 9 in)       162.6 cm (5 ft 4 in)                          1.08                         Urban, 18 (N= m:448 f:405)                                  0.5%       Measured       2011             [32]                      175.20                 China
177    23508428           Asia                    Taiwan    TWN   1127000.0  POLYGON ((121.77782 24.39427, 121.17563 22.790...          TWN                                         Taiwan    171.4 cm (5 ft 7 1⁄2 in)       159.9 cm (5 ft 3 in)                          1.07                                17 (N= m:200 f:200)                                  1.7%       Measured       2011  [119][120][121]                      171.40                Taiwan
178    62137802         Europe                     Italy    ITA   2221000.0  MULTIPOLYGON (((10.44270 46.89355, 11.04856 46...          ITA                                          Italy    176.5 cm (5 ft 9 1⁄2 in)       162.5 cm (5 ft 4 in)                          1.09                                                 18                                  1.4%       Measured  1999–2004     [11][21][65]                      176.50                 Italy
179    62137802         Europe                     Italy    ITA   2221000.0  MULTIPOLYGON (((10.44270 46.89355, 11.04856 46...          ITA                                          Italy       177.2 cm (5 ft 10 in)       167.8 cm (5 ft 6 in)                          1.06  21 (N= m:106 f:92, SD= m:6.0 cm (2 1⁄2 in) f:6...                                  1.4%  Self-reported       2001             [22]                      177.20                 Italy
180     5605948         Europe                   Denmark    DNK    264800.0  MULTIPOLYGON (((9.92191 54.98310, 9.28205 54.8...          DNK                                        Denmark       180.4 cm (5 ft 11 in)       167.2 cm (5 ft 6 in)                           NaN                    Conscripts, 18–20 (N= m:38,025)                                  5.3%       Measured       2012             [37]                      180.40               Denmark
181    64769452         Europe            United Kingdom    GBR   2788000.0  MULTIPOLYGON (((-6.19788 53.86757, -6.95373 54...          GBR                       United Kingdom – England        175.3 cm (5 ft 9 in)   161.9 cm (5 ft 3 1⁄2 in)                          1.08                           16+ (N= m:3,154 f:3,956)                           103.2%[129]       Measured       2012              [5]                      175.30        United Kingdom
182    64769452         Europe            United Kingdom    GBR   2788000.0  MULTIPOLYGON (((-6.19788 53.86757, -6.95373 54...          GBR                      United Kingdom – Scotland        175.0 cm (5 ft 9 in)   161.3 cm (5 ft 3 1⁄2 in)                          1.08  16+ (N= m:2,512 f:3,180, Median= m:174.8 cm (5...                           103.0%[129]       Measured       2008            [130]                      175.00        United Kingdom
183    64769452         Europe            United Kingdom    GBR   2788000.0  MULTIPOLYGON (((-6.19788 53.86757, -6.95373 54...          GBR                         United Kingdom – Wales    177.0 cm (5 ft 9 1⁄2 in)       162.0 cm (5 ft 4 in)                          1.09                                                16+                           103.2%[129]  Self-reported       2009            [131]                      177.00        United Kingdom
184      339747         Europe                   Iceland    ISL     16150.0  POLYGON ((-14.50870 66.45589, -14.73964 65.808...          ISL                                        Iceland     181 cm (5 ft 11 1⁄2 in)         168 cm (5 ft 6 in)                          1.08                                              20–49                                 43.6%  Self-reported       2007             [17]                      181.00               Iceland
185     9961396           Asia                Azerbaijan    AZE    167900.0  MULTIPOLYGON (((46.40495 41.86068, 46.68607 41...          AZE                                     Azerbaijan    171.8 cm (5 ft 7 1⁄2 in)       165.4 cm (5 ft 5 in)                          1.04                                                16+                                106.5%       Measured       2005             [18]                      171.80            Azerbaijan
187   104256076           Asia               Philippines    PHL    801900.0  MULTIPOLYGON (((120.83390 12.70450, 120.32344 ...          PHL                                    Philippines    163.5 cm (5 ft 4 1⁄2 in)       151.8 cm (5 ft 0 in)                          1.08                                              20–39                             31.5%[91]       Measured       2003             [92]                      163.50           Philippines
188    31381992           Asia                  Malaysia    MYS    863000.0  MULTIPOLYGON (((100.08576 6.46449, 100.25960 6...          MYS                                       Malaysia    166.3 cm (5 ft 5 1⁄2 in)       154.7 cm (5 ft 1 in)                          1.07  Malay, 20–24 (N= m:749 f:893, Median= m:166 cm...                              9.7%[79]       Measured       1996             [80]                      166.30              Malaysia
189    31381992           Asia                  Malaysia    MYS    863000.0  MULTIPOLYGON (((100.08576 6.46449, 100.25960 6...          MYS                                       Malaysia    168.5 cm (5 ft 6 1⁄2 in)       158.1 cm (5 ft 2 in)                          1.07  Chinese, 20–24 (N= m:407 f:453, Median= m:169 ...                              4.1%[79]       Measured       1996             [80]                      168.50              Malaysia
190    31381992           Asia                  Malaysia    MYS    863000.0  MULTIPOLYGON (((100.08576 6.46449, 100.25960 6...          MYS                                       Malaysia    169.1 cm (5 ft 6 1⁄2 in)       155.4 cm (5 ft 1 in)                          1.09  Indian, 20–24 (N= m:113 f:140, Median= m:168 c...                              1.2%[79]       Measured       1996             [80]                      169.10              Malaysia
191    31381992           Asia                  Malaysia    MYS    863000.0  MULTIPOLYGON (((100.08576 6.46449, 100.25960 6...          MYS                                       Malaysia    163.3 cm (5 ft 4 1⁄2 in)       151.9 cm (5 ft 0 in)                          1.08  Other indigenous, 20–24 (N= m:257 f:380, Media...                              0.4%[79]       Measured       1996             [80]                      163.30              Malaysia
193     1972126         Europe                  Slovenia    SVN     68350.0  POLYGON ((13.80648 46.50931, 14.63247 46.43182...          SVN                           Slovenia – Ljubljana       180.3 cm (5 ft 11 in)       167.4 cm (5 ft 6 in)                          1.08                                                 19                             0.2%[108]       Measured       2011            [109]                      180.30              Slovenia
194     5491218         Europe                   Finland    FIN    224137.0  POLYGON ((28.59193 69.06478, 28.44594 68.36461...          FIN                                        Finland   178.9 cm (5 ft 10 1⁄2 in)       165.3 cm (5 ft 5 in)                          1.08                               25–34 (N= m/f:2,305)                                 19.0%       Measured       1994             [43]                      178.90               Finland
195     5491218         Europe                   Finland    FIN    224137.0  POLYGON ((28.59193 69.06478, 28.44594 68.36461...          FIN                                        Finland       180.7 cm (5 ft 11 in)       167.2 cm (5 ft 6 in)                          1.08                                −25 (N= m/f:26,636)                                  9.2%       Measured  2010–2011         [43][44]                      180.70               Finland
196     5445829         Europe                  Slovakia    SVK    168800.0  POLYGON ((22.55814 49.08574, 22.28084 48.82539...          SVK                                       Slovakia   179.4 cm (5 ft 10 1⁄2 in)       165.6 cm (5 ft 5 in)                          1.08                                                 18                                  2.0%       Measured       2004            [107]                      179.40              Slovakia
197    10674723         Europe                   Czechia    CZE    350900.0  POLYGON ((15.01700 51.10667, 15.49097 50.78473...          CZE                                 Czech Republic       180.3 cm (5 ft 11 in)      167.22 cm (5 ft 6 in)                          1.08                                                 17                                  1.6%       Measured       2001             [36]                      180.30        Czech Republic
199   126451398           Asia                     Japan    JPN   4932000.0  MULTIPOLYGON (((141.88460 39.18086, 140.95949 ...          JPN                                          Japan      172 cm (5 ft 7 1⁄2 in)         158 cm (5 ft 2 in)                          1.08                                              20–49                                 47.2%       Measured       2005             [17]                      172.00                 Japan
200   126451398           Asia                     Japan    JPN   4932000.0  MULTIPOLYGON (((141.88460 39.18086, 140.95949 ...          JPN                                          Japan    172.0 cm (5 ft 7 1⁄2 in)  158.70 cm (5 ft 2 1⁄2 in)                          1.08  20–24 (N= m:1,708 f:1,559, SD= m:5.42 cm (2 in...                                  7.2%       Measured       2004             [67]                      172.00                 Japan
201   126451398           Asia                     Japan    JPN   4932000.0  MULTIPOLYGON (((141.88460 39.18086, 140.95949 ...          JPN                                          Japan        170.7 cm (5 ft 7 in)       158.0 cm (5 ft 2 in)                          1.08                                                 17                                  1.2%       Measured       2013             [68]                      170.70                 Japan
204    28571770           Asia              Saudi Arabia    SAU   1731000.0  POLYGON ((34.95604 29.35655, 36.06894 29.19749...          SAU                                   Saudi Arabia    168.9 cm (5 ft 6 1⁄2 in)   156.3 cm (5 ft 1 1⁄2 in)                          1.08                                                 18                                  3.0%       Measured       2010        [21][100]                      168.90          Saudi Arabia
205    28571770           Asia              Saudi Arabia    SAU   1731000.0  POLYGON ((34.95604 29.35655, 36.06894 29.19749...          SAU                                   Saudi Arabia      174 cm (5 ft 8 1⁄2 in)                        NaN                           NaN                                                NaN                                   NaN            NaN       2017            [101]                      174.00          Saudi Arabia
210    97041072         Africa                     Egypt    EGY   1105000.0  POLYGON ((36.86623 22.00000, 32.90000 22.00000...          EGY                                          Egypt        170.3 cm (5 ft 7 in)   158.9 cm (5 ft 2 1⁄2 in)                          1.07                           20–24 (N= m:845 f:1,059)                                 16.6%       Measured       2008             [41]                      170.30                 Egypt
220     7111024         Europe                    Serbia    SRB    101800.0  POLYGON ((18.82982 45.90887, 18.82984 45.90888...          SRB                                         Serbia   182.0 cm (5 ft 11 1⁄2 in)   166.8 cm (5 ft 5 1⁄2 in)                          1.09  Students at UNS,18–30 (N= m:318 f:76, SD= m:6....                             0.7%[102]       Measured       2012            [103]                      182.00                Serbia
221      642550         Europe                Montenegro    MNE     10610.0  POLYGON ((20.07070 42.58863, 19.80161 42.50009...          MNE                                     Montenegro        183.4 cm (6 ft 0 in)   169.4 cm (5 ft 6 1⁄2 in)                          1.09  17-20 (N= m:981 f:1107, SD= m:6.89 cm (2 1⁄2 i...                                  5.2%       Measured       2017             [85]                      183.40            Montenegro
222     1895250         Europe                    Kosovo    -99     18490.0  POLYGON ((20.59025 41.85541, 20.52295 42.21787...       Kosovo                             Kosovo – Prishtina  179.52 cm (5 ft 10 1⁄2 in)      165.72 cm (5 ft 5 in)                           NaN  Conscripts, 18-20 (N= m:830 f:793, SD= m:7.02 ...                                 63.0%       Measured       2017             [74]                      179.52                Kosovo

Also, notice that we removed the rows that have no average male height.
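As a minimal sketch of that row removal, the toy table below imitates three rows. The column name male_height_cm and the country "Nowhere" are made up for illustration; the Japan and Saudi Arabia values are copied from the output above. It shows that only a missing value in the chosen column drops a row, which is why the Saudi Arabia row with no female height is still in the table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged table; "Nowhere" and the column
# name "male_height_cm" are hypothetical, used only for this sketch.
toy = pd.DataFrame({
    "Country": ["Japan", "Saudi Arabia", "Nowhere"],
    "Average female height": ["158 cm (5 ft 2 in)", np.nan, np.nan],
    "male_height_cm": [172.0, 174.0, np.nan],
})

# Only NaN in the subset column removes a row; NaN elsewhere is kept.
kept = toy.dropna(subset=["male_height_cm"])
print(kept["Country"].tolist())
```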

Step 5: Create the map with our data

Now we have done all the hard work.

It is time to use folium to do the last piece of work.

Let’s put it all together.

import pandas as pd
import numpy as np
import folium
import geopandas
import pycountry


# Map a country name to its ISO alpha-3 code; names the library does not know are returned unchanged
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country


# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)

# The data is in the first table
table = tables[0]

# To avoid writing it all the time
AVG_MH = 'Average male height'
CR = 'Country/Region'
COUNTRY = 'Country'
AMH_F = 'Aveage male height (float)'
A3 = 'alpha3'

# Remove the repeated header rows (rows where the cell equals the column name 'Average male height')
table = table.loc[table[AVG_MH] != AVG_MH].copy()

# Clean up data to have height in cm
table[AMH_F] = table.apply(lambda row: float(row[AVG_MH].split(' ')[0]) if pd.notna(row[AVG_MH]) else np.nan,
                           axis=1)

# Keep only the country name when a region follows a dash (e.g. 'United Kingdom – Wales')
table[COUNTRY] = table.apply(
    lambda row: row[CR].split(' – ')[0] if ' – ' in row[CR] else row[CR],
    axis=1)
# Map the country name to the alpha3 representation
table[A3] = table.apply(lambda row: lookup_country_code(row[COUNTRY]), axis=1)

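# Read the Natural Earth low-resolution world dataset bundled with geopandas
# (geopandas.datasets was removed in GeoPandas 1.0, so this line needs an older version)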
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Do the same mapping to alpha3
world[A3] = world.apply(lambda row: lookup_country_code(row['name']), axis=1)

# Merge the data
table = world.merge(table, how="left", left_on=[A3], right_on=[A3])

# Remove countries with no data
table = table.dropna(subset=[AMH_F])

# Creating a map
my_map = folium.Map()

# Adding the data from our table
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=[A3, AMH_F],
    key_on='feature.properties.alpha3',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Male height'
).add_to(my_map)
# Save the map to an html file
my_map.save('height_map.html')

This should produce a map like the following, which you can open in your browser and zoom in and out of.

The result.

This is nice. Good job.
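If you want to check the two string clean-up steps from the script in isolation, here is a small sketch in plain Python. The function names height_in_cm and country_only are made up for this illustration and are not part of the script above; the sample strings are taken from the table output.

```python
# Hypothetical helper names, used only for this sketch.
def height_in_cm(text):
    """Return the leading number of a string like '175.0 cm (5 ft 9 in)'."""
    return float(text.split(' ')[0])


def country_only(name):
    """Drop a region suffix such as ' – Scotland' (an en dash, not a hyphen)."""
    # split returns the whole name as the only element when there is no dash
    return name.split(' – ')[0]


print(height_in_cm('175.0 cm (5 ft 9 in)'))
print(country_only('United Kingdom – Scotland'))
print(country_only('Iceland'))
```

Note that the separator is an en dash surrounded by spaces, so names containing an ordinary hyphen are left untouched.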