We will investigate whether we can create a decent video mosaic effect on a live webcam stream using OpenCV, Numba, and Python. First we will learn the simple way to create a video mosaic and investigate its performance. Then we will extend it to create a better quality video mosaic, and finally try to win back performance by trading away some of that quality.
Step 1: How does simple photo mosaic work?
A photographic mosaic is a photo composed of many small images. A black and white example is given here.
The above is not a perfect example, as it was generated with speed in mind, to run smoothly on a webcam stream. It is also done in gray scale to improve performance.
The idea is to reconstruct the original image (photograph) as a mosaic of a lot of smaller sampled images. In the above, the original frame is 640x480 pixels and the mosaic is constructed of small images of size 16x12 pixels, giving a grid of 640/16 = 40 by 480/12 = 40, i.e. 1,600 small images per frame.
The first thing we want to achieve is to create a simple mosaic. A simple mosaic is when the original image is scaled down and each pixel is then exchanged with one small image with the same average color. This is simple and efficient to do.
On a high level this is the process.
Have a collection C of small images used to create the photographic mosaic.
Scale down the photo P you want to create a mosaic of.
For each pixel in photo P, find the image I from C whose average color is closest to that pixel. Insert image I to represent the pixel.
This explains the simple way of doing it. The next question is: will it be efficient enough to process a live webcam stream?
Step 2: Create a collection of small images
To optimize performance we have chosen to work in gray scale. The first step is to collect the images you want to use. These can be any pictures.
We have used photos from Pexels, which are all free for use without copyright.
What we need is to convert them all to gray scale and resize them to fit our purpose.
The script assumes that the images we want to convert to gray scale and resize are located in the local folder pics. Further, we assume that the output images (the processed images) will be put into an already existing folder small-pics-16x12.
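The original script is not shown here, but a minimal sketch under the assumptions above (input folder pics, output folder small-pics-16x12, output size 16x12) could look like this:
import glob
import os

import cv2

# Assumed folder names: input images in 'pics', output in 'small-pics-16x12'
input_path = "pics"
output_path = "small-pics-16x12"

for i, filename in enumerate(glob.glob(os.path.join(input_path, "*"))):
    img = cv2.imread(filename)
    if img is None:  # skip files OpenCV cannot read
        continue
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (16, 12))
    cv2.imwrite(os.path.join(output_path, f"{i:05d}.jpg"), small)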
Step 3: Get a live stream from the webcam
On a high level a live stream from a webcam is given in the following diagram.
This process framework is given in the code below.
import cv2
import numpy as np

def process(frame):
    return frame

def main():
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Update the frame
        updated_frame = process(gray)
        # Show the frame in a window
        cv2.imshow('WebCam', updated_frame)
        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break
    # When everything is done, release the capture
    cap.release()
    cv2.destroyAllWindows()

main()
The above code is just an empty shell; the call to process is where all the processing will happen. As it stands, it just opens a window that shows the gray scale webcam image.
Step 4: The simple video mosaic
We need to introduce two main things to create this simple video mosaic.
Loading all the images we need to use (the 16×12 gray scale images).
Fill out the processing of each frame, which replaces each 16×12 box of the frame with the best matching image.
The first step is preprocessing and should be done before we enter the main loop of the webcam capturing. The second part is done in each iteration inside the process function.
import cv2
import numpy as np
import glob
import os

def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)

def process(frame, images, box_height=12, box_width=16):
    height, width = frame.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame[i:i + box_height, j:j + box_width]
            mean = np.mean(roi[:, :])
            roi[:, :] = images[int((len(images) - 1)*mean/256)]
    return frame

def main(images):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Update the frame
        mosaic_frame = process(gray, images)
        # Show the frames in windows
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)
        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break
    # When everything is done, release the capture
    cap.release()
    cv2.destroyAllWindows()

images = preprocess()
main(images)
The preprocess function reads all the images, converts them to gray scale (to have only 1 channel per pixel), and returns them as one NumPy array for optimized code.
The process function breaks the frame down into blocks of 16x12 pixels, computes the average gray scale of each block, and picks the estimated best match. Notice the average (mean) value is a float; hence, we can differentiate between more than 256 gray levels.
In this example we used 1,885 images.
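To make the index lookup concrete, here is a hypothetical example. Note that this scheme implicitly assumes the collection is sorted from dark to bright (files.sort() sorts by filename, so the brightness would have to be encoded in the filenames).
# Hypothetical numbers: 1,885 images and a block with mean brightness 128.0
n_images = 1885
mean = 128.0
index = int((n_images - 1) * mean / 256)
print(index)  # 942 -- an image from the middle of the (brightness-sorted) collection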
A result can be seen here.
The result is decent but not good.
Step 5: Testing the performance and improving it with Numba
Is the performance good enough for a live stream? Let's test it.
Process time 0.02651691436767578 seconds
Process time 0.026834964752197266 seconds
Process time 0.025418996810913086 seconds
Process time 0.02562689781188965 seconds
Process time 0.025369882583618164 seconds
Process time 0.025450944900512695 seconds
These are a few lines from the output: about 0.025-0.027 seconds per frame.
Let's bring Numba into the equation. Numba is a just-in-time compiler for NumPy code. That means it compiles the Python code to machine code for speed. If you are new to Numba we recommend you read this tutorial.
import cv2
import numpy as np
import glob
import os
import time
from numba import jit

def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)

@jit(nopython=True)
def process(frame, images, box_height=12, box_width=16):
    height, width = frame.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame[i:i + box_height, j:j + box_width]
            mean = np.mean(roi[:, :])
            roi[:, :] = images[int((len(images) - 1)*mean/256)]
    return frame

def main(images):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Update the frame and time the processing
        start = time.time()
        mosaic_frame = process(gray, images)
        print("Process time", time.time() - start, "seconds")
        # Show the frames in windows
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)
        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break
    # When everything is done, release the capture
    cap.release()
    cv2.destroyAllWindows()

images = preprocess()
main(images)
This gives the following performance.
Process time 0.0014820098876953125 seconds
Process time 0.0013887882232666016 seconds
Process time 0.0015859603881835938 seconds
Process time 0.0016350746154785156 seconds
Process time 0.0018379688262939453 seconds
Process time 0.0016241073608398438 seconds
That is a factor 15-20 speed improvement.
Good enough for live streaming. But the visual result is still not great.
Step 6: A more advanced video mosaic approach
The more advanced video mosaic consists of matching each replacement box against the candidate images pixel by pixel.
import cv2
import numpy as np
import glob
import os
import time
from numba import jit

def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)

@jit(nopython=True)
def process(frame, images, box_height=12, box_width=16):
    height, width = frame.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame[i:i + box_height, j:j + box_width]
            best_match = np.inf
            best_match_index = 0
            for k in range(1, images.shape[0]):
                total_sum = np.sum(np.where(roi > images[k], roi - images[k], images[k] - roi))
                if total_sum < best_match:
                    best_match = total_sum
                    best_match_index = k
            roi[:, :] = images[best_match_index]
    return frame

def main(images):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Update the frame and time the processing
        start = time.time()
        mosaic_frame = process(gray, images)
        print("Process time", time.time() - start, "seconds")
        # Show the frames in windows
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)
        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break
    # When everything is done, release the capture
    cap.release()
    cv2.destroyAllWindows()

images = preprocess()
main(images)
The np.where construction computes the pixel-wise absolute difference, which is needed because we work with unsigned 8-bit integers, where a plain subtraction would wrap around. It calculates the difference between each pixel in the region of interest (roi) and images[k] and sums them up. This is a very expensive calculation, as we will see.
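To see why the absolute difference needs the np.where detour with unsigned 8-bit integers, consider this small standalone example:
import numpy as np

a = np.array([1, 200], dtype=np.uint8)
b = np.array([2, 100], dtype=np.uint8)
print(a - b)                          # [255 100] -- 1 - 2 wraps around to 255
print(np.where(a > b, a - b, b - a))  # [  1 100] -- the absolute difference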
Performance shows the following.
Process time 7.030380010604858 seconds
Process time 7.034134149551392 seconds
Process time 7.105709075927734 seconds
Process time 7.138839960098267 seconds
Over 7 seconds per frame. The result is what you would expect with this number of images, but the performance is far too slow for a smooth live webcam stream.
The result can be seen here.
Step 7: Compromise options
There are various options to trade quality for speed, and we will not investigate them all. Here are some.
Use fewer images in our collection (fewer than the 1,885 images). Notice that using half the images will only cut the matching time roughly in half.
Use bigger image sizes, e.g. scaling up to 32x24 images. We would still need to do a lot of processing per pixel, hence the speedup might be less than expected.
Make a compromised version of the difference calculation (total_sum). This has great potential, but might have undesired effects.
Scale down the pixel estimation for fewer calculations.
We will try the last two.
First, let’s try to exchange the calculation of total_sum, which is our distance function measuring how close an image is. Say we use this.
total_sum = np.sum(np.subtract(roi, images[k]))
This can overflow: with unsigned 8-bit integers a calculation like 1 - 2 wraps to 255, which is undesired. On the other hand, it will happen in an expected 50% of the cases, and maybe it will skew the calculation evenly across all images.
Let’s try.
Process time 1.857623815536499 seconds
Process time 1.7193729877471924 seconds
Process time 1.7445549964904785 seconds
Process time 1.707035779953003 seconds
Process time 1.6778359413146973 seconds
Wow. That is a speedup of a factor 4-6 per frame. The quality is still fine, though you will notice a poorly matched image from time to time. But the result is close to the advanced video mosaic and far better than the first simple video mosaic.
Another change we could make is to estimate each box by only 4 pixels (a 2x2 block). This should still be better than the simple video mosaic approach, which uses a single value per box. The full code is given below.
import cv2
import numpy as np
import glob
import os
import time
from numba import jit

def preprocess():
    path = "small-pics-16x12"
    files = glob.glob(os.path.join(path, "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
    return np.stack(images)

def preprocess2(images, scale_width=8, scale_height=6):
    scaled = []
    _, height, width = images.shape
    print("Dimensions", width, height)
    width //= scale_width
    height //= scale_height
    print("Scaled Dimensions", width, height)
    for i in range(images.shape[0]):
        scaled.append(cv2.resize(images[i], (width, height)))
    return np.stack(scaled)

@jit(nopython=True)
def process3(frame, frame_scaled, images, scaled, box_height=12, box_width=16, scale_width=8, scale_height=6):
    height, width = frame.shape
    width //= scale_width
    height //= scale_height
    box_width //= scale_width
    box_height //= scale_height
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = frame_scaled[i:i + box_height, j:j + box_width]
            best_match = np.inf
            best_match_index = 0
            for k in range(1, scaled.shape[0]):
                total_sum = np.sum(roi - scaled[k])
                if total_sum < best_match:
                    best_match = total_sum
                    best_match_index = k
            frame[i*scale_height:(i + box_height)*scale_height, j*scale_width:(j + box_width)*scale_width] = images[best_match_index]
    return frame

def main(images, scaled):
    # Get the webcam (default webcam is 0)
    cap = cv2.VideoCapture(0)
    # If your webcam does not support 640 x 480, this will find another resolution
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    while True:
        # Read a frame from the webcam
        _, frame = cap.read()
        # Flip the frame
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Update the frame and time the processing
        start = time.time()
        gray_scaled = cv2.resize(gray, (640//8, 480//6))
        mosaic_frame = process3(gray, gray_scaled, images, scaled)
        print("Process time", time.time() - start, "seconds")
        # Show the frames in windows
        cv2.imshow('Mosaic Video', mosaic_frame)
        cv2.imshow('Webcam', frame)
        # Check if q has been pressed to quit
        if cv2.waitKey(1) == ord('q'):
            break
    # When everything is done, release the capture
    cap.release()
    cv2.destroyAllWindows()

images = preprocess()
scaled = preprocess2(images)
main(images, scaled)
Note the added preprocessing step (preprocess2) that scales down the collection images. The process time is now as follows.
Process time 0.5559628009796143 seconds
Process time 0.5979928970336914 seconds
Process time 0.5543379783630371 seconds
Process time 0.5621011257171631 seconds
Which is okay, but still less than 2 frames per second.
The result can be seen here.
It is not all bad. It is still better than the simple video mosaic approach.
The result is not perfect. If you want to use it on a live webcam stream at 25-30 frames per second, you need to find further optimizations or live with the simple video mosaic approach.
Numba is a just-in-time compiler for Python that works amazingly well with NumPy. As we saw in the last tutorial, built-in vectorization can, depending on the case and the instance size, be faster than Numba.
Here we will explore that further and also see how Numba compares with lambda functions. Lambda functions have the advantage that they can be passed as an argument down to a library, which can then optimize the performance without depending on slow Python code.
Step 1: Example of Vectorization slower than Numba
In the previous tutorial we only investigated an example of vectorization that was faster than Numba. Here we will see that this is not always the case.
import numpy as np
from numba import jit
import time

size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000

@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c

def add_vectorized(a, b):
    return a + b

# We call the function once, to precompile the code
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))
start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))
Varying the size of the NumPy array, we can see the performance between the two in the graph below.
Where it is clear that the vectorized approach is slower.
Step 2: Try some more complex example comparing vectorized and Numba
An if-then-else can be expressed vectorized using NumPy's where function.
import numpy as np
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
iterations = 1000

@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c

def vectorized(a):
    return np.where(a < 0.5, 1, 0)

# We call the numba function to precompile it before we measure it
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))
start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))
This results in the following comparison.
That is close, but the vectorized approach is a bit faster.
Step 3: Compare Numba with lambda functions
I am very curious about this. Lambda functions are controversial in Python, and many are not happy with them, as their syntax is not well aligned with the rest of the language. On the other hand, lambda functions have the advantage that you can pass them down into a library that can optimize away the for-loops.
import numpy as np
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
iterations = 1000

@jit(nopython=True)
def numba(a):
    c = np.zeros((size, size))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c

def lambda_run(a, func=lambda v: v + 1):
    # The lambda is passed down as an argument and applied to the
    # whole array at once, letting NumPy handle the loop
    # (note: NumPy arrays have no .apply method, unlike pandas)
    return func(a)

# Call the numba function to precompile it before time measurement
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))
start = time.time()
for _ in range(iterations):
    z = lambda_run(x)
end = time.time()
print("Elapsed (lambda) = %s" % (end - start))
Resulting in the following performance comparison.
This is again tight, but the lambda approach is still a bit faster.
Remember, this is a simple lambda function, and we cannot conclude that lambda functions in general are faster than using Numba.
Conclusion
The learning since the last tutorial is that we have found an example where simple vectorization is slower than Numba. This leads to the conclusion that performance highly depends on the task. Further, the lambda function seems to give promising performance. Again, both should be compared to the slow approach of a plain Python for-loop without Numba's just-in-time compiled machine code.
You just want your code to run fast, right? Numba is a just-in-time compiler for Python that works amazingly well with NumPy. Does that mean we should always use Numba?
Well, let's try some examples and learn. If you know NumPy, you know you should use vectorization to get speed. Does Numba beat that?
Step 1: Let’s learn how Numba works
Numba will compile the Python code into machine code and run it. What about the just-in-time part? It means that the first time you call the code you want turned into machine code, Numba compiles it and then runs it. Any time after that, it just runs the already compiled code.
Let’s try that.
import numpy as np
from numba import jit
import time

@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum

iterations = 1000
size = 10000
x = np.random.rand(size, size)
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))
Where you get.
Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125
Where you see the difference in runtime between the first and second call.
Oh, did you get what happened in the code? Well, if you put @jit(nopython=True) in front of a function, Numba will try to compile it and run it as machine code.
As you see above, the first call has an overhead in run-time, because it first compiles the code and then runs it. The second time, the code is already compiled and can run immediately.
Step 2: Compare Numba just-in-time code to native Python code
So let us compare how much you gain by using Numba just-in-time (@jit) in our code.
import numpy as np
from numba import jit
import time

def full_sum(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum

@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum

iterations = 1000
size = 10000
x = np.random.rand(size, size)
start = time.time()
full_sum(x)
end = time.time()
print("Elapsed (No Numba) = %s" % (end - start))
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))
Here we added a native Python function without @jit in front and compare it with the compiled one.
Elapsed (No Numba) = 38.08543515205383
Elapsed (Numba) = 0.41634082794189453
Elapsed (Numba) = 0.11176300048828125
That is some difference. Also, we have plotted a few more runs in the graph below.
It seems pretty evident.
Step 3: Comparing it with Vectorization
If you don’t know what vectorization is, we can recommend this tutorial. The point of vectorization is to move the expensive for-loops into the function call, where optimized code runs them.
That sounds a lot like what Numba can do. It can change the expensive for-loops into fast machine code.
But which one is faster?
Well, I think there are two parameters to try out. First, the size of the problem. Second, whether the number of iterations matters.
import numpy as np
from numba import jit
import time

@jit(nopython=True)
def full_sum_numba(a):
    sum = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i, j]
    return sum

def full_sum_vectorized(a):
    return a.sum()

iterations = 1000
size = 10000
x = np.random.rand(size, size)
start = time.time()
full_sum_vectorized(x)
end = time.time()
print("Elapsed (Vectorized) = %s" % (end - start))
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))
start = time.time()
full_sum_numba(x)
end = time.time()
print("Elapsed (Numba) = %s" % (end - start))
As a function of the size.
It is interesting that Numba is faster for small problem sizes, while the vectorized approach outperforms Numba for bigger sizes.
And not surprisingly, the number of iterations only makes the difference bigger.
This is not surprising, as the code in a vectorized call can be more specifically optimized than the more general purpose Numba approach.
Conclusion
Does that mean that Numba does not pay off?
No, not at all. First of all, we have only tried it against one vectorized approach, which was obviously very easy to optimize. Secondly, not all loops can be turned into vectorized code. In general it is difficult to keep state in a vectorized approach. Hence, if you need to keep track of some internal state in a loop, it can be difficult to find a vectorized alternative.
What is Markowitz Portfolio Optimization (Efficient Frontier)?
The Efficient Frontier takes a portfolio of investments and optimizes the expected return with regard to the risk. That is, it finds the optimal return for a given risk.
The data will contain the daily time series for the last 5 years from the current date.
Step 2: Calculate the CAGR, returns, and covariance
To calculate the expected return, we use the Compound Average Growth Rate (CAGR) based on the last 5 years. The CAGR is used as Investopedia suggests. An alternative that is also being used is the mean of the returns. The key thing is to have some common measure of the return.
The CAGR is calculated as follows.
CAGR = (end price / start price)^(1 / years) - 1
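As a quick sanity check of the formula, with hypothetical prices:
# Hypothetical example: a stock going from 100 to 200 over 5 years
cagr = (200 / 100) ** (1 / 5) - 1
print(cagr)  # ~0.1487, i.e. about 14.9% growth per year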
We will also calculate the covariance, as we will use it to calculate the variance of a weighted portfolio. Remember that the standard deviation is given by the following.
sigma = sqrt(variance)
A portfolio is a vector w with the balances of each stock. For example, w = [0.2, 0.3, 0.4, 0.1] says that we have 20% in the first stock, 30% in the second, 40% in the third, and 10% in the last stock. It all sums up to 100%.
This is where the power of computing comes into the picture. The idea is to just try random portfolios and see how they rate with regard to expected return and risk.
It is that simple. Make a random weighted distribution of your portfolio and plot the point of expected return (based on our CAGR) and risk (based on the standard deviation calculated from the covariance).
import matplotlib.pyplot as plt
import numpy as np

def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    exp_return.append(np.dot(w, cagr.T))
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))
plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.show()
We introduce a helper function random_weights, which returns a weighted portfolio. That is, it returns a vector with entries that sum up to one. This gives us a way to distribute our portfolio across the stocks.
Then we iterate 20,000 times (it could be any value, we just want enough points to plot the graph), where we make a random weight vector w, then calculate the expected return as the dot-product of w and cagr transposed, using NumPy's dot-product function.
What the dot-product np.dot(w, cagr.T) does is take elements pairwise from w and cagr, multiply them, and sum them up. The transpose is only about the orientation, to make the shapes fit.
The standard deviation (assigned to sigma) is calculated similarly, by the formula given in the last step: variance = w^T Cov w (computed with nested dot-products).
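As a small illustration of the dot-product with hypothetical CAGR values (not computed from the dataset):
import numpy as np

w = np.array([0.2, 0.3, 0.4, 0.1])         # portfolio weights, summing to 1
cagr = np.array([0.10, 0.05, 0.20, 0.08])  # hypothetical growth rates
# 0.2*0.10 + 0.3*0.05 + 0.4*0.20 + 0.1*0.08 = 0.123
print(np.dot(w, cagr.T))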
Running the code results in the following graph.
Returns vs risks
This shows a graph that outlines a parabola. The optimal values lie along the upper half of the parabola. Hence, given a risk, the optimal portfolio is the one on the upper border of the filled parabola.
Considerations
The Efficient Frontier gives you a way to balance your portfolio. The above code can, by trial and error, find such a portfolio, but it still leaves out some considerations.
How often should you re-balance? It has a cost to do that.
The theory behind it has some assumptions that may not hold in reality. As Investopedia points out, it assumes that asset returns follow a normal distribution, but in reality returns can be more than 3 standard deviations away. Also, the theory builds on the assumption that investors are rational in their investments, which most consider flawed, as more factors play into investment decisions.
The full source code
Below here you find the full source code from the tutorial.
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt
import numpy as np

years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)
close_price = pd.DataFrame()
tickers = ['AAPL', 'MSFT', 'IBM', 'NVDA']
for ticker in tickers:
    tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
    close_price[ticker] = tmp['Close']

returns = close_price / close_price.shift(1)
cagr = (close_price.iloc[-1] / close_price.iloc[0]) ** (1 / years) - 1
cov = returns.cov()

def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    exp_return.append(np.dot(w, cagr.T))
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))
plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.show()
Understand what the Mandelbrot set is and why it is so fascinating.
Master how to make images of the Mandelbrot set in multiple colors.
How to implement it using NumPy vectorization.
Step 1: What is Mandelbrot?
The Mandelbrot set is the set of complex numbers c for which the function f(z) = z^2 + c does not diverge when iterated from z = 0 (from Wikipedia).
Take a complex number, c, then you calculate the sequence for N iterations:
z_(n+1) = z_n^2 + c for n = 0, 1, …, N-1
If absolute(z_N) < 2, the sequence is said not to have diverged, and c is taken to be part of the Mandelbrot set.
The Mandelbrot set is a subset of the complex plane, which is colored according to whether each point belongs to the Mandelbrot set or not.
Mandelbrot set.
This only gives a black and white image of the complex plane, hence the images are often made more colorful by coloring each point by the iteration number at which it diverged. That is, if z_4 diverged for a point in the complex plane, that point is given the color 4. That is how you end up with colorful maps like this.
Mandelbrot set (made by program from this tutorial).
Step 2: Understand the code of the non-vectorized approach to compute the Mandelbrot set
To better understand the images of the Mandelbrot set, think of the complex numbers as a diagram where the real part of the complex number is on the x-axis and the imaginary part on the y-axis (also called the Argand diagram).
Argand diagram
Then each point is a complex number c. That complex number will be given a color depending on the iteration in which it diverges (if it is not part of the Mandelbrot set).
Now the pseudocode for that should be easy to digest.
for x in [-2, 2] do:
    for y in [-1.5, 1.5] do:
        c = x + i*y
        z = 0
        N = 0
        while absolute(z) < 2 and N < MAX_ITERATIONS:
            z = z^2 + c
            N = N + 1
        set color for x, y to N
Simple enough to understand. That is some of the beauty of it. The simplicity.
Step 3: Make a vectorized version of the computations
Now that we understand the concepts behind it, we can translate it into a vectorized version. If you are new to vectorization we recommend you read this tutorial first.
What do we achieve with vectorization? That we compute all the complex numbers simultaneously. To understand that inspect the initialization of all the points here.
import numpy as np

def mandelbrot(height, width, x_from=-2, x_to=1, y_from=-1.5, y_to=1.5, max_iterations=100):
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y
You see that we initialize all the x-coordinates at once using linspace. It creates an array of width evenly spaced numbers from x_from to x_to. The reshape is to fit the plane.
The same happens for y.
Then all the complex numbers are created in c = x + 1j*y, where 1j is the imaginary unit.
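To see the broadcasting at work on a small scale (a 3x4 grid instead of a full image):
import numpy as np

x = np.linspace(-2, 1, 4).reshape((1, 4))      # row of real parts
y = np.linspace(-1.5, 1.5, 3).reshape((3, 1))  # column of imaginary parts
c = x + 1j * y                                 # broadcasting gives a 3x4 complex grid
print(c.shape)  # (3, 4)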
This leads us to the full implementation.
There are two things we need to keep track of to make a colorful Mandelbrot set. First, in which iteration each point diverged. Second, which points have already diverged, so we stop updating them.
import numpy as np
import matplotlib.pyplot as plt

def mandelbrot(height, width, x=-0.5, y=0, zoom=1, max_iterations=100):
    # To make navigation easier we calculate these values
    x_width = 1.5
    y_height = 1.5*height/width
    x_from = x - x_width/zoom
    x_to = x + x_width/zoom
    y_from = y - y_height/zoom
    y_to = y + y_height/zoom
    # Here the actual algorithm starts
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y
    # Initialize z to all zero
    z = np.zeros(c.shape, dtype=np.complex128)
    # To keep track of in which iteration the point diverged
    div_time = np.zeros(z.shape, dtype=int)
    # To keep track of which points did not diverge so far
    m = np.full(c.shape, True, dtype=bool)
    for i in range(max_iterations):
        z[m] = z[m]**2 + c[m]
        diverged = np.greater(np.abs(z), 2, out=np.full(c.shape, False), where=m)  # Find diverging points
        div_time[diverged] = i  # Set the iteration number in which they diverged
        m[np.abs(z) > 2] = False  # Remember which points have diverged
    return div_time

# Default image of Mandelbrot set
plt.imshow(mandelbrot(800, 1000), cmap='magma')
# The first image below of the Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -0.75, 0.0, 2, 200), cmap='magma')
# The second image below of the Mandelbrot set
# plt.imshow(mandelbrot(800, 1000, -1, 0.3, 20, 500), cmap='magma')
plt.show()
Notice that z[m] = z[m]**2 + c[m] only computes updates for values that have not yet diverged.
I have added the following two images from the code above (the one not commented out is shown in the previous step).
Mandelbrot sets from the code above.
According to Wikipedia, the Sexual Compulsivity Scale (SCS) is a psychometric measure of high libido, hypersexuality, and sexual addiction. While it does not say anything about the score itself, it is based on people rating 10 questions from 1 to 4.
The questions are the following.
Q1. My sexual appetite has gotten in the way of my relationships.
Q2. My sexual thoughts and behaviors are causing problems in my life.
Q3. My desires to have sex have disrupted my daily life.
Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.
Q5. I sometimes get so horny I could lose control.
Q6. I find myself thinking about sex while at work.
Q7. I feel that sexual thoughts and feelings are stronger than I am.
Q8. I have to struggle to control my sexual thoughts and behavior.
Q9. I think about sex more than I would like to.
Q10. It has been difficult for me to find sex partners who desire having sex as much as I want to.
The questions are rated as follows (1=Not at all like me, 2=Slightly like me, 3=Mainly like me, 4=Very much like me).
A dataset of more than 3,300 responses can be found here; it includes the individual rating of each question, the total score (the sum of ratings), age, and gender.
Step 1: First inspection of the data.
Inspection of the data (CSV file)
The first question that pops into my mind is how men and women rate themselves differently. How can we efficiently figure that out?
Welcome to NumPy. It has a built-in CSV reader that does all the hard work in the genfromtxt function.
import numpy as np
data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')
# Skip first row as it has description
data = data[1:]
men = data[data[:,11] == 1]
women = data[data[:,11] == 2]
print("Men average", men.mean(axis=0))
print("Women average", women.mean(axis=0))
Dividing into men and women is easy with NumPy, as you can use a vectorized conditional to index the dataset. Men are coded with 1 and women with 2 in column 11 (the 12th column). Finally, a call to mean will do the rest.
Men average [ 2.30544662 2.2453159 2.23485839 1.92636166 2.17124183 3.06448802
2.19346405 2.28496732 2.43660131 2.54204793 23.40479303 1.
32.54074074]
Women average [ 2.30959164 2.18993352 2.19088319 1.95916429 2.38746439 3.13010446
2.18518519 2.2991453 2.4985755 2.43969611 23.58974359 2.
27.52611586]
Interestingly, according to this dataset (whose accuracy should be taken with a grain of salt, as 21% of the answers were not used), women score slightly higher on the SCS than men.
Men rate highest on the following question:
Q6. I find myself thinking about sex while at work.
While women rate highest on this question:
Q6. I find myself thinking about sex while at work.
The same. Also the lowest is the same for both genders.
Q4. I sometimes fail to meet my commitments and responsibilities because of my sexual behaviors.
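As an aside, the gender split used above is plain boolean-mask indexing; here is a minimal toy example with made-up numbers (not the survey data):
import numpy as np

# Toy data: [score, gender] with gender coded 1 and 2
data = np.array([[10, 1], [20, 2], [30, 1]])
men = data[data[:, 1] == 1]  # rows where the gender column equals 1
print(men.mean(axis=0))      # column-wise averages: [20.  1.]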
Step 2: Visualize age vs score
I would guess that the SCS score decreases with age. Let’s see if that is the case.
Again, NumPy can do the magic easily, that is, prepare the data. To visualize it we use matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python.
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')
# Skip first row as it has description
data = data[1:]
score = data[:,10]
age = data[:,12]
age[age > 100] = 0
plt.scatter(age, score, alpha=0.05)
plt.show()
Resulting in this plot.
Age vs SCS score.
It actually does not look like there is any correlation. Remember, there are more young people responding to the survey.
Let's ask NumPy what it thinks about the correlation here. Luckily we can do that by calling the corrcoef function, which calculates the Pearson product-moment correlation coefficients.
print("Correlation of age and SCS score:", np.corrcoef(age, score))
Resulting in this output.
Correlation of age and SCS score:
[[1. 0.01046882]
[0.01046882 1. ]]
Saying no correlation: as 0.0-0.3 counts as a small correlation, 0.01046882 is close to none. Does that mean that the SCS score does not correlate with age? That our SCS score is static through life?
I do not think we can conclude that based on this small dataset.
Step 3: Bar plot the distribution of scores
The graph we plotted also suggested a close to even distribution of scores.
Let's try to see that. Here we need to count participants by score. NumPy falls a bit short here, but let's keep the good mood and use plain old Python lists.
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('scs.csv', delimiter=',', dtype='int')
# Skip first row as it has description
data = data[1:]
scores = []
numbers = []
for i in range(10, 41):
    numbers.append(i)
    scores.append(data[data[:, 10] == i].shape[0])
plt.bar(numbers, scores)
plt.show()
Resulting in this bar plot.
Count participants by score.
We knew that the average score was around 23, which could suggest a potentially even distribution. But it seems to be a little lower in the far high end of the SCS scores.
Narcissism is a personality trait generally conceived of as excessive self-love. In Greek mythology Narcissus was a man who fell in love with his reflection in a pool of water.
The only connection between the NPI and NumPy is that we want to analyze the 11,000+ answers.
The dataset can be downloaded here; it consists of a comma separated values (CSV) file and a description.
Step 1: Import the dataset and explore it
NumPy has thought of it for us: loading the dataset (from the link above) is as simple as magic.
import numpy as np
# This magic line loads the 11,000+ lines of data into an ndarray
data = np.genfromtxt('data.csv', delimiter=',', dtype='int')
# Skip first row
data = data[1:]
print(data)
A good idea is to investigate it from a spreadsheet as well.
Spreadsheet
And the far end.
Spreadsheet
Oh, that end.
Then investigate the description from the dataset. (Here we have some of it).
For questions 1=40 which choice they chose was recorded per the following key.
... [The questions Q1 ... Q40]
...
gender. Chosen from a drop down list (1=male, 2=female, 3=other; 0=none was chosen).
age. Entered as a free response. Ages below 14 have been ommited from the dataset.
-- CALCULATED VALUES --
elapse. (time submitted)-(time loaded) of the questions page in seconds.
score. = ((int) $_POST['Q1'] == 1)
... [How it is calculated]
That means we have the score, the answers to the questions, the elapsed time to answer, gender, and age.
Reading a bit more, it says that a high score is an indicator of narcissistic traits, but one should not conclude that a high scorer is a narcissist.
Step 2: Men or Women highest NPI?
I’m glad you asked.
import numpy as np
data = np.genfromtxt('data.csv', delimiter=',', dtype='int')
# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]
print("Average score", npi_score.mean())
print("Men average", npi_score[data[:,42] == 1].mean())
print("Women average", npi_score[data[:,42] == 2].mean())
print("None average", npi_score[data[:,42] == 0].mean())
print("Other average", npi_score[data[:,42] == 3].mean())
Before looking at the result, see how nicely the first column is sliced out into the view npi_score. Then notice how easily you can calculate the mean based on a conditional rule that narrows the view.
Average score 13.29965311749533
Men average 14.195953307392996
Women average 12.081829626521191
None average 11.916666666666666
Other average 14.85
I guess you guessed it. Men score higher.
Step 3: Is there a correlation between age and NPI score?
I wonder about that too.
How can we figure that out? Wait, let’s ask our new friend NumPy.
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt('data.csv', delimiter=',', dtype='int')
# Skip first row
data = data[1:]
# Extract all the NPI scores (first column)
npi_score = data[:,0]
age = data[:,43]
# Some age values are not real, so we adjust them to 0
age[age>100] = 0
# Scatter plot them all with alpha=0.05
plt.scatter(age, npi_score, color='r', alpha=0.05)
plt.show()
Resulting in.
Plotting age vs NPI
That looks promising. But can we just conclude that younger people score higher NPI?
What if most respondents are young? Then that would make the picture denser in the younger end (15-30). The danger with judging by eye is jumping to conclusions.
Luckily, NumPy can help us there as well.
print(np.corrcoef(npi_score, age))
Resulting in.
Correlation of NPI score and age:
[[ 1. -0.23414633]
[-0.23414633 1. ]]
What does that mean? Well, looking at the documentation of np.corrcoef(), it returns the Pearson product-moment correlation coefficients, and a value of -0.23 indicates a weak negative correlation between age and NPI score.
How to collect data from an HTML table into a Pandas DataFrame.
The cleaning process and how to convert the data into the correct type.
Also, dealing with some data points that are not in correct representation.
Finally, how to sum up by countries.
Step 1: Collect the data from the table
Pandas is an amazing library with a lot of useful data analysis functionality right out of the box. The first step in any data analysis is to collect the data. In this tutorial we will collect the data from Wikipedia's page on the List of metro systems.
If you are new to the pandas library we recommend you read this tutorial.
The objective will be to find the sums of Stations, System length, and Annual ridership per country.
From wikipedia.org
At first glance this looks simple, but looking further down we see that some countries have multiple rows.
From wikipedia.org
Also, some rows do not have all the values needed.
First challenge first: read the data from the table into a DataFrame, which is the main data structure of the pandas library. The read_html call from pandas will return a list of DataFrames.
If you use read_html for the first time, we recommend you read this tutorial.
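A minimal version of that first read could look like this (the read also appears in the full code later in this step):
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
# read_html returns a list of DataFrames, one per table on the page
tables = pd.read_html(url)
table = tables[0]
print(table)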
Which results in the following output (or the top of it).
City Country Name Yearopened Year of lastexpansion Stations System length Annual ridership(millions)
0 Algiers Algeria Algiers Metro 2011[13] 2018[14] 19[14] 18.5 km (11.5 mi)[15] 45.3 (2019)[R 1]
1 Buenos Aires Argentina Buenos Aires Underground 1926[Nb 1] 2019[16] 90[17] 56.7 km (35.2 mi)[17] 337.7 (2018)[R 2]
2 Yerevan Armenia Yerevan Metro 1981[18] 1996[19] 10[18] 13.4 km (8.3 mi)[18] 18.7 (2018)[R 3]
3 Sydney Australia Sydney Metro 2019[20] – 13[20] 36 km (22 mi)[20][21] 14.2 (2019) [R 4][R Nb 1]
4 Vienna Austria Vienna U-Bahn 1976[22][Nb 2] 2017[23] 98[24] 83.3 km (51.8 mi)[22] 463.1 (2018)[R 6]
5 Baku Azerbaijan Baku Metro 1967[25] 2016[25] 25[25] 36.6 km (22.7 mi)[25] 231.0 (2018)[R 3]
We now have the data in a DataFrame.
Step 2: Clean and convert the data
At first glance, we see that we do not need the columns City, Name, Yearopened, and Year of lastexpansion. To make it easier to work with the data, let's remove them (as shown below) and inspect the data again.
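The columns can be dropped with a one-liner (also part of the full code below):
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
print(table)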
Country Stations System length Annual ridership(millions)
0 Algeria 19[14] 18.5 km (11.5 mi)[15] 45.3 (2019)[R 1]
1 Argentina 90[17] 56.7 km (35.2 mi)[17] 337.7 (2018)[R 2]
2 Armenia 10[18] 13.4 km (8.3 mi)[18] 18.7 (2018)[R 3]
3 Australia 13[20] 36 km (22 mi)[20][21] 14.2 (2019) [R 4][R Nb 1]
4 Austria 98[24] 83.3 km (51.8 mi)[22] 463.1 (2018)[R 6]
5 Azerbaijan 25[25] 36.6 km (22.7 mi)[25] 231.0 (2018)[R 3]
6 Belarus 29[27] 37.3 km (23.2 mi)[27] 283.4 (2018)[R 3]
7 Belgium 59[28][Nb 5] 39.9 km (24.8 mi)[29] 165.3 (2019)[R 7]
This makes it easier to see the next steps.
Let's take them one by one. For Stations we need to remove the data after the '['-symbol and convert the number to an integer. This can be done by applying a lambda function to each row, as shown after the next note.
If you are new to lambda functions we recommend you read this tutorial.
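The Stations conversion then looks like this:
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)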
The next thing we need to do is to convert the System length to floats. The length will be in km (I live in Denmark, where we use km and not mi). This can also be done with a lambda function.
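table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)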
Finally, and a bit more tricky, we need to convert the column Annual ridership. The challenge is that some lines have n/a, which is converted to np.nan, while other lines have input that is not easy to convert, as the images show.
From wikipedia.org
These lines can be dealt with by using a helper function.
def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
Adding this all together we get the following code.
import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
print(table)
Which results in the following output (or the first few lines).
Country Stations System length Annual ridership(millions)
0 Algeria 19 18.50 45.30
1 Argentina 90 56.70 337.70
2 Armenia 10 13.40 18.70
3 Australia 13 36.00 14.20
4 Austria 98 83.30 463.10
5 Azerbaijan 25 36.60 231.00
6 Belarus 29 37.30 283.40
7 Belgium 59 39.90 165.30
8 Brazil 19 28.10 58.40
9 Brazil 25 42.40 42.80
10 Brazil 22 43.80 51.70
Step 3: Sum rows by country
Say we now want to get the country with the most metro stations. This can be achieved using the groupby and sum functions of the pandas DataFrame.
import pandas as pd
import numpy as np

def to_float(obj):
    try:
        return float(obj)
    except:
        return np.nan

url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
tables = pd.read_html(url)
table = tables[0]
table = table.drop(['City', 'Name', 'Yearopened', 'Year of lastexpansion'], axis=1)
table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)
index = 'Annual ridership(millions)'
table[index] = table.apply(lambda row: to_float(row[index].split()[0]) if row[index] is not np.nan else np.nan, axis=1)
# Sum up per country
table_sum = table.groupby(['Country']).sum()
print(table_sum.sort_values(['Stations'], ascending=False))
Where the result will be China.
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
United States 1005 1325.90 2771.50
South Korea 714 839.90 4054.90
Japan[Nb 34] 669 791.20 6489.60
India 499 675.97 1377.00
France 483 350.90 2113.50
Spain 438 474.40 1197.90
If we want to sort by System length in km, we only need to change the last line to the following.
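print(table_sum.sort_values(['System length'], ascending=False))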
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
United States 1005 1325.90 2771.50
South Korea 714 839.90 4054.90
Japan[Nb 34] 669 791.20 6489.60
India 499 675.97 1377.00
Russia 368 611.50 3507.60
United Kingdom 390 523.90 1555.30
Finally, if you want it sorted by Annual ridership, you need to change the last line to the following.
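print(table_sum.sort_values([index], ascending=False))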
Remember, we assigned that to index. You should get the following output.
Stations System length Annual ridership(millions)
Country
China 3738 6312.16 25519.23
Japan[Nb 34] 669 791.20 6489.60
South Korea 714 839.90 4054.90
Russia 368 611.50 3507.60
United States 1005 1325.90 2771.50
France 483 350.90 2113.50
Brazil 243 345.40 2106.20
import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
print(table)
The call to read_html returns all the tables on the page in a list. By inspecting the results you will notice that we are interested in tables 9, 12, and 15, which the code above merges into one.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])
row = table.iloc[1]
X = table.columns[1:].to_numpy().reshape(-1, 1)
X = X.astype(int)
Y = 1 + row.iloc[1:].pct_change()
Y = Y.cumprod().fillna(1.0).to_numpy()
Y = Y.reshape(-1, 1)
regr = LinearRegression()
regr.fit(X, Y)
Y_pred = regr.predict(X)
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
Which will result in the following plot.
Linear regression model applied on data from wikipedia.org
Which shows that the model approximates a line through the 30 years of data to estimate the growth of the country’s GDP.
Notice that we use the cumulative product (cumprod) of pct_change to be able to compare the data across countries. If we used the GDP figures directly, it would not be possible to compare countries of different sizes.
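A small example of what the pct_change and cumprod combination produces, with made-up GDP figures:
import pandas as pd

gdp = pd.Series([100.0, 110.0, 99.0])  # hypothetical GDP figures
growth = (1 + gdp.pct_change()).cumprod().fillna(1.0)
print(growth.tolist())  # approximately [1.0, 1.1, 0.99] -- growth relative to the first year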
We will do that for all countries to get a view of the growth. We use the coefficient of the fitted line, which indicates the growth rate.
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []
for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)
    regr = LinearRegression()
    regr.fit(X, Y)
    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])
data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])
print(data)
Which results in the following output (or the first few lines).
Country Coef
0 Afghanistan 0.161847
1 Albania 0.243493
2 Algeria 0.103907
3 Angola 0.423919
4 Antigua and Barbuda 0.087863
5 Argentina 0.090837
6 Armenia 4.699598
Step 3: Merge the data to a leaflet map using folium
The last step is to merge the data together with the leaflet map using the folium library. If you are new to folium we recommend you read this tutorial.
import pandas as pd
import folium
import geopandas
from sklearn.linear_model import LinearRegression
import numpy as np

# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# Merge the tables into one table
merge_index = 'Country (or dependent territory)'
table = tables[9].merge(tables[12], how="left", left_on=[merge_index], right_on=[merge_index])
table = table.merge(tables[15], how="left", left_on=[merge_index], right_on=[merge_index])

coef = []
countries = []
for index, row in table.iterrows():
    X = table.columns[1:].to_numpy().reshape(-1, 1)
    X = X.astype(int)
    Y = 1 + row.iloc[1:].pct_change()
    Y = Y.cumprod().fillna(1.0).to_numpy()
    Y = Y.reshape(-1, 1)
    regr = LinearRegression()
    regr.fit(X, Y)
    coef.append(regr.coef_[0][0])
    countries.append(row[merge_index])
data = pd.DataFrame(list(zip(countries, coef)), columns=['Country', 'Coef'])

# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Replace United States of America with United States to fit the naming in the table
world = world.replace('United States of America', 'United States')
# Merge the two DataFrames together
table = world.merge(data, how="left", left_on=['name'], right_on=['Country'])
# Clean data: remove rows with no data
table = table.dropna(subset=['Coef'])
# We have 10 colors available resulting in 9 cuts
table['Cat'] = pd.qcut(table['Coef'], 9, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])
print(table)
# Create a map
my_map = folium.Map()
# Add the data
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=['Country', 'Cat'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Growth of GDP since 1990',
    threshold_scale=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
).add_to(my_map)
my_map.save('gdp_growth.html')
There is a twist in the way it is done. Instead of using the raw coefficient of the linear model directly on the map, we put the countries into categories. The reason is that otherwise most countries would group in a small segment of the color scale.
Here we have used qcut to put them into equal-sized groups.
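A tiny demonstration of qcut with made-up values:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
# Split into three equal-sized groups labeled 0, 1, 2
print(pd.qcut(s, 3, labels=[0, 1, 2]).tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]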
This should result in an interactive html page looking something like this.
To read the content you can use the read_html(url) call from the pandas library. You need to install lxml as well; see this post for details.
import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in the first table
table = tables[0]
print(table[:20])
Which will result in the following output.
Country/Region Average male height ... Year Source
0 Albania 174.0 cm (5 ft 8 1⁄2 in) ... 2008–2009 [11][12]
1 Argentina NaN ... 2004–2005 [13]
2 Argentina 174.46 cm (5 ft 8 1⁄2 in) ... 1998–2001 [14]
3 Armenia NaN ... 2005 [15]
4 Australia 175.6 cm (5 ft 9 in) ... 2011–2012 [16]
5 Austria 179 cm (5 ft 10 1⁄2 in) ... 2006 [17]
6 Azerbaijan 171.8 cm (5 ft 7 1⁄2 in) ... 2005 [18]
7 Bahrain 165.1 cm (5 ft 5 in) ... 2002 [19]
8 Bahrain 171.0 cm (5 ft 7 1⁄2 in) ... 2009 [20][21]
9 Bangladesh NaN ... 2007 [15]
10 Country/Region Average male height ... Year Source
11 Belgium 178.6 cm (5 ft 10 1⁄2 in) ... 2001 [22]
12 Benin NaN ... 2006 [15]
13 Bolivia NaN ... 2003 [15]
14 Bolivia 160.0 cm (5 ft 3 in) ... 1970 [23]
15 Bosnia and Herzegovina 183.9 cm (6 ft 0 in) ... 2014 [24]
16 Brazil 170.7 cm (5 ft 7 in) ... 2009 [25][26]
17 Brazil – Urban 173.5 cm (5 ft 8 1⁄2 in) ... 2009 [25]
18 Brazil – Rural 170.9 cm (5 ft 7 1⁄2 in) ... 2009 [25]
19 Bulgaria 175.2 cm (5 ft 9 in) ... 2010 [27]
Where you, by inspecting line 10 of the output, see a row of input that needs to be cleaned.
Step 2: Some basic cleaning of the data
By inspecting the data you see that roughly every 10 lines a row repeats the column names.
From wikipedia.org
While this is practical when you inspect the data as a user, it is annoying when we want to use the raw data.
Luckily this is easy to clean up using pandas.
import pandas as pd
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in the first table
table = tables[0]
# To avoid writing it all the time
AVG_MH = 'Average male height'
# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()
print(table[:20])
Where you can see that the repeated header rows have been cleaned out.
Country/Region Average male height ... Year Source
0 Albania 174.0 cm (5 ft 8 1⁄2 in) ... 2008–2009 [11][12]
1 Argentina NaN ... 2004–2005 [13]
2 Argentina 174.46 cm (5 ft 8 1⁄2 in) ... 1998–2001 [14]
3 Armenia NaN ... 2005 [15]
4 Australia 175.6 cm (5 ft 9 in) ... 2011–2012 [16]
5 Austria 179 cm (5 ft 10 1⁄2 in) ... 2006 [17]
6 Azerbaijan 171.8 cm (5 ft 7 1⁄2 in) ... 2005 [18]
7 Bahrain 165.1 cm (5 ft 5 in) ... 2002 [19]
8 Bahrain 171.0 cm (5 ft 7 1⁄2 in) ... 2009 [20][21]
9 Bangladesh NaN ... 2007 [15]
11 Belgium 178.6 cm (5 ft 10 1⁄2 in) ... 2001 [22]
12 Benin NaN ... 2006 [15]
13 Bolivia NaN ... 2003 [15]
14 Bolivia 160.0 cm (5 ft 3 in) ... 1970 [23]
15 Bosnia and Herzegovina 183.9 cm (6 ft 0 in) ... 2014 [24]
16 Brazil 170.7 cm (5 ft 7 in) ... 2009 [25][26]
17 Brazil – Urban 173.5 cm (5 ft 8 1⁄2 in) ... 2009 [25]
18 Brazil – Rural 170.9 cm (5 ft 7 1⁄2 in) ... 2009 [25]
19 Bulgaria 175.2 cm (5 ft 9 in) ... 2010 [27]
20 Burkina Faso NaN ... 2003 [15]
Step 3: Convert data to floats
Inspecting the data we need (Average male height), we see it is represented as a string containing both the cm and the ft/in figure. As I live in Denmark, where we use the metric system, I have never really understood the benefit of the US customary units (feel free to enlighten me).
Hence, we want to convert the strings in the column Average male height to a float representing the height in cm.
Notice that some entries are NaN, while the rest have the height in cm as the first number in the string.
We can exploit that and convert it with a lambda function. If you are new to lambda functions you can see this tutorial.
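As a minimal sketch of the conversion, using a sample string in the same format as the table, the first space-separated token is the cm figure.
# Sample string copied from the table format; only for illustration
value = '174.0 cm (5 ft 8 1⁄2 in)'
# The first space-separated token is the height in cm
print(float(value.split(' ')[0]))  # 174.0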
import pandas as pd
import numpy as np
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in the first table
table = tables[0]
# To avoid writing it all the time
AVG_MH = 'Average male height'
AMH_F = 'Average male height (float)'
# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()
# Clean up data to have height in cm
table[AMH_F] = table.apply(
    lambda row: float(row[AVG_MH].split(' ')[0]) if pd.notna(row[AVG_MH]) else np.nan,
    axis=1)
print(table[:20])
Resulting in the following.
Country/Region ... Average male height (float)
0 Albania ... 174.00
1 Argentina ... NaN
2 Argentina ... 174.46
3 Armenia ... NaN
4 Australia ... 175.60
5 Austria ... 179.00
6 Azerbaijan ... 171.80
7 Bahrain ... 165.10
8 Bahrain ... 171.00
9 Bangladesh ... NaN
11 Belgium ... 178.60
12 Benin ... NaN
13 Bolivia ... NaN
14 Bolivia ... 160.00
15 Bosnia and Herzegovina ... 183.90
16 Brazil ... 170.70
17 Brazil – Urban ... 173.50
18 Brazil – Rural ... 170.90
19 Bulgaria ... 175.20
20 Burkina Faso ... NaN
Notice that np.nan is itself a float, hence the full column Average male height (float) consists of floats.
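A quick check (a tiny sketch, nothing more) confirms this.
import numpy as np
# np.nan is a plain Python float, so a column mixing NaN and converted heights stays numeric
print(type(np.nan))               # <class 'float'>
print(isinstance(np.nan, float))  # True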
Step 4: Merge two sets of data with different representations of countries
To make the map in the end we will use the geopandas library, which has a nice low resolution dataset used to color countries. While the geopandas data is also represented as a DataFrame, it is difficult to merge with the DataFrame we created from the read_html call, as the two use different names for some countries.
An example is United States in the table we created versus United States of America in the geopandas data. Hence, we need some means to map them to the same representation. For that we use the pycountry library, which can look up a country's alpha_3 code from its name.
Applying that mapping to both DataFrames, we can merge them on the alpha_3 code.
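Before the full script, a minimal sketch of what pycountry gives us (assuming pycountry is installed; a few names may still not resolve and fall through to the except branch in the helper below).
import pycountry
# Both spellings resolve to the same alpha_3 code, which is what makes the merge possible
print(pycountry.countries.lookup('United States').alpha_3)             # USA
print(pycountry.countries.lookup('United States of America').alpha_3)  # USA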
import pandas as pd
import numpy as np
import geopandas
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in the first table
table = tables[0]
# To avoid writing it all the time
AVG_MH = 'Average male height'
CR = 'Country/Region'
COUNTRY = 'Country'
AMH_F = 'Average male height (float)'
A3 = 'alpha3'
# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()
# Clean up data to have height in cm
table[AMH_F] = table.apply(
    lambda row: float(row[AVG_MH].split(' ')[0]) if pd.notna(row[AVG_MH]) else np.nan,
    axis=1)
# Strip the region part if the name contains a dash (e.g. 'Brazil – Urban')
table[COUNTRY] = table.apply(
    lambda row: row[CR].split(' – ')[0] if ' – ' in row[CR] else row[CR],
    axis=1)
# Map the country name to the alpha3 representation
table[A3] = table.apply(lambda row: lookup_country_code(row[COUNTRY]), axis=1)
# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Do the same mapping to alpha3
world[A3] = world.apply(lambda row: lookup_country_code(row['name']), axis=1)
# Merge the data
table = world.merge(table, how="left", left_on=[A3], right_on=[A3])
# Remove countries with no data
table = table.dropna(subset=[AMH_F])
# These lines are just used to get the full data
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
print(table)
Which will result in the following.
pop_est continent name iso_a3 gdp_md_est geometry alpha3 Country/Region Average male height Average female height Stature ratio (male to female) Sample population / age range Share of pop. over 18 covered[9][10] Methodology Year Source Average male height (float) Country
3 35623680 North America Canada CAN 1674000.0 MULTIPOLYGON (((-122.84000 49.00000, -122.9742... CAN Canada 175.1 cm (5 ft 9 in) 162.3 cm (5 ft 4 in) 1.08 18–79 94.7% Measured 2007–2009 [29] 175.10 Canada
4 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA United States 175.3 cm (5 ft 9 in) 161.5 cm (5 ft 3 1⁄2 in) 1.09 All Americans, 20+ (N= m:5,232 f:5,547, Median... 69% Measured 2011–2014 [132] 175.30 United States
5 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA United States – African Americans 175.5 cm (5 ft 9 in) 162.6 cm (5 ft 4 in) 1.08 African Americans, 20–39 (N= m:532 f:612, Medi... 3.4%[133] Measured 2015-2016 [134] 175.50 United States
6 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA United States – Hispanic and Latino Americans 169.5 cm (5 ft 6 1⁄2 in) 156.7 cm (5 ft 1 1⁄2 in) 1.08 Hispanic/Latin-Americans, 20–39 (N= m:745 f:91... 4.4%[133] Measured 2015–2016 [134] 169.50 United States
7 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA United States – Mexican Americans 168.8 cm (5 ft 6 1⁄2 in) 156.1 cm (5 ft 1 1⁄2 in) 1.09 Mexican Americans, 20–39 (N= m:429 f:511, Medi... 2.8%[133] Measured 2015–2016 [134] 168.80 United States
8 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA United States – Asian Americans 169.7 cm (5 ft 7 in) 156.2 cm (5 ft 1 1⁄2 in) 1.09 Non-Hispanic Asians, 20–39 (N= m:323 f:326, Me... 1.3%[133] Measured 2015–2016 [134] 169.70 United States
9 326625791 North America United States of America USA 18560000.0 MULTIPOLYGON (((-122.84000 49.00000, -120.0000... USA United States – Non-Hispanic whites 177.0 cm (5 ft 9 1⁄2 in) 163.3 cm (5 ft 4 1⁄2 in) 1.08 Non-Hispanic White Americans, 20–39 (N= m:892 ... 17.1%[133] Measured 2015–2016 [134] 177.00 United States
13 260580739 Asia Indonesia IDN 3028000.0 MULTIPOLYGON (((141.00021 -2.60015, 141.01706 ... IDN Indonesia 158 cm (5 ft 2 in) 147 cm (4 ft 10 in) 1.07 50+ (N= m:2,041 f:2,396, Median= m:158 cm (5 f... 22.5% Self-reported 1997 [59] 158.00 Indonesia
15 44293293 South America Argentina ARG 879400.0 MULTIPOLYGON (((-68.63401 -52.63637, -68.25000... ARG Argentina 174.46 cm (5 ft 8 1⁄2 in) 161.01 cm (5 ft 3 1⁄2 in) 1.08 Healthy, 18 (N= m:90 f:97, SD= m:7.43 cm (3 in... 2.9% Measured 1998–2001 [14] 174.46 Argentina
16 17789267 South America Chile CHL 436100.0 MULTIPOLYGON (((-68.63401 -52.63637, -68.63335... CHL Chile 169.6 cm (5 ft 7 in) 156.1 cm (5 ft 1 1⁄2 in) 1.09 15+ 107.2% Measured 2009–2010 [30] 169.60 Chile
19 47615739 Africa Kenya KEN 152700.0 POLYGON ((39.20222 -4.67677, 37.76690 -3.67712... KEN Kenya 169.6 cm (5 ft 7 in) NaN NaN 25–49 (N= f:1,600, SD= f:6.3 cm (2 1⁄2 in)) 53.7% Summary 2016 [69] 169.60 Kenya
20 47615739 Africa Kenya KEN 152700.0 POLYGON ((39.20222 -4.67677, 37.76690 -3.67712... KEN Kenya 169.6 cm (5 ft 7 in) 158.2 cm (5 ft 2 1⁄2 in) NaN 25–49 (N= f:4,856, SD= f:7.3 cm (3 in)) 52.5% Survey 2016 [15][69] 169.60 Kenya
25 142257519 Europe Russia RUS 3745000.0 MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... Russia Russia 171.1 cm (5 ft 7 1⁄2 in) 158.2 cm (5 ft 2 1⁄2 in) 1.08 44-69 (N= m: 3892 f: 4643) 38.5% Measured 2007 [93] 171.10 Russia
26 142257519 Europe Russia RUS 3745000.0 MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... Russia Russia 177.2 cm (5 ft 10 in) 164.1 cm (5 ft 4 1⁄2 in) 1.08 24 1.9% Measured 2004 [21][98] 177.20 Russia
29 5320045 Europe Norway -99 364700.0 MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... NOR Norway 179.7 cm (5 ft 10 1⁄2 in) 167.1 cm (5 ft 6 in) 1.09 Conscripts, 18–44 (N= m:30,884 f:28,796) 35.3% Measured 2012 [88] 179.70 Norway
30 5320045 Europe Norway -99 364700.0 MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... NOR Norway 179.7 cm (5 ft 10 1⁄2 in) 167 cm (5 ft 5 1⁄2 in) 1.08 20–85 (N= m:1534 f:1743) 93.6% Self-reported 2008–2009 [9][26][89] 179.70 Norway
34 54841552 Africa South Africa ZAF 739100.0 POLYGON ((16.34498 -28.57671, 16.82402 -28.082... ZAF South Africa 168 cm (5 ft 6 in) 159 cm (5 ft 2 1⁄2 in) 1.06 19 (N= m:121 f:118) 3.6% Measured 2003 [110] 168.00 South Africa
36 124574795 North America Mexico MEX 2307000.0 POLYGON ((-117.12776 32.53534, -115.99135 32.6... MEX Mexico 172 cm (5 ft 7 1⁄2 in) 159 cm (5 ft 2 1⁄2 in) 1.08 20–65 62.0% Measured 2014 [83] 172.00 Mexico
37 3360148 South America Uruguay URY 73250.0 POLYGON ((-57.62513 -30.21629, -56.97603 -30.1... URY Uruguay 170 cm (5 ft 7 in) 158 cm (5 ft 2 in) 1.08 Adults (N= m:2,249 f:2,114) NaN Measured 1990 [135] 170.00 Uruguay
38 207353391 South America Brazil BRA 3081000.0 POLYGON ((-53.37366 -33.76838, -53.65054 -33.2... BRA Brazil 170.7 cm (5 ft 7 in) 158.8 cm (5 ft 2 1⁄2 in) 1.07 18+ (N= m:62,037 f:65,696) 100.0% Measured 2009 [25][26] 170.70 Brazil
39 207353391 South America Brazil BRA 3081000.0 POLYGON ((-53.37366 -33.76838, -53.65054 -33.2... BRA Brazil – Urban 173.5 cm (5 ft 8 1⁄2 in) 161.6 cm (5 ft 3 1⁄2 in) 1.07 20–24 (N= m:6,360 f:6,305) 10.9% Measured 2009 [25] 173.50 Brazil
40 207353391 South America Brazil BRA 3081000.0 POLYGON ((-53.37366 -33.76838, -53.65054 -33.2... BRA Brazil – Rural 170.9 cm (5 ft 7 1⁄2 in) 158.9 cm (5 ft 2 1⁄2 in) 1.07 20–24 (N= m:1,939 f:1,633) 2.1% Measured 2009 [25] 170.90 Brazil
42 11138234 South America Bolivia BOL 78350.0 POLYGON ((-69.52968 -10.95173, -68.78616 -11.0... BOL Bolivia 160.0 cm (5 ft 3 in) 142.2 cm (4 ft 8 in) 1.13 Aymara, 20–29 NaN Measured 1970 [23] 160.00 Bolivia
43 31036656 South America Peru PER 410400.0 POLYGON ((-69.89364 -4.29819, -70.79477 -4.251... PER Peru 164 cm (5 ft 4 1⁄2 in) 151 cm (4 ft 11 1⁄2 in) 1.09 20+ 0.011509% Measured 2005 [90] 164.00 Peru
44 47698524 South America Colombia COL 688000.0 POLYGON ((-66.87633 1.25336, -67.06505 1.13011... COL Colombia 170.6 cm (5 ft 7 in) 158.7 cm (5 ft 2 1⁄2 in) 1.07 18–22 (N= m:1,528,875 f:1,468,110) 14.1% Measured 2002 [33] 170.60 Colombia
56 67106161 Europe France -99 2699000.0 MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... FRA France 175.6 cm (5 ft 9 in) 162.5 cm (5 ft 4 in) 1.08 18–70 (N= m/f:11,562) 85.9% Measured 2003–2004 [45][46] 175.60 France
57 67106161 Europe France -99 2699000.0 MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... FRA France 174.1 cm (5 ft 8 1⁄2 in) 161.9 cm (5 ft 3 1⁄2 in) 1.08 20+ 96.6% Measured 2001 [7] 174.10 France
58 16290913 South America Ecuador ECU 182400.0 POLYGON ((-75.37322 -0.15203, -75.23372 -0.911... ECU Ecuador 167.1 cm (5 ft 6 in) 154.2 cm (5 ft 1⁄2 in) 1.08 NaN NaN Measured 2014 [40] 167.10 Ecuador
60 2990561 North America Jamaica JAM 25390.0 POLYGON ((-77.56960 18.49053, -76.89662 18.400... JAM Jamaica 171.8 cm (5 ft 7 1⁄2 in) 160.8 cm (5 ft 3 1⁄2 in) 1.07 25–74 71.4% Measured 1994–1996 [66] 171.80 Jamaica
61 11147407 North America Cuba CUB 132900.0 POLYGON ((-82.26815 23.18861, -81.40446 23.117... CUB Cuba – Urban 168 cm (5 ft 6 in) 156 cm (5 ft 1 1⁄2 in) 1.08 15+ 79.2% Measured 1999 [35] 168.00 Cuba
66 17885245 Africa Mali MLI 38090.0 POLYGON ((-11.51394 12.44299, -11.46790 12.754... MLI Mali – Southern Mali 171.3 cm (5 ft 7 1⁄2 in) 160.4 cm (5 ft 3 in) 1.07 Rural adults (N= m:121 f:320, SD= m:6.6 cm (2 ... NaN Measured 1992 [81] 171.30 Mali
70 190632261 Africa Nigeria NGA 1089000.0 POLYGON ((2.69170 6.25882, 2.74906 7.87073, 2.... NGA Nigeria 163.8 cm (5 ft 4 1⁄2 in) 157.8 cm (5 ft 2 in) 1.04 18–74 98.6% Measured 1994–1996 [66] 163.80 Nigeria
71 190632261 Africa Nigeria NGA 1089000.0 POLYGON ((2.69170 6.25882, 2.74906 7.87073, 2.... NGA Nigeria 167.2 cm (5 ft 6 in) 160.3 cm (5 ft 3 in) 1.04 20–29 (N= m:139 f:76, SD= m:6.5 cm (2 1⁄2 in) ... 33.2% Measured 2011 [87] 167.20 Nigeria
72 24994885 Africa Cameroon CMR 77240.0 POLYGON ((14.49579 12.85940, 14.89336 12.21905... CMR Cameroon – Urban 170.6 cm (5 ft 7 in) 161.3 cm (5 ft 3 1⁄2 in) 1.06 15+ (N= m:3,746 f:5,078) 53.6% Measured 2003 [28] 170.60 Cameroon
75 27499924 Africa Ghana GHA 120800.0 POLYGON ((0.02380 11.01868, -0.04978 10.70692,... GHA Ghana 169.5 cm (5 ft 6 1⁄2 in) 158.5 cm (5 ft 2 1⁄2 in) 1.07 25–29 14.7% Measured 1987–1989 [49] 169.50 Ghana
87 19196246 Africa Malawi MWI 21200.0 POLYGON ((32.75938 -9.23060, 33.73972 -9.41715... MWI Malawi – Urban 166 cm (5 ft 5 1⁄2 in) 155 cm (5 ft 1 in) 1.07 16–60 (N= m:583 f:315, SD= m:6.0 cm (2 1⁄2 in)... 101.1% Measured 2000 [78] 166.00 Malawi
92 8299706 Asia Israel ISR 297000.0 POLYGON ((35.71992 32.70919, 35.54567 32.39399... ISR Israel 177 cm (5 ft 9 1⁄2 in) 166 cm (5 ft 5 1⁄2 in) 1.07 18–21 9.7% Measured 2010 [64] 177.00 Israel
96 2051363 Africa Gambia GMB 3387.0 POLYGON ((-16.71373 13.59496, -15.62460 13.623... GMB Gambia – Rural 168.0 cm (5 ft 6 in) 157.8 cm (5 ft 2 in) 1.06 21–49 (N= m:9,559 f:13,160, SD= m:6.7 cm (2 1⁄... NaN Measured 1950–1974 [47] 168.00 Gambia
100 6072475 Asia United Arab Emirates ARE 667200.0 POLYGON ((51.57952 24.24550, 51.75744 24.29407... ARE United Arab Emirates 173.4 cm (5 ft 8 1⁄2 in) 156.4 cm (5 ft 1 1⁄2 in) 1.11 NaN NaN NaN NaN [128] 173.40 United Arab Emirates
101 2314307 Asia Qatar QAT 334500.0 POLYGON ((50.81011 24.75474, 50.74391 25.48242... QAT Qatar 170.8 cm (5 ft 7 in) 161.1 cm (5 ft 3 1⁄2 in) 1.06 18 1.9% Measured 2005 [21][96] 170.80 Qatar
103 39192111 Asia Iraq IRQ 596700.0 POLYGON ((39.19547 32.16101, 38.79234 33.37869... IRQ Iraq – Baghdad 165.4 cm (5 ft 5 in) 155.8 cm (5 ft 1 1⁄2 in) 1.06 18–44 (N= m:700 f:800, SD= m:5.6 cm (2 in) f:1... 76.3% Measured 1999–2000 [61] 165.40 Iraq
107 68414135 Asia Thailand THA 1161000.0 POLYGON ((105.21878 14.27321, 104.28142 14.416... THA Thailand 170.3 cm (5 ft 7 in) 159 cm (5 ft 2 1⁄2 in) 1.07 STOU students, 15–19 (N= m:839 f:1,636, SD= m:... 0.2%[122] Self-reported 2005 [123] 170.30 Thailand
110 96160163 Asia Vietnam VNM 594900.0 POLYGON ((104.33433 10.48654, 105.19991 10.889... VNM Vietnam 162.1 cm (5 ft 4 in) 152.2 cm (5 ft 0 in) 1.07 25–29 (SD= m:5.39 cm (2 in) f:5.39 cm (2 in)) 15.9% Measured 1992–1993 [49] 162.10 Vietnam
111 96160163 Asia Vietnam VNM 594900.0 POLYGON ((104.33433 10.48654, 105.19991 10.889... VNM Vietnam 165.7 cm (5 ft 5 in) 155.2 cm (5 ft 1 in) 1.07 Students, 20–25 (N= m:1,000 f:1,000, SD= m:6.5... 2.0%[136] Measured 2006–2007 [137] 165.70 Vietnam
112 25248140 Asia North Korea PRK 40000.0 MULTIPOLYGON (((130.78000 42.22001, 130.78000 ... North Korea North Korea 165.6 cm (5 ft 5 in) 154.9 cm (5 ft 1 in) 1.07 Defectors, 20–39 (N= m/f:1,075) 46.4% Measured 2005 [70] 165.60 North Korea
113 51181299 Asia South Korea KOR 1929000.0 POLYGON ((126.17476 37.74969, 126.23734 37.840... South Korea South Korea 170.7 cm (5 ft 7 in) 157.4 cm (5 ft 2 in) 1.08 20+ (N= m:2,750 f:2,445, Median= m:170.7 cm (5... 96.5% Measured 2010 [71] 170.70 South Korea
114 51181299 Asia South Korea KOR 1929000.0 POLYGON ((126.17476 37.74969, 126.23734 37.840... South Korea South Korea 173.5 cm (5 ft 8 1⁄2 in) NaN NaN Conscripts, 18–19 (N= m:323,800) 3.8% Measured 2017 [72] 173.50 South Korea
116 3068243 Asia Mongolia MNG 37000.0 POLYGON ((87.75126 49.29720, 88.80557 49.47052... MNG Mongolia 168.4 cm (5 ft 6 1⁄2 in) 157.7 cm (5 ft 2 in) 1.07 25–34 (N= m:158 f:181) 27.6% Measured 2006 [84] 168.40 Mongolia
117 1281935911 Asia India IND 8721000.0 POLYGON ((97.32711 28.26158, 97.40256 27.88254... IND India – Urban 174.3 cm (5 ft 8 1⁄2 in) 158.5 cm (5 ft 2 1⁄2 in) 1.10 Private school students, 18 (N= m:34,411 f:30,... NaN Measured 2011 [55] 174.30 India
118 1281935911 Asia India IND 8721000.0 POLYGON ((97.32711 28.26158, 97.40256 27.88254... IND India – Rural 161.5 cm (5 ft 3 1⁄2 in) 152.5 cm (5 ft 0 in) 1.06 17 (SD= m:7.0 cm (3 in) f:6.3 cm (2 1⁄2 in)) NaN Measured 2002 [56] 161.50 India
119 1281935911 Asia India IND 8721000.0 POLYGON ((97.32711 28.26158, 97.40256 27.88254... IND India 164.7 cm (5 ft 5 in) 152.6 cm (5 ft 0 in) 1.08 20–49 (N= m:69,245 f:118,796) 44.3% Measured 2005-2006 [57] 164.70 India
120 1281935911 Asia India IND 8721000.0 POLYGON ((97.32711 28.26158, 97.40256 27.88254... IND India – Patiala, Punjab 177.3 cm (5 ft 10 in) NaN NaN Students, Punjabi, 18-25 (N: 149, SD = 7.88 cm... 22.4% Measured 2013 [58] 177.30 India
123 29384297 Asia Nepal NPL 71520.0 POLYGON ((88.12044 27.87654, 88.04313 27.44582... NPL Nepal 163.0 cm (5 ft 4 in) 150.8 cm (4 ft 11 1⁄2 in) NaN 25–49 (N= f:6,280, SD= f:5.5 cm (2 in)) 52.9% Self-reported 2006 [15] 163.00 Nepal
129 82021564 Asia Iran IRN 1459000.0 POLYGON ((48.56797 29.92678, 48.01457 30.45246... Iran Iran 170.3 cm (5 ft 7 in) 157.2 cm (5 ft 2 in) 1.08 21+ (N= m/f:89,532, SD= m:8.05 cm (3 in) f:7.2... 88.1% Measured 2005 [60] 170.30 Iran
132 9960487 Europe Sweden SWE 498100.0 POLYGON ((11.02737 58.85615, 11.46827 59.43239... SWE Sweden 181.5 cm (5 ft 11 1⁄2 in) 166.8 cm (5 ft 5 1⁄2 in) 1.09 20–29 15.6% Measured 2008 [116] 181.50 Sweden
133 9960487 Europe Sweden SWE 498100.0 POLYGON ((11.02737 58.85615, 11.46827 59.43239... SWE Sweden 177.9 cm (5 ft 10 in) 164.6 cm (5 ft 5 in) 1.08 20–74 86.3% Self-reported 1987–1994 [117] 177.90 Sweden
136 38476269 Europe Poland POL 1052000.0 POLYGON ((23.48413 53.91250, 23.52754 53.47012... POL Poland 172.2 cm (5 ft 8 in) 159.4 cm (5 ft 3 in) 1.07 44-69 (N= m:4336 f: 4559) 39.4% Measured 2007 [93] 172.20 Poland
137 38476269 Europe Poland POL 1052000.0 POLYGON ((23.48413 53.91250, 23.52754 53.47012... POL Poland 178.7 cm (5 ft 10 1⁄2 in) 165.1 cm (5 ft 5 in) 1.08 18 (N= m:846 f:1,126) 1.6% Measured 2010 [94] 178.70 Poland
138 8754413 Europe Austria AUT 416600.0 POLYGON ((16.97967 48.12350, 16.90375 47.71487... AUT Austria 179 cm (5 ft 10 1⁄2 in) 166 cm (5 ft 5 1⁄2 in) 1.08 20–49 54.3% Measured 2006 [17] 179.00 Austria
139 9850845 Europe Hungary HUN 267600.0 POLYGON ((22.08561 48.42226, 22.64082 48.15024... HUN Hungary 176 cm (5 ft 9 1⁄2 in) 164 cm (5 ft 4 1⁄2 in) 1.07 Adults NaN Measured 2000s [53] 176.00 Hungary
140 9850845 Europe Hungary HUN 267600.0 POLYGON ((22.08561 48.42226, 22.64082 48.15024... HUN Hungary 177.3 cm (5 ft 10 in) NaN NaN 18 (N= m:1,080, SD= m:5.99 cm (2 1⁄2 in)) 1.7% Measured 2005 [54] 177.30 Hungary
142 21529967 Europe Romania ROU 441000.0 POLYGON ((28.23355 45.48828, 28.67978 45.30403... ROU Romania 172 cm (5 ft 7 1⁄2 in) 157 cm (5 ft 2 in) 1.10 NaN NaN Measured 2007 [97] 172.00 Romania
143 2823859 Europe Lithuania LTU 85620.0 POLYGON ((26.49433 55.61511, 26.58828 55.16718... LTU Lithuania – Urban 178.4 cm (5 ft 10 in) NaN NaN Conscripts, 19–25 (N= m:91 SD= m:6.7 cm (2 1⁄2... 9.9% Measured 2005[75] [76] 178.40 Lithuania
144 2823859 Europe Lithuania LTU 85620.0 POLYGON ((26.49433 55.61511, 26.58828 55.16718... LTU Lithuania – Rural 176.2 cm (5 ft 9 1⁄2 in) NaN NaN Conscripts, 19–25 (N= m:106 SD= m:5.9 cm (2 1⁄... 4.9% Measured 2005[75] [76] 176.20 Lithuania
145 2823859 Europe Lithuania LTU 85620.0 POLYGON ((26.49433 55.61511, 26.58828 55.16718... LTU Lithuania 181.3 cm (5 ft 11 1⁄2 in) 167.5 cm (5 ft 6 in) 1.08 18 2.1% Measured 2001 [77] 181.30 Lithuania
147 1251581 Europe Estonia EST 38700.0 POLYGON ((27.98113 59.47537, 27.98112 59.47537... EST Estonia 179.1 cm (5 ft 10 1⁄2 in) NaN NaN 17 2.3% Measured 2003 [42] 179.10 Estonia
148 80594017 Europe Germany DEU 3979000.0 POLYGON ((14.11969 53.75703, 14.35332 53.24817... DEU Germany 175.4 cm (5 ft 9 in) 162.8 cm (5 ft 4 in) 1.08 18–79 (N= m/f:19,768) 94.3% Measured 2007 [6] 175.40 Germany
149 80594017 Europe Germany DEU 3979000.0 POLYGON ((14.11969 53.75703, 14.35332 53.24817... DEU Germany 178 cm (5 ft 10 in) 165 cm (5 ft 5 in) 1.08 18+ (N= m:25,112 f:25,560) 100.0% Self-reported 2009 [48] 178.00 Germany
150 7101510 Europe Bulgaria BGR 143100.0 POLYGON ((22.65715 44.23492, 22.94483 43.82379... BGR Bulgaria 175.2 cm (5 ft 9 in) 163.2 cm (5 ft 4 1⁄2 in) 1.07 NaN NaN NaN 2010 [27] 175.20 Bulgaria
151 10768477 Europe Greece GRC 290500.0 MULTIPOLYGON (((26.29000 35.29999, 26.16500 35... GRC Greece 177 cm (5 ft 9 1⁄2 in) 165 cm (5 ft 5 in) 1.07 18–49 56.3% Measured 2003 [17] 177.00 Greece
152 80845215 Asia Turkey TUR 1670000.0 MULTIPOLYGON (((44.77268 37.17044, 44.29345 37... TUR Turkey 173.6 cm (5 ft 8 1⁄2 in) 161.9 cm (5 ft 3 1⁄2 in) 1.07 20-22 (N= m:322 f:247) 8.3% Measured 2007 [11][21][125] 173.60 Turkey
153 80845215 Asia Turkey TUR 1670000.0 MULTIPOLYGON (((44.77268 37.17044, 44.29345 37... TUR Turkey – Ankara 174.1 cm (5 ft 8 1⁄2 in) 158.9 cm (5 ft 2 1⁄2 in) 1.10 18–59 (N= m:703 f:512, Median= m:169.7 cm (5 f... 5.1%[126] Measured 2004–2006 [127] 174.10 Turkey
155 3047987 Europe Albania ALB 33900.0 POLYGON ((21.02004 40.84273, 20.99999 40.58000... ALB Albania 174.0 cm (5 ft 8 1⁄2 in) 161.8 cm (5 ft 3 1⁄2 in) 1.08 20–29 (N= m:649 f:1,806) 23.5% Measured 2008–2009 [11][12] 174.00 Albania
156 4292095 Europe Croatia HRV 94240.0 POLYGON ((16.56481 46.50375, 16.88252 46.38063... HRV Croatia 180.4 cm (5 ft 11 in) 166.49 cm (5 ft 5 1⁄2 in) 1.09 18 (N= m:358 f:360, SD= m:6.8 cm (2 1⁄2 in) f:... 1.6% Measured 2006–2008 [34] 180.40 Croatia
157 8236303 Europe Switzerland CHE 496300.0 POLYGON ((9.59423 47.52506, 9.63293 47.34760, ... CHE Switzerland 178.2 cm (5 ft 10 in) NaN NaN Conscripts, 19 (N= m:12,447, Median= m:178.0 c... 1.5% Measured 2009 [118] 178.20 Switzerland
158 8236303 Europe Switzerland CHE 496300.0 POLYGON ((9.59423 47.52506, 9.63293 47.34760, ... CHE Switzerland 175.4 cm (5 ft 9 in) 164 cm (5 ft 4 1⁄2 in) 1.07 20–74 88.8% Self-reported 1987–1994 [117] 175.40 Switzerland
160 11491346 Europe Belgium BEL 508600.0 POLYGON ((6.15666 50.80372, 6.04307 50.12805, ... BEL Belgium 178.6 cm (5 ft 10 1⁄2 in) 168.1 cm (5 ft 6 in) 1.06 21 (N= m:20–49 f:20–49, SD= m:6.6 cm (2 1⁄2 in... 1.7% Self-reported 2001 [22] 178.60 Belgium
161 17084719 Europe Netherlands NLD 870800.0 POLYGON ((6.90514 53.48216, 7.09205 53.14404, ... NLD Netherlands 180.8 cm (5 ft 11 in) 167.5 cm (5 ft 6 in) 1.08 20+ 96.8% Self-reported 2013 [9][26][86] 180.80 Netherlands
162 10839514 Europe Portugal PRT 297100.0 POLYGON ((-9.03482 41.88057, -8.67195 42.13469... PRT Portugal 173.9 cm (5 ft 8 1⁄2 in) NaN NaN 18 (N= m:696) 1.5% Measured 2008 [11][95] 173.90 Portugal
163 10839514 Europe Portugal PRT 297100.0 POLYGON ((-9.03482 41.88057, -8.67195 42.13469... PRT Portugal 171 cm (5 ft 7 1⁄2 in) 161 cm (5 ft 3 1⁄2 in) 1.06 20–50 56.7% Self-reported 2001 [17] 171.00 Portugal
164 10839514 Europe Portugal PRT 297100.0 POLYGON ((-9.03482 41.88057, -8.67195 42.13469... PRT Portugal 173.7 cm (5 ft 8 1⁄2 in) 163.7 cm (5 ft 4 1⁄2 in) 1.06 21 (N= m:87 f:106, SD= m:8.2 cm (3 in) f:5.3 c... 1.9% Self-reported 2001 [22] 173.70 Portugal
165 48958159 Europe Spain ESP 1690000.0 POLYGON ((-7.45373 37.09779, -7.53711 37.42890... ESP Spain 173.1 cm (5 ft 8 in) NaN NaN 18–70 (N= m:1,298 [s][112] ) 88.2% Measured 2013–2014 [113][114] 173.10 Spain
167 48958159 Europe Spain ESP 1690000.0 POLYGON ((-7.45373 37.09779, -7.53711 37.42890... ESP Spain 174 cm (5 ft 8 1⁄2 in) 163 cm (5 ft 4 in) 1.07 20–49 57.0% Self-reported 2007 [17] 174.00 Spain
168 5011102 Europe Ireland IRL 322000.0 POLYGON ((-6.19788 53.86757, -6.03299 53.15316... IRL Ireland 177 cm (5 ft 9 1⁄2 in) 163 cm (5 ft 4 in) 1.09 20–49 61.8% Measured 2007 [17] 177.00 Ireland
169 5011102 Europe Ireland IRL 322000.0 POLYGON ((-6.19788 53.86757, -6.03299 53.15316... IRL Ireland 179 cm (5 ft 10 1⁄2 in) 165 cm (5 ft 5 in) 1.08 18 - Measured 2014 [62][63] 179.00 Ireland
172 4510327 Oceania New Zealand NZL 174800.0 MULTIPOLYGON (((176.88582 -40.06598, 176.50802... NZL New Zealand 177 cm (5 ft 9 1⁄2 in) 164 cm (5 ft 4 1⁄2 in) 1.08 20–49 56.9% Measured 2007 [17] 177.00 New Zealand
173 23232413 Oceania Australia AUS 1189000.0 MULTIPOLYGON (((147.68926 -40.80826, 148.28907... AUS Australia 175.6 cm (5 ft 9 in) 161.8 cm (5 ft 3 1⁄2 in) 1.09 18+ 100.0% Measured 2011–2012 [16] 175.60 Australia
174 22409381 Asia Sri Lanka LKA 236700.0 POLYGON ((81.78796 7.52306, 81.63732 6.48178, ... LKA Sri Lanka 163.6 cm (5 ft 4 1⁄2 in) 151.4 cm (4 ft 11 1⁄2 in) 1.08 18+ (N= m:1,768 f:2,709, SD= m:6.9 cm (2 1⁄2 i... 100.0% Measured 2005–2006 [111] 163.60 Sri Lanka
175 1379302771 Asia China CHN 21140000.0 MULTIPOLYGON (((109.47521 18.19770, 108.65521 ... CHN China 169.5 cm (5 ft 6 1⁄2 in) 158.0 cm (5 ft 2 in) 1.07 18-69 (N=172,422) 76.8% Measured 2014 [31] 169.50 China
176 1379302771 Asia China CHN 21140000.0 MULTIPOLYGON (((109.47521 18.19770, 108.65521 ... CHN China – Beijing – Urban 175.2 cm (5 ft 9 in) 162.6 cm (5 ft 4 in) 1.08 Urban, 18 (N= m:448 f:405) 0.5% Measured 2011 [32] 175.20 China
177 23508428 Asia Taiwan TWN 1127000.0 POLYGON ((121.77782 24.39427, 121.17563 22.790... TWN Taiwan 171.4 cm (5 ft 7 1⁄2 in) 159.9 cm (5 ft 3 in) 1.07 17 (N= m:200 f:200) 1.7% Measured 2011 [119][120][121] 171.40 Taiwan
178 62137802 Europe Italy ITA 2221000.0 MULTIPOLYGON (((10.44270 46.89355, 11.04856 46... ITA Italy 176.5 cm (5 ft 9 1⁄2 in) 162.5 cm (5 ft 4 in) 1.09 18 1.4% Measured 1999–2004 [11][21][65] 176.50 Italy
179 62137802 Europe Italy ITA 2221000.0 MULTIPOLYGON (((10.44270 46.89355, 11.04856 46... ITA Italy 177.2 cm (5 ft 10 in) 167.8 cm (5 ft 6 in) 1.06 21 (N= m:106 f:92, SD= m:6.0 cm (2 1⁄2 in) f:6... 1.4% Self-reported 2001 [22] 177.20 Italy
180 5605948 Europe Denmark DNK 264800.0 MULTIPOLYGON (((9.92191 54.98310, 9.28205 54.8... DNK Denmark 180.4 cm (5 ft 11 in) 167.2 cm (5 ft 6 in) NaN Conscripts, 18–20 (N= m:38,025) 5.3% Measured 2012 [37] 180.40 Denmark
181 64769452 Europe United Kingdom GBR 2788000.0 MULTIPOLYGON (((-6.19788 53.86757, -6.95373 54... GBR United Kingdom – England 175.3 cm (5 ft 9 in) 161.9 cm (5 ft 3 1⁄2 in) 1.08 16+ (N= m:3,154 f:3,956) 103.2%[129] Measured 2012 [5] 175.30 United Kingdom
182 64769452 Europe United Kingdom GBR 2788000.0 MULTIPOLYGON (((-6.19788 53.86757, -6.95373 54... GBR United Kingdom – Scotland 175.0 cm (5 ft 9 in) 161.3 cm (5 ft 3 1⁄2 in) 1.08 16+ (N= m:2,512 f:3,180, Median= m:174.8 cm (5... 103.0%[129] Measured 2008 [130] 175.00 United Kingdom
183 64769452 Europe United Kingdom GBR 2788000.0 MULTIPOLYGON (((-6.19788 53.86757, -6.95373 54... GBR United Kingdom – Wales 177.0 cm (5 ft 9 1⁄2 in) 162.0 cm (5 ft 4 in) 1.09 16+ 103.2%[129] Self-reported 2009 [131] 177.00 United Kingdom
184 339747 Europe Iceland ISL 16150.0 POLYGON ((-14.50870 66.45589, -14.73964 65.808... ISL Iceland 181 cm (5 ft 11 1⁄2 in) 168 cm (5 ft 6 in) 1.08 20–49 43.6% Self-reported 2007 [17] 181.00 Iceland
185 9961396 Asia Azerbaijan AZE 167900.0 MULTIPOLYGON (((46.40495 41.86068, 46.68607 41... AZE Azerbaijan 171.8 cm (5 ft 7 1⁄2 in) 165.4 cm (5 ft 5 in) 1.04 16+ 106.5% Measured 2005 [18] 171.80 Azerbaijan
187 104256076 Asia Philippines PHL 801900.0 MULTIPOLYGON (((120.83390 12.70450, 120.32344 ... PHL Philippines 163.5 cm (5 ft 4 1⁄2 in) 151.8 cm (5 ft 0 in) 1.08 20–39 31.5%[91] Measured 2003 [92] 163.50 Philippines
188 31381992 Asia Malaysia MYS 863000.0 MULTIPOLYGON (((100.08576 6.46449, 100.25960 6... MYS Malaysia 166.3 cm (5 ft 5 1⁄2 in) 154.7 cm (5 ft 1 in) 1.07 Malay, 20–24 (N= m:749 f:893, Median= m:166 cm... 9.7%[79] Measured 1996 [80] 166.30 Malaysia
189 31381992 Asia Malaysia MYS 863000.0 MULTIPOLYGON (((100.08576 6.46449, 100.25960 6... MYS Malaysia 168.5 cm (5 ft 6 1⁄2 in) 158.1 cm (5 ft 2 in) 1.07 Chinese, 20–24 (N= m:407 f:453, Median= m:169 ... 4.1%[79] Measured 1996 [80] 168.50 Malaysia
190 31381992 Asia Malaysia MYS 863000.0 MULTIPOLYGON (((100.08576 6.46449, 100.25960 6... MYS Malaysia 169.1 cm (5 ft 6 1⁄2 in) 155.4 cm (5 ft 1 in) 1.09 Indian, 20–24 (N= m:113 f:140, Median= m:168 c... 1.2%[79] Measured 1996 [80] 169.10 Malaysia
191 31381992 Asia Malaysia MYS 863000.0 MULTIPOLYGON (((100.08576 6.46449, 100.25960 6... MYS Malaysia 163.3 cm (5 ft 4 1⁄2 in) 151.9 cm (5 ft 0 in) 1.08 Other indigenous, 20–24 (N= m:257 f:380, Media... 0.4%[79] Measured 1996 [80] 163.30 Malaysia
193 1972126 Europe Slovenia SVN 68350.0 POLYGON ((13.80648 46.50931, 14.63247 46.43182... SVN Slovenia – Ljubljana 180.3 cm (5 ft 11 in) 167.4 cm (5 ft 6 in) 1.08 19 0.2%[108] Measured 2011 [109] 180.30 Slovenia
194 5491218 Europe Finland FIN 224137.0 POLYGON ((28.59193 69.06478, 28.44594 68.36461... FIN Finland 178.9 cm (5 ft 10 1⁄2 in) 165.3 cm (5 ft 5 in) 1.08 25–34 (N= m/f:2,305) 19.0% Measured 1994 [43] 178.90 Finland
195 5491218 Europe Finland FIN 224137.0 POLYGON ((28.59193 69.06478, 28.44594 68.36461... FIN Finland 180.7 cm (5 ft 11 in) 167.2 cm (5 ft 6 in) 1.08 −25 (N= m/f:26,636) 9.2% Measured 2010–2011 [43][44] 180.70 Finland
196 5445829 Europe Slovakia SVK 168800.0 POLYGON ((22.55814 49.08574, 22.28084 48.82539... SVK Slovakia 179.4 cm (5 ft 10 1⁄2 in) 165.6 cm (5 ft 5 in) 1.08 18 2.0% Measured 2004 [107] 179.40 Slovakia
197 10674723 Europe Czechia CZE 350900.0 POLYGON ((15.01700 51.10667, 15.49097 50.78473... CZE Czech Republic 180.3 cm (5 ft 11 in) 167.22 cm (5 ft 6 in) 1.08 17 1.6% Measured 2001 [36] 180.30 Czech Republic
199 126451398 Asia Japan JPN 4932000.0 MULTIPOLYGON (((141.88460 39.18086, 140.95949 ... JPN Japan 172 cm (5 ft 7 1⁄2 in) 158 cm (5 ft 2 in) 1.08 20–49 47.2% Measured 2005 [17] 172.00 Japan
200 126451398 Asia Japan JPN 4932000.0 MULTIPOLYGON (((141.88460 39.18086, 140.95949 ... JPN Japan 172.0 cm (5 ft 7 1⁄2 in) 158.70 cm (5 ft 2 1⁄2 in) 1.08 20–24 (N= m:1,708 f:1,559, SD= m:5.42 cm (2 in... 7.2% Measured 2004 [67] 172.00 Japan
201 126451398 Asia Japan JPN 4932000.0 MULTIPOLYGON (((141.88460 39.18086, 140.95949 ... JPN Japan 170.7 cm (5 ft 7 in) 158.0 cm (5 ft 2 in) 1.08 17 1.2% Measured 2013 [68] 170.70 Japan
204 28571770 Asia Saudi Arabia SAU 1731000.0 POLYGON ((34.95604 29.35655, 36.06894 29.19749... SAU Saudi Arabia 168.9 cm (5 ft 6 1⁄2 in) 156.3 cm (5 ft 1 1⁄2 in) 1.08 18 3.0% Measured 2010 [21][100] 168.90 Saudi Arabia
205 28571770 Asia Saudi Arabia SAU 1731000.0 POLYGON ((34.95604 29.35655, 36.06894 29.19749... SAU Saudi Arabia 174 cm (5 ft 8 1⁄2 in) NaN NaN NaN NaN NaN 2017 [101] 174.00 Saudi Arabia
210 97041072 Africa Egypt EGY 1105000.0 POLYGON ((36.86623 22.00000, 32.90000 22.00000... EGY Egypt 170.3 cm (5 ft 7 in) 158.9 cm (5 ft 2 1⁄2 in) 1.07 20–24 (N= m:845 f:1,059) 16.6% Measured 2008 [41] 170.30 Egypt
220 7111024 Europe Serbia SRB 101800.0 POLYGON ((18.82982 45.90887, 18.82984 45.90888... SRB Serbia 182.0 cm (5 ft 11 1⁄2 in) 166.8 cm (5 ft 5 1⁄2 in) 1.09 Students at UNS,18–30 (N= m:318 f:76, SD= m:6.... 0.7%[102] Measured 2012 [103] 182.00 Serbia
221 642550 Europe Montenegro MNE 10610.0 POLYGON ((20.07070 42.58863, 19.80161 42.50009... MNE Montenegro 183.4 cm (6 ft 0 in) 169.4 cm (5 ft 6 1⁄2 in) 1.09 17-20 (N= m:981 f:1107, SD= m:6.89 cm (2 1⁄2 i... 5.2% Measured 2017 [85] 183.40 Montenegro
222 1895250 Europe Kosovo -99 18490.0 POLYGON ((20.59025 41.85541, 20.52295 42.21787... Kosovo Kosovo – Prishtina 179.52 cm (5 ft 10 1⁄2 in) 165.72 cm (5 ft 5 in) NaN Conscripts, 18-20 (N= m:830 f:793, SD= m:7.02 ... 63.0% Measured 2017 [74] 179.52 Kosovo
Also notice that we removed the rows which have no Average male height.
Step 5: Create the map with our data
Now we have done all the hard work.
It is time to use folium to do the last piece of work.
Let’s put it all together.
import pandas as pd
import numpy as np
import folium
import geopandas
import pycountry
# Helper function to map country names to alpha_3 representation - though some are not known by the library
def lookup_country_code(country):
    try:
        return pycountry.countries.lookup(country).alpha_3
    except LookupError:
        return country
# The URL we will read our data from
url = 'https://en.wikipedia.org/wiki/Average_human_height_by_country'
# read_html returns a list of tables from the URL
tables = pd.read_html(url)
# The data is in the first table
table = tables[0]
# To avoid writing it all the time
AVG_MH = 'Average male height'
CR = 'Country/Region'
COUNTRY = 'Country'
AMH_F = 'Average male height (float)'
A3 = 'alpha3'
# Remove duplicate rows with 'Average male height'
table = table.loc[table[AVG_MH] != AVG_MH].copy()
# Clean up data to have height in cm
table[AMH_F] = table.apply(
    lambda row: float(row[AVG_MH].split(' ')[0]) if pd.notna(row[AVG_MH]) else np.nan,
    axis=1)
# Strip the region part if the name contains a dash (e.g. 'Brazil – Urban')
table[COUNTRY] = table.apply(
    lambda row: row[CR].split(' – ')[0] if ' – ' in row[CR] else row[CR],
    axis=1)
# Map the country name to the alpha3 representation
table[A3] = table.apply(lambda row: lookup_country_code(row[COUNTRY]), axis=1)
# Read the geopandas dataset
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Do the same mapping to alpha3
world[A3] = world.apply(lambda row: lookup_country_code(row['name']), axis=1)
# Merge the data
table = world.merge(table, how="left", left_on=[A3], right_on=[A3])
# Remove countries with no data
table = table.dropna(subset=[AMH_F])
# Creating a map
my_map = folium.Map()
# Adding the data from our table
folium.Choropleth(
    geo_data=table,
    name='choropleth',
    data=table,
    columns=[A3, AMH_F],
    key_on='feature.properties.alpha3',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Male height'
).add_to(my_map)
# Save the map to an html file
my_map.save('height_map.html')
This should result in a map like the one below, which you can open in your browser and zoom in and out of.
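If you want to open the result directly from Python, here is a small optional snippet (not part of the script above) using the standard library webbrowser module; it assumes height_map.html sits in the current working directory.
import webbrowser
from pathlib import Path
# Open the saved map in the default browser
webbrowser.open(Path('height_map.html').resolve().as_uri())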