Master Markowitz Portfolio Optimization (Efficient Frontier) in Python using Pandas

What is Markowitz Portfolio Optimization (Efficient Frontier)?

The Efficient Frontier takes a portfolio of investments and optimizes the expected return with regard to the risk. That is, it finds the optimal return for a given level of risk.

According to investopedia.org, the return is based on the expected Compound Annual Growth Rate (CAGR), and the risk metric is the standard deviation of the return.

But what does all that mean? We will learn that in this tutorial.

Step 1: Get the time series of your stock portfolio

We will use the following portfolio of 4 stocks: Apple (AAPL), Microsoft (MSFT), IBM (IBM), and Nvidia (NVDA).

To get the time series we will use the Yahoo! Finance API through the Pandas-datareader.

We will look 5 years back.

import pandas_datareader as pdr
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta

# Look 5 years back from today
years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)

# Collect the closing prices of each ticker in one DataFrame
close_price = pd.DataFrame()
tickers = ['AAPL', 'MSFT', 'IBM', 'NVDA']
for ticker in tickers:
    tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
    close_price[ticker] = tmp['Close']
print(close_price)

Resulting in the following output (only the first few lines are shown).

                  AAPL        MSFT         IBM        NVDA
Date                                                      
2015-08-25  103.739998   40.470001  140.960007   20.280001
2015-08-26  109.690002   42.709999  146.699997   21.809999
2015-08-27  112.919998   43.900002  148.539993   22.629999
2015-08-28  113.290001   43.930000  147.979996   22.730000
2015-08-31  112.760002   43.520000  147.889999   22.480000

The DataFrame contains the daily closing prices for the last 5 years, counted back from the current date.

Step 2: Calculate the CAGR, returns, and covariance

To calculate the expected return, we use the Compound Annual Growth Rate (CAGR) over the last 5 years, as investopedia suggests. An alternative that is also commonly used is the mean of the returns. The key thing is to have some common measure of the return.

The CAGR is calculated as follows.

CAGR = (end_price / start_price)^(1/years) - 1
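
As a small illustrative example (the prices are made up): a stock that goes from 100 to 200 over 5 years has a CAGR of roughly 14.9% per year.

# Standalone example with made-up prices
start_price, end_price, years = 100, 200, 5
cagr = (end_price / start_price) ** (1 / years) - 1
print(cagr)  # about 0.149, i.e. roughly 14.9% per year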

We will also calculate the covariance, as we will use it to calculate the variance of a weighted portfolio. Remember that the standard deviation is given by the following.

sigma = sqrt(variance)

A portfolio is a vector w with the weight of each stock. For example, w = [0.2, 0.3, 0.4, 0.1] says that we have 20% in the first stock, 30% in the second, 40% in the third, and 10% in the final stock. It all sums up to 100%.

Given a weight vector w for the portfolio, you can calculate the variance of the portfolio by using the covariance matrix.

variance = w^T Cov w

Where Cov is the covariance matrix.
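
To make the formula concrete, here is a small sketch with a made-up two-stock example (the weights and covariance numbers are purely illustrative):

import numpy as np

# Made-up weights and covariance matrix for a two-stock portfolio
w = np.array([0.6, 0.4])
cov = np.array([[0.0004, 0.0002],
                [0.0002, 0.0003]])

# variance = w^T Cov w, and sigma = sqrt(variance)
variance = np.dot(w, np.dot(cov, w))
sigma = np.sqrt(variance)
print(variance, sigma)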

This results in the following pre-computations.

# Daily returns as the ratio between consecutive closing prices
returns = close_price/close_price.shift(1)
# CAGR over the full period
cagr = (close_price.iloc[-1]/close_price.iloc[0])**(1/years) - 1
# Covariance matrix of the daily returns
cov = returns.cov()
print(cagr)
print(cov)

The output is shown here.

# CAGR:
AAPL    0.371509
MSFT    0.394859
IBM    -0.022686
NVDA    0.905011
dtype: float64
# Covariance
          AAPL      MSFT       IBM      NVDA
AAPL  0.000340  0.000227  0.000152  0.000297
MSFT  0.000227  0.000303  0.000164  0.000306
IBM   0.000152  0.000164  0.000260  0.000210
NVDA  0.000297  0.000306  0.000210  0.000879
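
Before trying many random portfolios, we can sanity-check the formulas on a single portfolio. The snippet below is a small sketch reusing the cagr and cov variables computed above, with an arbitrary equal-weight split across the four stocks.

import numpy as np

# An arbitrary example portfolio: 25% in each of the 4 stocks
w = np.array([0.25, 0.25, 0.25, 0.25])

# Expected return: weighted sum of the per-stock CAGR values
portfolio_return = np.dot(w, cagr)

# Risk: sigma = sqrt(w^T Cov w), using the covariance of the daily returns
portfolio_sigma = np.sqrt(np.dot(w, np.dot(cov, w)))

print(portfolio_return, portfolio_sigma)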

Step 3: Plot the return and risk

This is where the power of computing comes into the picture. The idea is to simply try random portfolios and see how each rates with regard to expected return and risk.

It is that simple. Make a random weight distribution for your portfolio and plot the point given by the expected return (based on our CAGR) and the risk (the standard deviation calculated from the covariance matrix).

import matplotlib.pyplot as plt
import numpy as np

def random_weights(n):
    # Random weights that sum to one
    k = np.random.rand(n)
    return k / sum(k)

exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    # Expected return of the weighted portfolio
    exp_return.append(np.dot(w, cagr.T))
    # Risk: sigma = sqrt(w^T Cov w)
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))

plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.show()

We introduce a helper function, random_weights, which returns a random portfolio weighting. That is, it returns a vector with entries that sum up to one. This gives us a way to distribute our portfolio across the stocks.

Then we iterate 20,000 times (it could be any value; we just want enough points to plot our graph). In each iteration we make a random weight vector w and calculate the expected return as the dot product of w and the transposed cagr, using NumPy's dot-product function.

The dot product np.dot(w, cagr.T) takes the elements of w and cagr pairwise, multiplies them, and sums the results. The transpose only adjusts the orientation so the multiplication works.
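
For example, using the example weights from before and the CAGR values (rounded) from the output above, the dot product works out like this (a small sketch):

import numpy as np

# Weights from the earlier example and CAGR values rounded from the output above
w = np.array([0.2, 0.3, 0.4, 0.1])
cagr_values = np.array([0.37, 0.39, -0.02, 0.90])

# Pairwise multiply and sum: 0.2*0.37 + 0.3*0.39 + 0.4*(-0.02) + 0.1*0.90
print(np.dot(w, cagr_values))  # 0.273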

The standard deviation (appended to sigma) is calculated similarly, using the formula from the last step, variance = w^T Cov w (evaluated as nested dot products), and then taking the square root.

This results in the following graph.

Returns vs risks

The graph outlines a parabola-shaped region. The optimal portfolios lie along the upper half of its outline. Hence, given a risk, the optimal portfolio is the one on the upper border of the filled parabola.
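
If you want to highlight that upper border programmatically, one simple approximation is to sort the sampled portfolios by risk and trace the best return seen so far. A small sketch, assuming the sigma and exp_return lists from the loop above:

import numpy as np
import matplotlib.pyplot as plt

sigma_arr = np.array(sigma)
return_arr = np.array(exp_return)

# Sort the sampled portfolios by risk and keep the running maximum of the return;
# this running maximum approximates the upper border (the efficient frontier)
order = np.argsort(sigma_arr)
frontier = np.maximum.accumulate(return_arr[order])

plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.plot(sigma_arr[order], frontier, 'b-')
plt.show()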

Considerations

The Efficient Frontier gives you a way to balance your portfolio. The above code can find such a portfolio by trial and error, but it still leaves out some considerations.

How often should you re-balance? Re-balancing has a cost.

The underlying theory makes assumptions that may not hold in reality. As investopedia points out, it assumes that asset returns follow a normal distribution, but in reality returns can be more than 3 standard deviations away from the mean. Also, the theory assumes that investors are rational in their investments, which most consider a flawed assumption, as more factors play into investment decisions.

The full source code

Below you find the full source code from the tutorial.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd
from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt
import numpy as np

years = 5
end_date = dt.datetime.now()
start_date = end_date - relativedelta(years=years)
close_price = pd.DataFrame()
tickers = ['AAPL', 'MSFT', 'IBM', 'NVDA']
for ticker in tickers:
    tmp = pdr.get_data_yahoo(ticker, start_date, end_date)
    close_price[ticker] = tmp['Close']
returns = close_price / close_price.shift(1)
cagr = (close_price.iloc[-1] / close_price.iloc[0]) ** (1 / years) - 1
cov = returns.cov()
def random_weights(n):
    k = np.random.rand(n)
    return k / sum(k)
exp_return = []
sigma = []
for _ in range(20000):
    w = random_weights(len(tickers))
    exp_return.append(np.dot(w, cagr.T))
    sigma.append(np.sqrt(np.dot(np.dot(w.T, cov), w)))
plt.plot(sigma, exp_return, 'ro', alpha=0.1)
plt.show()

Pandas and Finance: Make Candlestick Plot

What will we cover in this tutorial?

A quick way to make a candlestick plot using the mplfinance library on financial data in Pandas DataFrames.

The code

It is straightforward to achieve using the new matplotlib finance API (mplfinance). The data can be collected using Pandas-datareader with the open Yahoo! Finance API.

import pandas_datareader as pdr
import datetime as dt
import mplfinance as mpf

df = pdr.get_data_yahoo("AAPL", dt.datetime(2020,6,1), dt.datetime.now())
mpf.plot(df, type='candle', style='charles',
            title='Apple',
            ylabel='Price',
            ylabel_lower='Volume',
            volume=True,
            mav=(1,3,6))

The mav argument sets the moving averages; I have also included 1, which is just the actual price.
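
If you prefer more conventional moving-average windows, you could, for example, plot the 20- and 50-day averages instead. A small sketch, reusing the df from the snippet above:

# Example variant: 20- and 50-day moving averages instead of (1, 3, 6)
mpf.plot(df, type='candle', style='charles',
         title='Apple', ylabel='Price',
         ylabel_lower='Volume', volume=True,
         mav=(20, 50))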

Visualize Inflation for 2019 using Pandas-datareader and GeoPandas

What will we cover in this tutorial?

In this tutorial we will visualize inflation on a map. We will get the inflation data directly from the World Bank using Pandas-datareader and join it with data from GeoPandas, which provides a world map we can use to create a Choropleth map.

The end result

Step 1: Retrieve the inflation data from World Bank

The Pandas-datareader has an interface to get data from World Bank. To find interesting data from World Bank you should explore data.worldbank.org, which contains various interesting indicators.

When you find one, like Inflation, consumer prices (annual %), which we will use, you can see that you can download it as CSV, XML, or Excel. But we are not old-fashioned; hence, we will use the API directly to get fresh data every time we run our program.

To use the API, we need the indicator code, which you will find in the URL. In this case:

https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG

Hence the indicator code is FP.CPI.TOTL.ZG.
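
If you prefer to search for indicator codes from Python instead of browsing the website, Pandas-datareader also has a search function for World Bank indicators. A small sketch (the search term is just an example):

from pandas_datareader import wb

# Search World Bank indicators whose name matches the given pattern
matches = wb.search('consumer prices.*annual')
print(matches[['id', 'name']])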

Using the Pandas-datareader API you can get the data by running the following piece of code.

from pandas_datareader import wb
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
print(data)

If you inspect the output, you will see it is structured a bit inconveniently.

                                                         FP.CPI.TOTL.ZG
country                                            year                
Arab World                                         2019        1.336016
Caribbean small states                             2019             NaN
Central Europe and the Baltics                     2019        2.664561
Early-demographic dividend                         2019        3.030587
East Asia & Pacific                                2019        1.773102
East Asia & Pacific (excluding high income)        2019        2.779172
East Asia & Pacific (IDA & IBRD countries)         2019        2.779172

It has a two-level index (country and year).

We want to reset index level 1 (the year), which will turn year into a column. Then, for convenience, we rename the columns.

from pandas_datareader import wb
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
print(data)

Resulting in the following.

                                                    year  inflation
country                                                            
Arab World                                          2019   1.336016
Caribbean small states                              2019        NaN
Central Europe and the Baltics                      2019   2.664561
Early-demographic dividend                          2019   3.030587
East Asia & Pacific                                 2019   1.773102
East Asia & Pacific (excluding high income)         2019   2.779172
East Asia & Pacific (IDA & IBRD countries)          2019   2.779172

Step 2: Retrieve the world map data

The world map data is available from GeoPandas. At first glance everything is easy.

import geopandas
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
print(map)

Where I excluded Antarctica for visual purposes. Inspecting some of the output.

        pop_est                continent                      name iso_a3   gdp_md_est                                           geometry
0        920938                  Oceania                      Fiji    FJI      8374.00  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1      53950935                   Africa                  Tanzania    TZA    150600.00  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2        603253                   Africa                 W. Sahara    ESH       906.50  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3      35623680            North America                    Canada    CAN   1674000.00  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4     326625791            North America  United States of America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
5      18556698                     Asia                Kazakhstan    KAZ    460700.00  POLYGON ((87.35997 49.21498, 86.59878 48.54918...
6      29748859                     Asia                Uzbekistan    UZB    202300.00  POLYGON ((55.96819 41.30864, 55.92892 44.99586...

It seems to be a good match to join the data on the name column.

To make it easy, we can make the name column index.

import geopandas
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')

Step 3: Joining the datasets

This is the fun part of Data Science. Why? I am glad you asked. Well, that was irony. The challenge will become apparent in a moment. There are various ways to deal with it, but in this tutorial we will use a simplistic approach.

Let us do the join.

from pandas_datareader import wb
import geopandas
import pandas as pd
pd.set_option('display.width', 3000)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 500)
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
map = map.join(data, how='outer')
print(map)

Where I use an outer join to make all the “challenges” visible.

Russia                                              1.422575e+08                   Europe    RUS   3745000.00  MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...   NaN        NaN
Russian Federation                                           NaN                      NaN    NaN          NaN                                               None  2019   4.470367
...
United States                                                NaN                      NaN    NaN          NaN                                               None  2019   1.812210
United States of America                            3.266258e+08            North America    USA  18560000.00  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...   NaN        NaN

Where I only show two snippets. The key thing here is that the data from GeoPandas, containing the map, and the data from World Bank, containing the inflation rates we want to color the map with, are not joined for these rows.

Hence, we need to join United States with United States of America, and Russia with Russian Federation.

One approach is to use a location service that maps country names to country codes, i.e., mapping each dataset's country names to country codes and joining on those (note that GeoPandas already has 3-letter country codes, but some are missing, like Norway). This approach can still have some missing pieces, as some country names are not known by the mapping.

Another approach is to find all the entries that did not map and rename them in one of the datasets. This can take some time, but I did most of them in the following (a quick way to check which names remain unmatched is shown after the code).

from pandas_datareader import wb
import geopandas
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
index_change = {
    'United States of America': 'United States',
    'Yemen': 'Yemen, Rep.',
    'Venezuela': 'Venezuela, RB',
    'Syria': 'Syrian Arab Republic',
    'Solomon Is.': 'Solomon Islands',
    'Russia': 'Russian Federation',
    'Iran': 'Iran, Islamic Rep.',
    'Gambia': 'Gambia, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Mauritania': 'Mauritius',
    'Egypt': 'Egypt, Arab Rep.'
}
map = map.rename(index=index_change)
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
map = map.join(data, how='outer')
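
To see which country names still do not match after the rename, you can compare the two indexes directly (a small sketch, reusing the map and data variables from above):

# Country names present in one dataset but not the other after the rename
print(data.index.difference(map.index))
print(map.index.difference(data.index))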

Step 4: Making a Choropleth map based on our dataset

The simple plot of the data will not be very insightful. But let’s try that first.

map.plot('inflation')
plt.title("Inflation 2019")
plt.show()

Resulting in the following.

The default result.

A good way to get inspiration is to check out the documentation with examples.

From the GeoPandas documentation

Where you see a cool color map using scheme='quantiles' (note that the scheme argument requires the mapclassify package to be installed). Let's try that.

map.plot('inflation', cmap='OrRd', scheme='quantiles')
plt.title("Inflation 2019")
plt.show()

Resulting in the following.

Closer

Adding a grey tone to countries that are not mapped, adding a legend, and setting the figure size, and we are done. The full source code is here.

from pandas_datareader import wb
import geopandas
import pandas as pd
import matplotlib.pyplot as plt
map = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
map = map[map['name'] != 'Antarctica']
map = map.set_index('name')
index_change = {
    'United States of America': 'United States',
    'Yemen': 'Yemen, Rep.',
    'Venezuela': 'Venezuela, RB',
    'Syria': 'Syrian Arab Republic',
    'Solomon Is.': 'Solomon Islands',
    'Russia': 'Russian Federation',
    'Iran': 'Iran, Islamic Rep.',
    'Gambia': 'Gambia, The',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Mauritania': 'Mauritius',
    'Egypt': 'Egypt, Arab Rep.'
}
map = map.rename(index=index_change)
data = wb.download(indicator='FP.CPI.TOTL.ZG', country='all', start=2019, end=2019)
data = data.reset_index(1)
data.columns = ['year', 'inflation']
map = map.join(data, how='outer')
map.plot('inflation', cmap='OrRd', scheme='quantiles', missing_kwds={"color": "lightgrey"}, legend=True, figsize=(14,5))
plt.title("Inflation 2019")
plt.show()

Resulting in the following output.

Inflation data from World Bank mapped on a Choropleth map using GeoPandas and MatPlotLib.