Calculate the CAPM with Python in 3 Easy Steps

What will we cover?

In this lesson we will learn about the CAPM and how to calculate it.

The objectives of this tutorial are:

  • Understand the CAPM (Capital Asset Pricing Model).
  • Beta and CAPM calculations.
  • Expected return of an investment.

Step 1: What is the CAPM?

The CAPM calculates the relationship between systematic risk and expected return. Several assumptions behind the CAPM formula have been shown not to hold in reality, but the formula is still widely used.

The formula is as follows: ER_i = R_f + β_i * (ER_m - R_f), where ER_i is the expected return of the investment, R_f is the risk free rate, β_i is the Beta of the investment, and (ER_m - R_f) is the market risk premium.

Step 2: Get some data to make calculations on

Let’s get some data and calculate it.

import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
 
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM', '^GSPC']
start = dt.datetime(2015, 12, 1)
end = dt.datetime(2021, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start, end, interval="m")
 
data = data['Adj Close']
 
log_returns = np.log(data/data.shift())

Feel free to change the tickers to your choice and remember to update the dates to fit your purpose.

Step 3: How to calculate CAPM with Python (NumPy and pandas)

The calculations are done quite easily.

Looking at the formula again, the risk free return is often set to 0; otherwise, the 10-year Treasury Note is used. Here, we use 1.38%. You can look up the current rate and update the value accordingly.

cov = log_returns.cov()  # covariance matrix of all log returns
var = log_returns['^GSPC'].var()  # variance of the market (S&P 500)

beta = cov.loc['AAPL', '^GSPC']/var  # Beta of AAPL against the market

risk_free_return = 0.0138
market_return = .105
expected_return = risk_free_return + beta*(market_return - risk_free_return)

Notice that you can do the calculation for all tickers simultaneously.
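As a small sketch of that idea, the ^GSPC column of the covariance matrix holds the covariance of every ticker with the market, so dividing it by the market variance gives all the Betas at once (reusing cov, var and the rates from above):

betas = cov['^GSPC']/var  # Beta of every ticker against the market (the market itself gets 1)
expected_returns = risk_free_return + betas*(market_return - risk_free_return)
print(expected_returns)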

Want more?

This is part of a 2.5 hour full video course in 8 parts about Risk and Return.

Calculate the market (S&P 500) BETA with Python for any Stock

What will we cover?

In this lesson we will learn about the market Beta against the S&P 500 index, how to calculate it, and how it compares to the calculation from the last lesson.

The objectives of this tutorial are:

  • Understand what market Beta tells you.
  • How to calculate the market (S&P 500) Beta.
  • See how Beta is related to Linear Regression.

Step 1: What is BETA and how to interpret the value

Beta is a measure of a stock’s volatility in relation to the overall market (S&P 500). The S&P 500 index has Beta 1.

High-beta stocks are supposed to be riskier but provide higher potential return, while low-beta stocks pose less risk but also lower returns.

Interpretation

  • Beta above 1: the stock is more volatile than the market and is expected to provide a higher return.
  • Beta below 1: the stock is less volatile and is expected to provide a lower return.

The formula for Beta is the covariance between the stock and the market divided by the variance of the market: Beta = Cov(r_stock, r_market) / Var(r_market).

This sounds scarier than it is.

The Beta on financial pages, like Yahoo! Finance, is calculated on monthly prices.

Step 2: Get some historic stock prices with Pandas Datareader

Let’s make an example here.

import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
from sklearn.linear_model import LinearRegression
 
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM', '^GSPC']
start = dt.datetime(2015, 12, 1)
end = dt.datetime(2021, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start, end, interval="m")
 
data = data['Adj Close']
 
log_returns = np.log(data/data.shift())

Notice that we read the data with interval="m", which gives us monthly data.

Step 3: Calculate the BETA

Then the Beta is calculated as follows.

cov = log_returns.cov()
var = log_returns['^GSPC'].var()
 
cov.loc['AAPL', '^GSPC']/var

For Apple, it was 1.25.

You might wonder whether this is related to the Beta value from Linear Regression. Let’s check it out.

X = log_returns['^GSPC'].iloc[1:].to_numpy().reshape(-1, 1)
Y = log_returns['AAPL'].iloc[1:].to_numpy().reshape(-1, 1)
 
lin_regr = LinearRegression()
lin_regr.fit(X, Y)
 
lin_regr.coef_[0, 0]

This also gives 1.25. Hence, it is the same calculation behind both.

Want more?

This is part of a 2.5 hour video course in 8 lessons about Risk and Return.

How to use Linear Regression to Calculate the Beta to the General Market (S&P 500)

What will we cover?

In this lesson we will learn about Linear Regression, how it differs from Correlation, and how to visualize Linear Regression.

The objectives of this tutorial are:

  • Understand the difference between Linear Regression and Correlation.
  • Understand the difference between truly random and correlated variables.
  • Visualize linear regression.

Step 1: Similarities and differences between linear regression and correlation

Let’s first see what the similarities and differences between Linear Regression and Correlation are.

Similarities.

  • Quantify the direction and strength of the relationship between two variables, here we look at stock prices.

Differences.

  • Correlation is a single statistic. It is just a number between -1 and 1 (both inclusive).
  • Linear regression produces an equation.

Step 2: Visualize data with no correlation

A great way to learn about relationships between variables is to compare them to random (uncorrelated) variables.

Let’s start by doing that.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib notebook
 
X = np.random.randn(5000)
Y = np.random.randn(5000)
 
fig, ax = plt.subplots()
ax.scatter(X, Y, alpha=.2)

Giving the following scatter chart.

This shows what two uncorrelated variables look like.
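As a quick numeric check of the same point, the correlation coefficient of the two random samples should be close to 0 (a small sketch using the X and Y from above):

np.corrcoef(X, Y)[0, 1]  # close to 0 for independent random variables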

Step 3: How to visualize correlated stock prices

To compare that to two correlated variables, we need some data.

tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT', '^GSPC']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
log_returns = np.log(data/data.shift())

Let’s make a function to calculate the Linear Regression and visualize it.

def linear_regression(ticker_a, ticker_b):
    X = log_returns[ticker_a].iloc[1:].to_numpy().reshape(-1, 1)
    Y = log_returns[ticker_b].iloc[1:].to_numpy().reshape(-1, 1)
 
    lin_regr = LinearRegression()
    lin_regr.fit(X, Y)
 
    Y_pred = lin_regr.predict(X)
 
    alpha = lin_regr.intercept_[0]
    beta = lin_regr.coef_[0, 0]
 
    fig, ax = plt.subplots()
    ax.set_title("Alpha: " + str(round(alpha, 5)) + ", Beta: " + str(round(beta, 3)))
    ax.scatter(X, Y)
    ax.plot(X, Y_pred, c='r')

The function takes the two tickers and gets the log returns as NumPy arrays. They are reshaped to fit the format required by sklearn.

Then the Linear Regression model (LinearRegression) is fitted and used to predict values. The alpha and beta are the linear coefficients (intercept and slope). Finally, we scatter plot all the points together with the prediction line.

Let’s try linear_regression("AAPL", "^GSPC").

Where we see the red line as the prediction line.

Step 4: A few more examples

Other examples: linear_regression("AAPL", "MSFT").

And linear_regression("AAPL", "TWTR").

This visually shows that AAPL and TWTR are not as closely correlated as the other examples.

Want more?

This is part of an 8 lesson, 2.5 hour video course with prepared Jupyter Notebooks containing the Python code.

How to Calculate Correlation between Stock Price Movements with Python

What will we cover?

In this lesson we will learn about correlation of assets, how to calculate correlation, and how correlation relates to risk.

The learning objectives of this tutorial are:

  • What is correlation and how to use it
  • Calculate correlation
  • Find negatively correlated assets

Step 1: What is Correlation

Correlation is a statistic that measures the degree to which two variables move in relation to each other. Correlation measures association, but doesn’t show if x causes y or vice versa.

The correlation between two stocks is a number from -1 to 1 (both inclusive).

  • A positive correlation means that when stock x goes up, we expect stock y to go up, and vice versa.
  • A negative correlation means that when stock x goes up, we expect stock y to go down, and vice versa.
  • With a correlation of zero, we cannot say anything about how the two move in relation to each other.

The formula for calculating the correlation is quite a mouthful: the covariance of the two variables divided by the product of their standard deviations, corr(x, y) = Cov(x, y) / (std(x) * std(y)).

Step 2: Calculate the Correlation with DataFrames (pandas)

Luckily, pandas DataFrames can calculate it for us, so we do not need to implement the formula ourselves.

Let’s get started. First, we need to load some time series of historic stock prices.

See this tutorial on how to work with portfolios.

import pandas as pd
import pandas_datareader as pdr
import datetime as dt
import numpy as np
 
tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
 
log_returns = np.log(data/data.shift())

Where we also calculate the log returns.

The correlation can be calculated as follows.

log_returns.corr()

That was easy, right? Remember we do it on the log returns to keep everything on the same scale.

Symbols      AAPL      TWTR       IBM      MSFT
Symbols
AAPL     1.000000  0.531973  0.518204  0.829547
TWTR     0.531973  1.000000  0.386493  0.563909
IBM      0.518204  0.386493  1.000000  0.583205
MSFT     0.829547  0.563909  0.583205  1.000000

We see that the correlation on the diagonal is 1.0. This is expected, since each diagonal entry is the correlation of a ticker with itself (AAPL and AAPL, and so forth).

Other than that, we can conclude that AAPL and MSFT are the most correlated.
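If you prefer to find the most correlated pair programmatically, a small sketch could look like this (ignoring the trivial 1.0 values on the diagonal):

corr = log_returns.corr()
np.fill_diagonal(corr.values, 0)  # zero out the self-correlations
print(corr.stack().idxmax())  # ('AAPL', 'MSFT') in this run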

Step 3: Calculate the correlation to the general market

Let’s add the S&P 500 to our DataFrame.

sp500 = pdr.get_data_yahoo("^GSPC", start)
 
log_returns['SP500'] = np.log(sp500['Adj Close']/sp500['Adj Close'].shift())
 
log_returns.corr()

Resulting in this.

Where we see that AAPL and MSFT are the most correlated with the S&P 500 index. This is not surprising, as they make up a large share of the market-cap weight of the index.

Step 4: Find Negative Correlated assets when Investing using Python

We will add this helper function to help find correlations.

We are in particular interested in negative correlation here.

def test_correlation(ticker):
    df = pdr.get_data_yahoo(ticker, start)
    lr = log_returns.copy()
    lr[ticker] = np.log(df['Adj Close']/df['Adj Close'].shift())
    return lr.corr()

This can help us find assets with a negative correlation.

Why do we want that? Well, to minimize the risk. Read my eBook on the subject if you want to learn more about that.

Now, let’s test.

test_correlation("TLT")

Resulting in the following.

TLT shows the negative correlation we are looking for.
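You can run the same check for several candidates. A small sketch, where GLD (a gold ETF) is a hypothetical second candidate:

for candidate in ["TLT", "GLD"]:
    corr = test_correlation(candidate)
    print(candidate, corr.loc["AAPL", candidate])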

Step 5: Visualize the negative correlation

This can be visualized to get a better understanding as follows.

import matplotlib.pyplot as plt
%matplotlib notebook
 
def visualize_correlation(ticker1, ticker2):
    df = pdr.get_data_yahoo([ticker1, ticker2], start)
    df = df['Adj Close']
    df = df/df.iloc[0]
    fig, ax = plt.subplots()
    df.plot(ax=ax)

With visualize_correlation("AAPL", "TLT") we get.

Where we see that when AAPL goes down, TLT tends to go up.

And if we look at visualize_correlation("^GSPC", "TLT") (the S&P 500 index and TLT).

Want more?

This is part of a full FREE course with all the code available on my GitHub.

Monte Carlo Simulation to Optimize a Portfolio using Pandas and NumPy

What will we cover?

In this tutorial we will learn about Monte Carlo Simulation. 

First an introduction to the concept and then how to use Sharpe Ratio to find the optimal portfolio with Monte Carlo Simulation.

The learning objectives are:

  • The principles behind Monte Carlo Simulation
  • Applying Monte Carlo Simulation using Sharpe Ratio to get the optimal portfolio
  • Create a visual Efficient Frontier based on Sharpe Ratio

Step 1: What is Monte Carlo Simulation

Monte Carlo Simulation is a great tool to master. It can be used to simulate risk and uncertainty that can affect the outcome of different decision options.

Simply said, if there are too many variables affecting the outcome to calculate it directly, we can simulate them and find the optimum based on the simulated values.

Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. It is a technique used to understand the impact of risk and uncertainty in prediction and forecasting models.

https://www.investopedia.com/terms/m/montecarlosimulation.asp

Step 2: A simple example to demonstrate Monte Carlo Simulation

Here we will first use it on a simple example that we can also calculate precisely. This is only to get an idea of what Monte Carlo Simulations can do for us.

The game we play.

  • You roll two dice. 
  • When you roll 7, then you gain 5 dollars.
  • If you roll anything other than 7, you lose 1 dollar.

How can we simulate this game?

Well, the roll of two dice can be simulated with NumPy as follows.

import numpy as np
 
def roll_dice():
    return np.sum(np.random.randint(1, 7, 2))

Here a roll is simulated with a call to roll_dice(). It simply uses np.random.randint(1, 7, 2), which returns an array of length 2 with integers in the range 1 to 6 (the upper bound 7 is excluded, 1 is included). Then np.sum(…) adds the two integers into the sum of the two simulated dice.

Now to the Monte Carlo Simulation.

This is simply to make a trial run and then see if it is a good game or not.

def monte_carlo_simulation(runs=1000):
    results = np.zeros(2)
    for _ in range(runs):
        if roll_dice() == 7:
            results[0] += 1
        else:
            results[1] += 1
    return results

This is done by keeping track of how many games we win and lose.

A run could look like this.

monte_carlo_simulation()

It could return array([176., 824.]), which would mean winning 176*5 = 880 USD and losing 824 USD, a total gain of 56 USD.

Each run will most likely give different conclusions.

Step 3: Visualize the result of Monte Carlo Simulation Example

A way to get a more precise picture is to make more runs. Here, we will try to record a series of runs and visualize them.

results = np.zeros(1000)
 
for i in range(1000):
    results[i] = monte_carlo_simulation()[0]
 
import matplotlib.pyplot as plt
%matplotlib notebook
 
fig, ax = plt.subplots()
ax.hist(results, bins=15)

Resulting in this figure.

This gives an idea of the returns of a game of 1000 rolls and how volatile they are. If the game were less volatile, the results would center around one value.

For these particular runs, results.mean()*5 gives the average win of 833.34 USD (notice, you will not get the exact same number due to the randomness involved).

The average loss will be 1000 - results.mean() = 833.33 USD.

This looks like a pretty even game.

Step 4: Making the precise calculation of the example

Can we calculate this exactly?

Yes. The reason is that this is a simple situation we are simulating. When we have more variables (as we will in a moment with portfolio simulation), that will not be the case.

A nice way to visualize it is as follows.

d1 = np.arange(1, 7)
d2 = np.arange(1, 7)
mat = np.add.outer(d1, d2)

Where the matrix mat looks as follows.

array([[ 2,  3,  4,  5,  6,  7],
       [ 3,  4,  5,  6,  7,  8],
       [ 4,  5,  6,  7,  8,  9],
       [ 5,  6,  7,  8,  9, 10],
       [ 6,  7,  8,  9, 10, 11],
       [ 7,  8,  9, 10, 11, 12]])

The exact probability for rolling 7 is.

mat[mat == 7].size/mat.size

Where we count how many occurrences of 7 there are, divided by the total number of possibilities. This gives 0.16666666666666667, or 1/6.

Hence, the expected gain per game is 5*(1/6) - 1*(5/6) = 0, so it is a fair game with no advantage. This is the same conclusion we reached with the Monte Carlo Simulation.
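We can also compute the exact expected value per game directly, reusing the mat from above:

p_seven = mat[mat == 7].size/mat.size  # probability of rolling 7: 1/6
expected_gain = 5*p_seven - 1*(1 - p_seven)
print(expected_gain)  # 0.0, confirming a fair game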

Step 5: Using Monte Carlo Simulation for Portfolio Optimization

Now that we have some understanding of Monte Carlo Simulation, we are ready to use it for portfolio optimization.

To do that, we need to read some time series of historic stock prices. See this tutorial to learn more on that.

import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
 
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']

To use it with Sharpe Ratio, we will calculate the log returns.

log_returns = np.log(data/data.shift())

The Monte Carlo Simulations can be done as follows.

# Monte Carlo Simulation
n = 5000
 
weights = np.zeros((n, 4))
exp_rtns = np.zeros(n)
exp_vols = np.zeros(n)
sharpe_ratios = np.zeros(n)
 
for i in range(n):
    weight = np.random.random(4)
    weight /= weight.sum()
    weights[i] = weight
     
    exp_rtns[i] = np.sum(log_returns.mean()*weight)*252
    exp_vols[i] = np.sqrt(np.dot(weight.T, np.dot(log_returns.cov()*252, weight)))
    sharpe_ratios[i] = exp_rtns[i] / exp_vols[i]

The code will run 5000 experiments, and we keep all the data from each run: the weights of the portfolios (weights), the expected return (exp_rtns), the expected volatility (exp_vols) and the Sharpe Ratio (sharpe_ratios).

Then we iterate over the range.

First we create a random portfolio in weight (notice it will have sum 1). Then we calculate the expected annual return. The expected volatility is calculated a bit differently than we learned in the lesson about Sharpe Ratio. This is only to make it perform faster.

Finally, the Sharpe Ratio is calculated.

In this specific run (you might get different values), the maximum Sharpe Ratio, given by sharpe_ratios.max(), is 1.1398396630767385.

To get the optimal weights (portfolio), call weights[sharpe_ratios.argmax()]. In this specific run, array([4.57478167e-01, 6.75247425e-02, 4.74612301e-01, 3.84789577e-04]). This concludes that holding 45.7% in AAPL, 6.8% in MSFT, 47.5% in TWTR, and 0.04% in IBM is optimal.
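A small sketch to pair each ticker with its optimal weight, which is easier to read than the raw array:

optimal_weights = weights[sharpe_ratios.argmax()]
for ticker, w in zip(tickers, optimal_weights):
    print(f"{ticker}: {w:.1%}")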

Step 6: Visualizing the Monte Carlo Simulation of the Efficient Frontier

This can be visualized as follows in an Efficient Frontier.

import matplotlib.pyplot as plt
%matplotlib notebook
 
fig, ax = plt.subplots()
ax.scatter(exp_vols, exp_rtns, c=sharpe_ratios)
ax.scatter(exp_vols[sharpe_ratios.argmax()], exp_rtns[sharpe_ratios.argmax()], c='r')
ax.set_xlabel('Expected Volatility')
ax.set_ylabel('Expected Return')

Resulting in this chart.

Want more?

This is part of a full course on Financial Risk and Return with Pandas and NumPy.

The code is available on GitHub.

How to Calculate Sharpe Ratio with Pandas and NumPy

What will we cover?

In this tutorial we will see how to calculate the Sharpe Ratio using pandas DataFrames and NumPy with Python.

The Sharpe Ratio combines Risk and Return in one number.

The Sharpe Ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk. Volatility is a measure of the price fluctuations of an asset or portfolio (source).

Step 1: The formula for Sharpe Ratio and how to interpret the result

The Sharpe Ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk.

The idea with the Sharpe Ratio is to have one number that represents both return and risk. This makes it easy to compare different weightings of portfolios. We will use that in the next lesson about Monte Carlo Simulations for Portfolio Optimization.

Now that is a lot of words. What does the Sharpe Ratio look like? Sharpe Ratio = (portfolio return - risk free return) / standard deviation of the portfolio.

We need the return of the portfolio and the risk free return, as well as the standard deviation of the portfolio.

  • The return of the portfolio we covered in lesson 1, but we will calculate it with log returns here. 
  • It is customary to use the 10 Year Treasury Note for the risk free return, but as it has been low for a long time, 0 is often used.
  • The standard deviation is a measure of the volatility, and is used here to represent the risk. This is similar to Average True Range.

Step 2: Get a portfolio of stock prices with Pandas Datareader

To get started, we need to read time series data of historic stock prices for a portfolio. This can be done as follows.

import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
 
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']

Where our portfolio will consist of the tickers for Apple, Microsoft, Twitter and IBM (AAPL, MSFT, TWTR, IBM). We read the data from the start of 2020 from the Yahoo! Finance API using Pandas Datareader.

Finally, we only keep the Adjusted Close price.

Step 3: Calculate the log-return of the portfolio

Let’s assume our portfolio is balanced as follows: 25%, 15%, 40%, and 20% to AAPL, MSFT, TWTR, and IBM, respectively.

Then we can calculate the daily log return of the portfolio as follows.

portfolio = [.25, .15, .40, .20]
log_return = np.sum(np.log(data/data.shift())*portfolio, axis=1)

Where we use np.log to take the logarithm of the daily change, then multiply by the portfolio weights. Finally, we sum (np.sum) along the rows (axis=1).

Step 4: Visualize the log-return of the portfolio

For fun, we can visualize the daily log returns as follows.

import matplotlib.pyplot as plt
%matplotlib notebook
 
fig, ax = plt.subplots()
log_return.hist(bins=50, ax=ax)

Resulting in this.

This gives an impression of how volatile the portfolio is. The more the data is centered around 0.0, the less volatile and risky it is.

Step 5: Calculate the Sharpe Ratio of the Portfolio

The Sharpe Ratio can be calculated directly as follows.

sharpe_ratio = log_return.mean()/log_return.std()

This gives a daily Sharpe Ratio, where the return is the mean value, that is, the average daily return of the investment, divided by the standard deviation. Notice that the risk free return is set to 0 here.

The greater the standard deviation, the greater the expected magnitude of deviation from the mean value.

To get an annualized Sharpe Ratio, we scale by the square root of 252, the number of trading days in a year.

asr = sharpe_ratio*252**.5

This is the measure we will use in the next lesson, where we will optimize the portfolio using Monte Carlo Simulation.
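If you want to subtract a nonzero risk free rate before dividing by the volatility, a minimal sketch could look like this (the 1% annual rate is an assumption for illustration, converted to a daily rate):

risk_free_rate = 0.01/252  # assumed 1% annual risk free rate, on a daily basis
excess_sharpe = (log_return.mean() - risk_free_rate)/log_return.std()
annualized_excess_sharpe = excess_sharpe*252**.5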

What next?

If you want to learn more, this is one part of an 8 lesson video course.

How to Calculate Volatility as Average True Range (ATR) with Python DataFrames and NumPy

What will we cover?

In this tutorial we will learn about the volatility of a stock price measured with Average True Range. The Volatility of a stock is one measure of investment risk.

The Average True Range was originally designed to calculate the volatility of commodities, but the technical indicator can also be used for stock prices, as we will do in this tutorial.

We will get some stock price time series data and make the calculation using pandas DataFrames and NumPy.

Step 1: Get historic stock price data

To get started, we need some historic stock prices. This can be done as follows and is covered in Lesson 1.

import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd

start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo("NFLX", start)

This reads the time series data of the historic stock prices of Netflix (ticker NFLX).

Step 2: The formula of Average True Range (ATR)

To calculate the Average True Range (ATR) we need a formula, which is given on investopedia.com.

The Average True Range (ATR) is a moving average of the True Range (TR). The TR is given by the maximum of: the current high (H) minus the current low (L), the absolute value of the current high (H) minus the previous close (Cp), and the absolute value of the current low (L) minus the previous close (Cp). That is, TR = max(H - L, |H - Cp|, |L - Cp|).

Step 3: The calculation of Average True Range with DataFrames and NumPy

The above formula can look intimidating at first, but don’t worry, this is where the power of Pandas DataFrames and Series comes into the picture.

It is always a good idea to make your calculations simple.

high_low = data['High'] - data['Low']
high_cp = np.abs(data['High'] - data['Close'].shift())
low_cp = np.abs(data['Low'] - data['Close'].shift())

Here we make a Series for each of the values needed. Notice that we get the previous close by using shift() (data['Close'].shift()).

Then a great way to get the maximum value of these is to create a DataFrame with all the values.

df = pd.concat([high_low, high_cp, low_cp], axis=1)
true_range = np.max(df, axis=1)

Now that is nice.

Then we get the ATR as the moving average of 14 days (14 days is the default).

average_true_range = true_range.rolling(14).mean()
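If you need the calculation in several places, you can wrap the steps above in a helper. This is a small sketch (compute_atr is a hypothetical name, with the conventional 14 day window as a parameter):

def compute_atr(df, window=14):
    high_low = df['High'] - df['Low']
    high_cp = np.abs(df['High'] - df['Close'].shift())
    low_cp = np.abs(df['Low'] - df['Close'].shift())
    true_range = pd.concat([high_low, high_cp, low_cp], axis=1).max(axis=1)
    return true_range.rolling(window).mean()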

Step 4: Visualize the result with Matplotlib

Finally, let’s try to visualize it. Often visualization helps us understand it better.

import matplotlib.pyplot as plt
%matplotlib notebook

fig, ax = plt.subplots()
average_true_range.plot(ax=ax)
ax2 = data['Close'].plot(ax=ax, secondary_y=True, alpha=.3)
ax.set_ylabel("ATR")
ax2.set_ylabel("Price")

Resulting in the following chart.

Want to learn more?

This is part of a full video course consisting of 8 lessons and 2.5 hours of video content.

Get started with Pandas and NumPy for Finance for Risk and Return

What will we cover?

In this part we will get familiar with NumPy. We will assume familiarity with the Pandas library. If you are new to Pandas, we suggest you start with this FREE 2h course. This part will look at how Pandas and NumPy are connected.

In this tutorial we will cover the following.

  • A refresher on working with Pandas and Pandas Datareader to read historic stock prices.
  • How Pandas DataFrames and NumPy arrays are related and how they differ.
  • Calculations of return of a portfolio, which is a primary evaluation factor of an investment.
Step 1: Get some data with Pandas Datareader

    First, we need some historic time series stock prices. This can be easily done with Pandas Datareader.

    import numpy as np
    import pandas_datareader as pdr
    import datetime as dt
    import pandas as pd
    
    start = dt.datetime(2020, 1, 1)
    data = pdr.get_data_yahoo("AAPL", start)
    

    This will read historic stock prices from Apple (ticker AAPL) starting from 2020 and up until today. The data is in a DataFrame (Pandas main data structure).

    It is a good habit to verify that the data is as expected to avoid surprises later in the process. That can be done by calling head() on the DataFrame data, which will show the first 5 lines.

    data.head()
    

    Resulting in.

                     High        Low       Open      Close       Volume  Adj Close
    Date                                                                          
    2020-01-02  75.150002  73.797501  74.059998  75.087502  135480400.0  74.333511
    2020-01-03  75.144997  74.125000  74.287498  74.357498  146322800.0  73.610840
    2020-01-06  74.989998  73.187500  73.447502  74.949997  118387200.0  74.197395
    2020-01-07  75.224998  74.370003  74.959999  74.597504  108872000.0  73.848442
    2020-01-08  76.110001  74.290001  74.290001  75.797501  132079200.0  75.036385
    

    Recall that the index should be a DatetimeIndex. This makes it possible to take advantage of the time series functionality.

    data.index
    

    The above gives the index.

    DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
                   '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-13',
                   '2020-01-14', '2020-01-15',
                   ...
                   '2021-03-03', '2021-03-04', '2021-03-05', '2021-03-08',
                   '2021-03-09', '2021-03-10', '2021-03-11', '2021-03-12',
                   '2021-03-15', '2021-03-17'],
                  dtype='datetime64[ns]', name='Date', length=303, freq=None)
    

    To remind ourselves further, we recall that each column in a DataFrame has a datatype.

    data.dtypes
    

    Shown below here.

    High         float64
    Low          float64
    Open         float64
    Close        float64
    Volume       float64
    Adj Close    float64
    dtype: object
    

    Step 2: Investigate how NumPy is different from DataFrames (pandas)

    The next step in our journey is to see how NumPy is different from Pandas DataFrames.

    We can get the DataFrame as a NumPy array as follows.

    arr = data.to_numpy()
    

    The shape of a NumPy array gives the dimensions.

    arr.shape


    Resulting in.

    (303, 6)
    

    Please notice that you might get more rows than 303, as you are running this later than we did in the tutorial. There will be a row for each day the stock exchange was open since the beginning of 2020.

    But you should get 6 columns, as there are 6 columns in our DataFrame, where the NumPy array comes from.

    The first row of data can be accessed as follows.

    arr[0]
    

    Which gives the data of the first row, as we know it from the DataFrame.

    [7.51500015e+01 7.37975006e+01 7.40599976e+01 7.50875015e+01
     1.35480400e+08 7.43335114e+01]
    

    Notice the scientific notation. Other than that, you can see the figures are the same.

    Now to an interesting difference from DataFrames: a NumPy array only has one datatype. That means all columns share the same datatype, here for the full array.

    arr.dtype
    

    Resulting in the following output.

    dtype('float64')
    

    To access the top 10 entries of the first column in our NumPy array (the one representing the High column), we can use the following notation.

    small = arr[:10, 0].copy()
    small
    

    Which will output a one-dimensional array of size 10, containing the 10 first values of column 0.

    array([75.15000153, 75.14499664, 74.98999786, 75.22499847, 76.11000061,
           77.60749817, 78.16750336, 79.26750183, 79.39250183, 78.875     ])
    

    Step 3: NumPy functionality

    Some nice functionality to master.

    np.max(small)
    small.max()
    small.argmax()
    

    Where the first two return the maximum value of the array, small. The argmax() returns the index of the maximum value.

    The NumPy functionality works well on DataFrames, which comes in handy when working with financial data.

    We can get the logarithm of values in a NumPy array as follows.

    np.log(small)
    

    Similarly, we can apply the logarithm on all entries in a DataFrame as follows.

    np.log(data)
    

    This is magic.

                    High       Low      Open     Close     Volume  Adj Close
    Date                                                                    
    2020-01-02  4.319486  4.301325  4.304876  4.318654  18.724338   4.308562
    2020-01-03  4.319420  4.305753  4.307943  4.308885  18.801326   4.298792
    2020-01-06  4.317355  4.293025  4.296571  4.316821  18.589471   4.306729
    2020-01-07  4.320484  4.309053  4.316955  4.312107  18.505683   4.302015
    2020-01-08  4.332180  4.307976  4.307976  4.328065  18.698912   4.317973
    

    While the logarithm of all the columns does not make sense here, we will use this later and it will all make sense.

    Step 4: Calculate the daily return

    We can calculate the daily return as follows.

    data/data.shift()
    

    Resulting in the following output (or first few lines).

                    High       Low      Open     Close    Volume  Adj Close
    Date                                                                   
    2020-01-02       NaN       NaN       NaN       NaN       NaN        NaN
    2020-01-03  0.999933  1.004438  1.003072  0.990278  1.080029   0.990278
    2020-01-06  0.997937  0.987352  0.988693  1.007968  0.809082   1.007968
    2020-01-07  1.003134  1.016157  1.020593  0.995297  0.919626   0.995297
    2020-01-08  1.011765  0.998924  0.991062  1.016086  1.213160   1.016086
    

    Let’s investigate that a bit. Recall the data (you can get the first 5 lines: data.head())

                     High        Low       Open      Close       Volume  Adj Close
    Date                                                                          
    2020-01-02  75.150002  73.797501  74.059998  75.087502  135480400.0  74.333511
    2020-01-03  75.144997  74.125000  74.287498  74.357498  146322800.0  73.610840
    2020-01-06  74.989998  73.187500  73.447502  74.949997  118387200.0  74.197395
    2020-01-07  75.224998  74.370003  74.959999  74.597504  108872000.0  73.848442
    2020-01-08  76.110001  74.290001  74.290001  75.797501  132079200.0  75.036385
    

    Notice the calculation.

    75.144997/75.150002
    

    Gives.

    0.9999333998687053
    

    Hence, the second row of High divided by the first row gives the same value as the second row of data/data.shift().

    This is no coincidence. The line takes each entry in data and divides it by the corresponding entry in data.shift(), and data.shift() is shifted one day forward. Hence, each row is divided by the previous row.
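    As a quick sanity check, pandas’ built-in pct_change() does the same shift-and-divide, just offset by 1. A small sketch (np.allclose with equal_nan=True handles the NaN in the first row):

    np.allclose(data/data.shift(), data.pct_change() + 1, equal_nan=True)  # True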

    Now that we understand that, let’s get back to the logarithm. Because we love log returns. Why? Let’s see this example.

    np.sum(np.log(data/data.shift()))
    

    Giving.

    High         0.502488
    Low          0.507521
    Open         0.515809
    Close        0.492561
    Volume      -1.278826
    Adj Close    0.502653
    dtype: float64
    

    And the following.

    np.log(data/data.iloc[0]).tail(1)
    

    Giving the following.

                    High       Low      Open     Close    Volume  Adj Close
    Date                                                                   
    2021-03-17  0.502488  0.507521  0.515809  0.492561 -1.278826   0.502653
    

    Now why are we so excited about that?

    Well, because we can sum the log daily returns and get the full return. This is really handy when we want to calculate returns of changing portfolios or similar.

    We do not care where the log returns comes from. If our money was invested one day in one portfolio, we get the log return from that. The next day our money is invested in another portfolio. Then we get the log return from that. The sum of those two log returns give the full return.

    That’s the magic.
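    A tiny numeric sketch of that property, with made-up prices:

    a = np.log(110/100)  # day 1: the price goes from 100 to 110
    b = np.log(99/110)   # day 2: the price goes from 110 to 99
    np.isclose(a + b, np.log(99/100))  # True: the log returns sum to the total log return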

    Step 5: Reading data from multiple tickers

    We also cover how to reshape data in the video lecture.

    Then we consider how to calculate with portfolio and get the return.

    This requires us to read data from multiple tickers to create a portfolio.

    tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
    start = dt.datetime(2020, 1, 1)
    
    data = pdr.get_data_yahoo(tickers, start)
    

    This gives data in the following format.

    Attributes   Adj Close              ...      Volume           
    Symbols           AAPL        MSFT  ...        TWTR        IBM
    Date                                ...                       
    2020-01-02   74.333511  158.571075  ...  10721100.0  3148600.0
    2020-01-03   73.610840  156.596588  ...  14429500.0  2373700.0
    2020-01-06   74.197395  157.001373  ...  12582500.0  2425500.0
    2020-01-07   73.848442  155.569855  ...  13712900.0  3090800.0
    2020-01-08   75.036385  158.047836  ...  14632400.0  4346000.0
    

    Where the columns have two layers of names: first the attributes, then a second layer with the tickers.

    If we only want to work with the Adj Close values, which is often the case, we can access them as follows.

    data = data['Adj Close']
    

    Giving data in the following format.

    Symbols           AAPL        MSFT       TWTR         IBM
    Date                                                     
    2020-01-02   74.333511  158.571075  32.299999  126.975204
    2020-01-03   73.610840  156.596588  31.520000  125.962540
    2020-01-06   74.197395  157.001373  31.639999  125.737526
    2020-01-07   73.848442  155.569855  32.540001  125.821907
    2020-01-08   75.036385  158.047836  33.049999  126.872055
    

    Now that is convenient.

    Step 6: Calculate the portfolio holdings

    Now consider a portfolio as follows.

    portfolios = [.25, .15, .40, .20]
    

    That is, 25%, 15%, 40%, and 20% to AAPL, MSFT, TWTR, and IBM, respectively.

    Assume we have 100000 USD to invest as above.

    (data/data.iloc[0])*portfolios*100000
    

    What happened there? Well, first we normalize the data with data/data.iloc[0]. This was covered in the previous course.

    Then we multiply with the portfolio and the amount we invest.

    This results in the following.

    Symbols             AAPL          MSFT          TWTR           IBM
    Date                                                              
    2020-01-02  25000.000000  15000.000000  40000.000000  20000.000000
    2020-01-03  24756.949626  14813.223758  39034.057216  19840.494087
    2020-01-06  24954.221177  14851.514331  39182.662708  19805.051934
    2020-01-07  24836.860500  14716.100112  40297.215708  19818.342892
    2020-01-08  25236.391776  14950.504296  40928.792592  19983.752826
    

    As we can see in the first row, this distributes the money according to the portfolio allocation. Then it shows how it evolves.

    We can get the sum of the full return as follows.

    np.sum((data/data.iloc[0])*portfolios*100000, axis=1)
    

    Where we show the summary here.

    Date
    2020-01-02    100000.000000
    2020-01-03     98444.724688
    2020-01-06     98793.450150
    2020-01-07     99668.519212
    2020-01-08    101099.441489
                      ...      
    2021-03-10    162763.421409
    2021-03-11    168255.248962
    2021-03-12    167440.137240
    2021-03-15    171199.207668
    2021-03-17    169031.577658
    Length: 303, dtype: float64
    

    As you see, we start with 100000 USD and end with 169031 USD in this case. You might get a slightly different result, as you run yours on a later day.

    This is handy to explore a portfolio composition.

    Actually, when we get to Monte Carlo Simulation, this will be handy. There, we will generate multiple random portfolios and calculate the return and risk for each of them, to optimize the portfolio composition.

    A random portfolio can be generated as follows with NumPy.

    weight = np.random.random(4)
    weight /= weight.sum()
    

    Notice, that we generate 4 random numbers (one for each ticker) and then we divide by the sum of them. This ensures the sum of the weights will be 1, hence, representing a portfolio.

    This was the first lesson.

    Want more?

    Check out the full 2.5 hour course with prepared Jupyter Notebooks to follow along.

    Visualize Why Long-term Investing is Less Risky – Pandas and Matplotlib

    What will we cover in this tutorial?

    We will look at how you can use Pandas Datareader (Pandas) and Matplotlib to create a visualization of why long-term investing is less risky.

    Here risk simply means the risk of losing money.

    Specifically, we will investigate how likely it is to lose money (and how much) if you invest with a 1 year perspective vs a 10 year perspective.

    Step 1: Establish the data for the investigation

    One of the most widely used indices is the S&P 500 index. This index lists 500 large companies on the US market exchange and is one of the most commonly followed equity indices.

    We will use this index and retrieve data back from 1970 and up until today.

    This can be done as follows.

    import pandas_datareader as pdr
    from datetime import datetime
    
    data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))
    

    Then the DataFrame data will contain all data from 1970 up until today. The ^GSPC is the ticker for the S&P 500 index.

    Step 2: Calculate the annual return from 1970 and forward using Pandas

    The annual return for a year is calculated by taking the last trading value of the year divided by the first, subtracting 1, and then multiplying by 100 to get it as a percentage. As a formula: annual return = (last / first - 1) * 100.

    Calculating it for all years, you can visualize the returns with a histogram as follows.

    import pandas as pd
    import pandas_datareader as pdr
    from datetime import datetime
    import matplotlib.pyplot as plt
    
    
    data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))
    
    years = []
    annual_return = []
    
    for year in range(1970, 2021):
        years.append(year)
        data_year = data.loc[f'{year}']['Adj Close']
        annual_return.append((data_year.iloc[-1] / data_year.iloc[0] - 1) * 100)
    
    df = pd.DataFrame(annual_return, index=years)
    bins = [i for i in range(-40, 45, 5)]
    df.plot.hist(bins=bins, title='1 year')
    plt.show()
    

    Notice that we create a new DataFrame with all the annual returns for each of the years and use it to make a histogram.

    The result is as follows.

    What you see is a histogram indicating how many years a given annual return was occurring.

    Hence, a return between -40% and -35% occurred once, while a return between 0% and 5% happened 6 times in the span of years from 1970 to 2020 (inclusive).

    What does this tell us?

    Well, you can lose up to 40%, but you can also gain up to 35% in one year. It also shows you that it is more likely to gain (positive return) than lose.

    But what if we invested the money for 10 years?

    Step 3: Calculate the average annual return in 10 years spans starting from 1970 using Pandas

    This is actually quite similar, but with a few changes.

    First of all, the average annual return is calculated using the CAGR (Compound Annual Growth Rate) formula: CAGR = (end value / start value)**(1/years) - 1, multiplied by 100 to get a percentage.

    This results in the following code.

    import pandas as pd
    import pandas_datareader as pdr
    from datetime import datetime
    import matplotlib.pyplot as plt
    
    
    data = pdr.get_data_yahoo('^GSPC', datetime(1970, 1, 1))
    
    years = []
    avg_annual_return = []
    for year in range(1970, 2011):
        years.append(year)
        data_year = data.loc[f'{year}':f'{year + 9}']['Adj Close']
        avg_annual_return.append(((data_year.iloc[-1] / data_year.iloc[0]) ** (1 / 10) - 1) * 100)
    
    df = pd.DataFrame(avg_annual_return, index=years)
    bins = [i for i in range(-40, 45, 5)]
    df.plot.hist(bins=bins, title='10 years')
    plt.show()
    

    There are a few changes. One is the formula for the average annual return (as stated above) and the other is that we use 10 years of data. Notice that we only add 9 to the year. This is because both end years are inclusive in the date slice.

    This results in this histogram.

    As you see, in 3 cases there was a negative average annual return over a 10 year span, and the loss was only in the range -5% to 0%. Otherwise, the return was positive.

    Now is that nice?

    The Best Trading Strategy Analyzed with Pandas

    What will we cover in this tutorial?

    Believe it or not, there are many promises of the best trading strategies out there.

    In this one we will analyze a promising strategy and see whether it holds or not.

    What is it?

    The bullet proof strategy over a 20 year span

    This looks amazing. How can it perform this well? Simply said, it combines short selling and a safety margin to avoid losing money.

    On the surface, it seems to take precautions. But let’s try to examine it more closely.

    Step 1: Implementing the strategy

    The strategy uses two Moving Averages, a 42 day (MA42) and a 252 day (MA252), with a safety margin (or safety distance) involved.

    The strategy is as follows:

    • When MA42 - MA252 > safety distance, then go long.
    • When MA252 - MA42 > safety distance, then go short.

    To analyze and implement the Strategy, we need Pandas_datareader to read the historic ticker prices.

    import pandas_datareader as pdr
    import numpy as np
    from datetime import datetime
    
    data = pdr.get_data_yahoo('^GSPC', datetime(2000, 1, 1), datetime(2020, 1, 1))
    
    data['MA42'] = data['Adj Close'].rolling(42).mean()
    data['MA252'] = data['Adj Close'].rolling(252).mean()
    
    data['42-252'] = data['MA42'] - data['MA252']
    
    sd = 50
    
    data['Regime'] = np.where(data['42-252'] > sd, 1, 0)
    data['Regime'] = np.where(data['42-252'] < -sd, -1, data['Regime'])
    
    print(data['Regime'].value_counts())
    

    We retrieve data for the S&P 500 (ticker: ^GSPC) by using pdr.get_data_yahoo('^GSPC', datetime(2000, 1, 1), datetime(2020, 1, 1)) and get 20 years of data.

    As you see, the moving averages are calculated with data['Adj Close'].rolling(42).mean().

    The safety distance (sd) is set to 50. This makes sure that we do not go long or short immediately, but only when there is some distance between the moving averages.

    The Regime is the short and long signal. Short when -1 and long when 1. If 0, do nothing.

    We can see an overview of how often we go long and short.

     1    2608
     0    1422
    -1    1001
    

    Step 2: Test the strategy

    This is straightforward.

    import pandas_datareader as pdr
    import numpy as np
    from datetime import datetime
    import matplotlib.pyplot as plt
    
    
    data = pdr.get_data_yahoo('^GSPC', datetime(2000, 1, 1), datetime(2020, 1, 1))
    
    data['MA42'] = data['Adj Close'].rolling(42).mean()
    data['MA252'] = data['Adj Close'].rolling(252).mean()
    
    data['42-252'] = data['MA42'] - data['MA252']
    
    sd = 50
    
    data['Regime'] = np.where(data['42-252'] > sd, 1, 0)
    data['Regime'] = np.where(data['42-252'] < -sd, -1, data['Regime'])
    
    print(data['Regime'].value_counts())
    
    data['Market'] = np.log(data['Adj Close'] / data['Adj Close'].shift(1))
    
    data['Strategy'] = data['Regime'].shift(1) * data['Market']
    
    data[['Market', 'Strategy']].cumsum().apply(np.exp).plot(grid=True, figsize=(8, 5))
    plt.show()
    

    To test the strategy, you need something to compare it against. We compare it against the general market (the S&P 500).

    We use the log-returns, which can later be accumulated with cumsum and passed through the exponential function to get the regular returns back.

    The Strategy uses the Regime, but shifted one day, as we need a day to react to the price, hence data['Regime'].shift(1) * data['Market'].

    Finally, we plot the Strategy against the Market.

    That looks good, right?

    Well, let’s look at it closer.

    Step 3: Deeper analysis of the winning strategy

    First a few notes on short selling in general.

    As noted, the strategy utilizes short selling, which is often not advised for beginning investors. First of all, it is often more expensive, as you borrow the shares you sell.

    Often there is a fee to short sell, and an interest charge on top of that. I have seen fees that are quite high and interest of 4% p.a. This makes it less attractive and needs to be considered.

    But there is more, and this is why I always encourage people to do the analysis themselves. The result above depends on the starting point.

    Let’s take a different period: 2010-2020, then this picture arises.

    The winning strategy does not look winning if you start in 2010.

    Really, you might ask?

    Try it yourself. Often these extremely good strategies, looking too good to be true, are only good under certain circumstances. Like, here, where it depends on the starting point.

    Okay, in the past, if you started in 2000, it would have been good. But not from 2010 and forward; then it looks like a losing strategy.

    What about if you start today?

    If you ask me, this is speculating.

    Want to learn more?

    Check out my popular e-book on backtesting strategies. It will teach how to make better analysis, what to look for, and how to evaluate it. This tutorial only touches a few points.