How to use Linear Regression to Calculate the Beta to the General Market (S&P 500)

What will we cover?

In this lesson we will learn about Linear Regression, how it differs from Correlation, and how to visualize Linear Regression.

The objectives of this tutorial are as follows.

  • Understand the difference between Linear Regression and Correlation.
  • Understand the difference between truly random and correlated variables.
  • Visualize linear regression.

Step 1: Similarities and differences between linear regression and correlation

Let’s first see what the similarities and differences between Linear Regression and Correlation are.

Similarities.

  • Both quantify the direction and strength of the relationship between two variables. Here we look at stock prices.

Differences.

  • Correlation is a single statistic. It is just a number between -1 and 1 (both inclusive).
  • Linear regression produces an equation.
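
To make the difference concrete, here is a minimal sketch on synthetic data. The variable names, the slope of 2 and the intercept of 0.5 are made up for illustration. The correlation comes out as a single number, while the regression gives a slope and an intercept that together define a line.

```python
import numpy as np

# Synthetic data for illustration only: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.5 + rng.normal(scale=0.5, size=1000)

# Correlation: a single number between -1 and 1.
corr = np.corrcoef(x, y)[0, 1]

# Linear regression: an equation y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

print(f"correlation: {corr:.3f}")
print(f"regression:  y = {slope:.3f} * x + {intercept:.3f}")
```

The correlation tells us how tightly the points follow a line, but only the regression tells us which line.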

Step 2: Visualize data with no correlation

A great way to learn about relationships between variables is to compare them to random variables.

Let’s start by doing that.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib notebook
 
X = np.random.randn(5000)
Y = np.random.randn(5000)
 
fig, ax = plt.subplots()
ax.scatter(X, Y, alpha=.2)

Giving the following scatter chart.

This shows what two uncorrelated variables look like.
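
The same point can be checked numerically. This is a small sketch that assumes a seeded NumPy generator (so the run is repeatable) instead of np.random.randn; the correlation of two independent samples should land close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the run is repeatable
X = rng.standard_normal(5000)
Y = rng.standard_normal(5000)

# Off-diagonal entry of the 2x2 correlation matrix
corr = np.corrcoef(X, Y)[0, 1]
print(f"correlation of two independent samples: {corr:.4f}")
```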

Step 3: How to visualize correlated stock prices

To compare that to two correlated variables, we need some data.

tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT', '^GSPC']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
log_returns = np.log(data/data.shift())

Let’s make a function to calculate the Linear Regression and visualize it.

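def linear_regression(ticker_a, ticker_b):
    # Skip the first row (NaN from the shift) and reshape to column vectors
    X = log_returns[ticker_a].iloc[1:].to_numpy().reshape(-1, 1)
    Y = log_returns[ticker_b].iloc[1:].to_numpy().reshape(-1, 1)
 
    # Fit Y = alpha + beta * X
    lin_regr = LinearRegression()
    lin_regr.fit(X, Y)
 
    # Predicted Y values along the fitted line
    Y_pred = lin_regr.predict(X)
 
    alpha = lin_regr.intercept_[0]  # intercept of the line
    beta = lin_regr.coef_[0, 0]     # slope of the line
 
    # Scatter plot of the returns with the regression line in red
    fig, ax = plt.subplots()
    ax.set_title("Alpha: " + str(round(alpha, 5)) + ", Beta: " + str(round(beta, 3)))
    ax.scatter(X, Y)
    ax.plot(X, Y_pred, c='r')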

The function takes the two tickers and gets the log returns as NumPy arrays, with ticker_a as X and ticker_b as Y. They are reshaped to fit the format that LinearRegression requires.

Then the Linear Regression model (LinearRegression) is fitted and used to predict values. The alpha (intercept) and beta (slope) are the two parameters of the fitted line. Finally, we scatter plot all the points and draw the prediction line.
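
The slope of a one-variable least-squares fit can also be computed directly as the covariance of X and Y divided by the variance of X. The sketch below checks this on synthetic returns. The numbers are invented, and np.polyfit is used here in place of LinearRegression only to keep the example free of extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up daily log returns: y follows x with a slope of 0.8 plus noise.
x = rng.normal(0.0, 0.01, size=500)
y = 0.0002 + 0.8 * x + rng.normal(0.0, 0.005, size=500)

# Least-squares fit of y = alpha + beta * x
beta_fit, alpha_fit = np.polyfit(x, y, 1)

# Closed form: beta = cov(x, y) / var(x), alpha = mean(y) - beta * mean(x)
beta_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
alpha_cov = y.mean() - beta_cov * x.mean()

print(f"beta  (fit): {beta_fit:.4f}   beta  (cov/var): {beta_cov:.4f}")
print(f"alpha (fit): {alpha_fit:.6f} alpha (closed form): {alpha_cov:.6f}")
```

Both routes give the same line, which is why beta is often described as a scaled covariance.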

Let’s try linear_regression("AAPL", "^GSPC").

Here the red line is the prediction line.

Step 4: A few more examples

Another example is linear_regression("AAPL", "MSFT").

And linear_regression("AAPL", "TWTR").

This shows visually that AAPL and TWTR are not as closely correlated as the other examples.

Want more?

This is part of an 8-lesson, 2.5-hour video course with prepared Jupyter Notebooks containing the Python code.

How to Calculate Correlation between Stock Price Movements with Python

What will we cover?

In this lesson we will learn about correlation of assets, calculations of correlation, and risk and coherence.

The learning objectives of this tutorial.

  • What is correlation and how to use it
  • Calculate correlation
  • Find negatively correlated assets

Step 1: What is Correlation

Correlation is a statistic that measures the degree to which two variables move in relation to each other. Correlation measures association, but doesn’t show if x causes y or vice versa.

The correlation between two stocks is a number from -1 to 1 (both inclusive).

  • A positive correlation means that when stock x goes up, we expect stock y to go up, and vice versa.
  • A negative correlation means that when stock x goes up, we expect stock y to go down, and vice versa.
  • A zero correlation means that we cannot say anything about how the two move in relation to each other.

The formula for calculating the correlation is quite a mouthful.
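
For reference, the standard (Pearson) correlation is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below spells that out and compares it with NumPy's built-in result. The helper name pearson and the small sample lists are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 5.0, 4.0, 5.0]

print(pearson(a, b))            # computed by hand
print(np.corrcoef(a, b)[0, 1])  # computed by NumPy
```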

Step 2: Calculate the Correlation with DataFrames (pandas)

Luckily, pandas DataFrames can calculate it for us, so we do not need to do it by hand.

Let’s get started. First, we need to load some time series of historic stock prices.

See this tutorial on how to work with portfolios.

import pandas as pd
import pandas_datareader as pdr
import datetime as dt
import numpy as np
 
tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
 
log_returns = np.log(data/data.shift())

Where we also calculate the log returns.

The correlation can be calculated as follows.

log_returns.corr()

That was easy, right? Remember that we calculate it on the log returns so that all the series are on a comparable scale.
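
As a reminder of what the log returns look like, here is a minimal sketch on a made-up price series. It uses plain NumPy slicing, which is equivalent to np.log(data/data.shift()) without the leading NaN row, and shows that log returns add up to the log return of the whole period.

```python
import numpy as np

prices = np.array([100.0, 110.0, 99.0, 105.0])  # made-up prices

# Log return of each step: log(price today / price yesterday)
log_rets = np.log(prices[1:] / prices[:-1])

# Log returns are additive: their sum is the log return over the whole period.
total = np.log(prices[-1] / prices[0])

print(log_rets)
print(log_rets.sum(), total)
```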

Symbols      AAPL      TWTR       IBM      MSFT
Symbols
AAPL     1.000000  0.531973  0.518204  0.829547
TWTR     0.531973  1.000000  0.386493  0.563909
IBM      0.518204  0.386493  1.000000  0.583205
MSFT     0.829547  0.563909  0.583205  1.000000

Notice that the correlation on the diagonal is 1.0. This is expected, since the diagonal shows the correlation of each stock with itself (AAPL and AAPL, and so forth).

Other than that, we can conclude that AAPL and MSFT are correlated the most.

Step 3: Calculate the correlation to the general market

Let’s add the S&P 500 to our DataFrame.

sp500 = pdr.get_data_yahoo("^GSPC", start)
 
log_returns['SP500'] = np.log(sp500['Adj Close']/sp500['Adj Close'].shift())
 
log_returns.corr()

Resulting in this.

Here we see that AAPL and MSFT are the most correlated with the S&P 500 index. This is not surprising, as they make up a large share of the market-cap weight in the index.

Step 4: Find Negatively Correlated Assets when Investing using Python

We will add this helper function to help find correlations.

We are in particular interested in negative correlation here.

def test_correlation(ticker):
    df = pdr.get_data_yahoo(ticker, start)
    lr = log_returns.copy()
    lr[ticker] = np.log(df['Adj Close']/df['Adj Close'].shift())
    return lr.corr()

This can help us find assets with a negative correlation.

Why do we want that? Well, to minimize the risk. Read my eBook on the subject if you want to learn more about that.
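
Why negative correlation lowers risk can be illustrated with a small sketch. The two return series are synthetic, drawn with an assumed volatility of 1% and an assumed correlation of -0.5; they are not real market data. A 50/50 mix of the two comes out less volatile than either series on its own.

```python
import numpy as np

rng = np.random.default_rng(7)

sigma = 0.01  # assumed daily volatility of each asset
rho = -0.5    # assumed correlation between the two assets
cov = np.array([[sigma ** 2, rho * sigma ** 2],
                [rho * sigma ** 2, sigma ** 2]])

# 2000 made-up daily returns for two assets
returns = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
a, b = returns[:, 0], returns[:, 1]

# Equal-weight portfolio of the two assets
portfolio = 0.5 * a + 0.5 * b

print(f"correlation:       {np.corrcoef(a, b)[0, 1]:.3f}")
print(f"volatility of a:   {a.std():.5f}")
print(f"volatility of b:   {b.std():.5f}")
print(f"volatility of mix: {portfolio.std():.5f}")
```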

Now, let’s test.

test_correlation("TLT")

Resulting in the following.

This is the negative correlation we are looking for.

Step 5: Visualize the negative correlation

This can be visualized to get a better understanding as follows.

import matplotlib.pyplot as plt
%matplotlib notebook
 
def visualize_correlation(ticker1, ticker2):
    df = pdr.get_data_yahoo([ticker1, ticker2], start)
    df = df['Adj Close']
    df = df/df.iloc[0]
    fig, ax = plt.subplots()
    df.plot(ax=ax)

With visualize_correlation("AAPL", "TLT") we get the following.

Here we see that when AAPL goes down, TLT goes up.

We can also look at visualize_correlation("^GSPC", "TLT") (the S&P 500 index and TLT).

12% Investment Solution

Would you like to get 12% in return of your investments?

D. A. Carter promises and shows how his simple investment strategy will deliver that in the book The 12% Solution. The book shows how to test this statement by using backtesting.

Did Carter find a strategy that will consistently beat the market?

Actually, it is not that hard to use Python to validate his calculations. But we can do better than that. If you want to work smarter than traditional investors then continue to read here.

Want more?

This is part of a full FREE course with all the code available on my GitHub.

Monte Carlo Simulation to Optimize a Portfolio using Pandas and NumPy

What will we cover?

In this tutorial we will learn about Monte Carlo Simulation. 

First an introduction to the concept and then how to use Sharpe Ratio to find the optimal portfolio with Monte Carlo Simulation.

The learning objectives will be.

  • The principles behind Monte Carlo Simulation
  • Applying Monte Carlo Simulation using Sharpe Ratio to get the optimal portfolio
  • Create a visual Efficient Frontier based on Sharpe Ratio

Step 1: What is Monte Carlo Simulation

Monte Carlo Simulation is a great tool to master. It can be used to simulate risk and uncertainty that can affect the outcome of different decision options.

Simply said, if there are too many variables affecting the outcome to calculate it directly, then it can simulate them and find the optimum based on the simulated values.

Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. It is a technique used to understand the impact of risk and uncertainty in prediction and forecasting models.

https://www.investopedia.com/terms/m/montecarlosimulation.asp

Step 2: A simple example to demonstrate Monte Carlo Simulation

Here we will first use it for a simple example, which we can also calculate exactly. This is only to get an idea of what Monte Carlo Simulations can do for us.

The game we play.

  • You roll two dice. 
  • When you roll 7, then you gain 5 dollars.
  • If you roll anything other than 7, you lose 1 dollar.

How can we simulate this game?

Well, the roll of two dice can be simulated with NumPy as follows.

import numpy as np
 
def roll_dice():
    return np.sum(np.random.randint(1, 7, 2))

Here a roll is simulated with a call to roll_dice(). It simply uses np.random.randint(1, 7, 2), which returns an array of 2 integers in the range 1 to 7 (where 1 is included but 7 is not). The np.sum(…) call adds the two integers to give the sum of the two simulated dice.

Now to the Monte Carlo Simulation.

This is simply to make a trial run and then see if it is a good game or not.

def monte_carlo_simulation(runs=1000):
    results = np.zeros(2)
    for _ in range(runs):
        if roll_dice() == 7:
            results[0] += 1
        else:
            results[1] += 1
    return results

This is done by keeping track of how many games I win and lose.

A run could look like this.

monte_carlo_simulation()

It could return array([176., 824.]), which would mean winning 176*5 = 880 USD and losing 824 USD, for a total gain of 56 USD.
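
The arithmetic can be wrapped in a small helper. The function name net_gain and its parameters are hypothetical and only serve this illustration.

```python
import numpy as np

def net_gain(results, win_amount=5, loss_amount=1):
    """Net result in USD for an array of [wins, losses]."""
    wins, losses = results
    return wins * win_amount - losses * loss_amount

# The example run from the text: 176 wins and 824 losses
print(net_gain(np.array([176., 824.])))  # 176*5 - 824*1 = 56.0
```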

Each run will most likely give different conclusions.

Step 3: Visualize the result of Monte Carlo Simulation Example

A way to get a more precise picture is to make more runs. Here, we will try to record a series of runs and visualize them.

results = np.zeros(1000)
 
for i in range(1000):
    results[i] = monte_carlo_simulation()[0]
 
import matplotlib.pyplot as plt
%matplotlib notebook
 
fig, ax = plt.subplots()
ax.hist(results, bins=15)

Resulting in this figure.

This gives an idea of how a game of 1000 rolls turns out and how volatile it is. If the game were less volatile, the results would be more tightly centered around one value.

For these particular runs, results.mean()*5 gives an average win of 833.34 USD (notice that you will not get the exact same number due to the randomness involved).

The average loss will be 1000 - results.mean() = 833.332 USD.

This looks like a pretty even game.
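
The same experiment can be written without Python loops. This is a sketch that assumes a seeded NumPy generator and a single array holding all 1000 games of 1000 rolls; it is an alternative formulation, not the code used for the figure.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

# 1000 games of 1000 rolls each; every roll is the sum of two dice.
rolls = rng.integers(1, 7, size=(1000, 1000, 2)).sum(axis=2)

# Number of 7s (wins) in each game
wins = (rolls == 7).sum(axis=1)

avg_gain = wins.mean() * 5     # average winnings per game in USD
avg_loss = 1000 - wins.mean()  # average losses per game in USD

print(f"average wins per game: {wins.mean():.2f}")
print(f"average gain: {avg_gain:.2f} USD, average loss: {avg_loss:.2f} USD")
```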

Step 4: Making the precise calculation of the example

Can we calculate this exactly?

Yes. The reason is that we are simulating a simple situation. When we have more variables (as we will have in a moment with portfolio simulation), this will not be the case.

A nice way to visualize it is as follows.

d1 = np.arange(1, 7)
d2 = np.arange(1, 7)
mat = np.add.outer(d1, d2)

Where the matrix mat looks as follows.

array([[ 2,  3,  4,  5,  6,  7],
       [ 3,  4,  5,  6,  7,  8],
       [ 4,  5,  6,  7,  8,  9],
       [ 5,  6,  7,  8,  9, 10],
       [ 6,  7,  8,  9, 10, 11],
       [ 7,  8,  9, 10, 11, 12]])

The exact probability for rolling 7 is.

mat[mat == 7].size/mat.size

Here we count the occurrences of 7 and divide by the number of possible outcomes. This gives 0.16666666666666666, or 1/6.

Hence, it is a fair game with no advantage to either side. This is the same conclusion we reached with the Monte Carlo Simulation.
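
The fairness of the game follows from the expected value of a single roll. The sketch below uses exact fractions from the standard library, so no rounding is involved.

```python
from fractions import Fraction

p_win = Fraction(6, 36)  # six of the 36 outcomes sum to 7
p_lose = 1 - p_win       # the remaining 30 outcomes

# Win 5 dollars with probability p_win, lose 1 dollar with probability p_lose
expected_value = p_win * 5 - p_lose * 1

print(p_win, expected_value)  # 1/6 0
```

An expected value of zero per roll means that neither side has an edge in the long run.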

Step 5: Using Monte Carlo Simulation for Portfolio Optimization

Now that we have some understanding of Monte Carlo Simulation, we are ready to use it for portfolio optimization.

To do that, we need to read some time series of historic stock prices. See this tutorial to learn more on that.

import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
 
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
 
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']

To use it with Sharpe Ratio, we will calculate the log returns.

log_returns = np.log(data/data.shift())

The Monte Carlo Simulations can be done as follows.

# Monte Carlo Simulation
n = 5000
 
weights = np.zeros((n, 4))
exp_rtns = np.zeros(n)
exp_vols = np.zeros(n)
sharpe_ratios = np.zeros(n)
 
for i in range(n):
    weight = np.random.random(4)
    weight /= weight.sum()
    weights[i] = weight
     
    exp_rtns[i] = np.sum(log_returns.mean()*weight)*252
    exp_vols[i] = np.sqrt(np.dot(weight.T, np.dot(log_returns.cov()*252, weight)))
    sharpe_ratios[i] = exp_rtns[i] / exp_vols[i]

The code will run 5000 experiments. We keep all the data from each run: the weights of the portfolios (weights), the expected return (exp_rtns), the expected volatility (exp_vols) and the Sharpe Ratio (sharpe_ratios).

Then we iterate over the range.

First we create a random portfolio in weight (notice that it is normalized to sum to 1). Then we calculate the expected annual return. The expected volatility is calculated a bit differently than we learned in the lesson about Sharpe Ratio. This is only to make it perform faster.
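
The matrix expression for the volatility gives the same number as taking the standard deviation of the weighted portfolio returns directly. The sketch below checks this on synthetic data. The returns and the weights are made up, and np.cov stands in for the DataFrame's cov() method.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up daily log returns for 4 assets over 250 days
returns = rng.normal(0.0005, 0.01, size=(250, 4))

weight = np.array([0.4, 0.3, 0.2, 0.1])  # an arbitrary portfolio summing to 1
cov = np.cov(returns, rowvar=False)      # 4x4 sample covariance matrix

# Matrix form, as in the simulation loop: sqrt(w^T * (cov * 252) * w)
vol_matrix = np.sqrt(weight @ (cov * 252) @ weight)

# Direct form: annualized standard deviation of the portfolio's daily returns
vol_direct = (returns @ weight).std(ddof=1) * np.sqrt(252)

print(vol_matrix, vol_direct)
```

With the matrix form the covariance matrix does not depend on the weights, so it could be computed once outside the loop and reused for every random portfolio.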

Finally, the Sharpe Ratio is calculated.

In this specific run (you might get different values), the maximum Sharpe Ratio, given by sharpe_ratios.max(), is 1.1398396630767385.

To get the optimal weights (portfolio), call weights[sharpe_ratios.argmax()]. In this specific run, it returns array([4.57478167e-01, 6.75247425e-02, 4.74612301e-01, 3.84789577e-04]). This suggests that, among the simulated portfolios, holding 45.7% in AAPL, 6.8% in MSFT, 47.5% in TWTR, and 0.04% in IBM is optimal.
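
To read the result more easily, the weights can be paired with the tickers. This sketch reuses the array from the run quoted in the text; the dictionary name allocation is made up for illustration.

```python
import numpy as np

tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']

# Weights of the best portfolio from the run quoted in the text
best = np.array([4.57478167e-01, 6.75247425e-02, 4.74612301e-01, 3.84789577e-04])

# Percentage allocation per ticker, rounded to two decimals
allocation = {t: round(float(w) * 100, 2) for t, w in zip(tickers, best)}

for ticker, pct in allocation.items():
    print(f"{ticker}: {pct}%")
```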

Step 6: Visualizing the Monte Carlo Simulation of the Efficient Frontier

This can be visualized as follows in an Efficient Frontier.

import matplotlib.pyplot as plt
%matplotlib notebook
 
fig, ax = plt.subplots()
ax.scatter(exp_vols, exp_rtns, c=sharpe_ratios)
ax.scatter(exp_vols[sharpe_ratios.argmax()], exp_rtns[sharpe_ratios.argmax()], c='r')
ax.set_xlabel('Expected Volatility')
ax.set_ylabel('Expected Return')

Resulting in this chart.

Want more?

This is part of a full course on Financial Risk and Return with Pandas and NumPy.

The code is available on GitHub.