## Python for Finance: Risk and Return

Learn **Python** for Finance with Risk and Return with **Pandas** and **NumPy** (Python libraries) in this 2.5 hour free 8-lessons online course.

The 8 lessons will get you started with **technical analysis** for Risk and Return using **Python** with **Pandas** and **NumPy**.

**The 8 lessons**

**Lesson 1**: Introduction to**Pandas**and**NumPy**–**Portfolios**and**Returns****Lesson 2**:**Risk**and**Volatility**of a stock –**Average True Range****Lesson 3**:**Risk**and**Return**–**Sharpe Ratio****Lesson 4**:**Monte Carlo Simulation**– Optimize portfolio with Risk and Return**Lesson 5**:**Correlation**– How to balance portfolio with Correlation**Lesson 6**:**Linear Regression**– how X causes Y**Lesson 7**:**Beta**– a measure of a stock’s volatility in relation to the overall market.**Lesson 8**:**CAPM**– Relationship between systematic risk and expected return

Sign up for my newsletter. Get my **FREE eBook** and do not miss the next **FREE** online course.

## How to get the most out of this online course?

To get the most out of this course you should do the following.

- Download the
**Jupyter Notebook**used*(link below)*. - Start your
**Jupyter Notebook***(it is free – see how to get it below)*. - Watch one lesson.
- Try it yourself in
**Jupyter Notebook**.

Alternatively, you can go to GitHub and use the links to **Colab** (no installation needed).

## New to Python + Jupyter Notebook + Anaconda?

If you are new to Python and Jypter Notebook and want to get started?

- Go to
**Anaconda**and download the individual edition. - It will install
**Python**and**Jupyter**notebook. That is all you need to get started and it is all free. - Launch Anaconda.
- In Anaconda launch Jupyter Notebook.
- Navigate in Jupyter Notebook to the downloaded Notebooks from the link (button) above.
- Alternatively, you can import them.

## Lesson 1: Get to know Pandas and NumPy

In this part we will get familiar with **NumPy**. We will assume familiarity with the **Pandas** library. If you are new to **Pandas** we will suggest you start with **this FREE 2h course**. This part will look at how **Pandas** and **NumPy** is connected.

### Learning objectives

- Refresher of working with
**Pandas**and**Pandas Datareader**to use them to read historic stock prices. - How
**Pandas****DataFrame**and**NumPy**arrays are related and different. - Calculations of
**return**of a**portfolio**, which is a primary evaluation factor of an investment.

See the Notebook used in lecture on GitHub.

### Lesson 1

First, we need some historic time series stock prices. This can be easily done with **Pandas Datareader**.

```
import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo("AAPL", start)
```

This will read historic stock prices from **Apple** (ticker **AAPL**) starting from 2020 and up until today. The data is in a **DataFrame** (**Pandas** main data structure).

It is a good habit to verify that the data is as expected to avoid surprises later in the process. That can be done by calling **head()** on the **DataFrame** **data**, which will show the first 5 lines.

```
data.head()
```

Resulting in.

```
High Low Open Close Volume Adj Close
Date
2020-01-02 75.150002 73.797501 74.059998 75.087502 135480400.0 74.333511
2020-01-03 75.144997 74.125000 74.287498 74.357498 146322800.0 73.610840
2020-01-06 74.989998 73.187500 73.447502 74.949997 118387200.0 74.197395
2020-01-07 75.224998 74.370003 74.959999 74.597504 108872000.0 73.848442
2020-01-08 76.110001 74.290001 74.290001 75.797501 132079200.0 75.036385
```

Recall that the index should be a **DatetimeIndex**. This makes it possible to take advantage of being a time series.

```
data.index
```

The above gives the index.

```
DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
'2020-01-08', '2020-01-09', '2020-01-10', '2020-01-13',
'2020-01-14', '2020-01-15',
...
'2021-03-03', '2021-03-04', '2021-03-05', '2021-03-08',
'2021-03-09', '2021-03-10', '2021-03-11', '2021-03-12',
'2021-03-15', '2021-03-17'],
dtype='datetime64[ns]', name='Date', length=303, freq=None)
```

To remind ourselves further, we recall that each column in a **DataFrame** has a datatype.

```
data.dtypes
```

Shown below here.

```
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
dtype: object
```

The next step in our journey is to see how **NumPy** is different from **Pandas** **DataFrames**.

We can get the **DataFrame** as a **NumPy** array as follows.

```
arr = data.to_numpy()
```

The shape of a NumPy array gives the dimensions.

```
(303, 6)
```

Please notice, that you might get more rows than 303, as you run this later than we do here in the tutorial. There will be a row for each day open on the stock exchange market since beginning of 2020.

But you should get 6 columns, as there are 6 columns in our **DataFrame**, where the **NumPy** array comes from.

The first row of data can be accessed as follows.

```
arr[0]
```

Which gives the the data of the first row, as we know it from the **DataFrame**.

```
[7.51500015e+01 7.37975006e+01 7.40599976e+01 7.50875015e+01
1.35480400e+08 7.43335114e+01]
```

Notice the scientific notation. Other than that, you can see the figures are the same.

Now to an interesting difference from DataFrames. The NumPy array only has one datatype. That means, that all columns have the same datatype. The full array has the same datatype.

```
arr.dtype
```

Resulting in the following output.

```
dtype('float64')
```

To access the top 10 entries of the first column in our NumPy array (the one representing the High column), we can use the following notation.

```
small = arr[:10, 0].copy()
small
```

Which will output a one-dimensional array of size 10, containing the 10 first values of column 0.

```
array([75.15000153, 75.14499664, 74.98999786, 75.22499847, 76.11000061,
77.60749817, 78.16750336, 79.26750183, 79.39250183, 78.875 ])
```

Some nice functionality to master.

```
np.max(small)
small.max()
small.argmax()
```

Where the first two return the maximum value of the array, **small**. The **argmax() **returns the index of the maximum value.

The **NumPy** functionality works well on **DataFrames**, which comes in handy when working with financial data.

We can get the logarithm of values in a NumPy array as follows.

```
np.log(small)
```

Similarly, we can apply the logarithm on all entries in a DataFrame as follows.

```
np.log(data)
```

This is magic.

```
High Low Open Close Volume Adj Close
Date
2020-01-02 4.319486 4.301325 4.304876 4.318654 18.724338 4.308562
2020-01-03 4.319420 4.305753 4.307943 4.308885 18.801326 4.298792
2020-01-06 4.317355 4.293025 4.296571 4.316821 18.589471 4.306729
2020-01-07 4.320484 4.309053 4.316955 4.312107 18.505683 4.302015
2020-01-08 4.332180 4.307976 4.307976 4.328065 18.698912 4.317973
```

While the logarithm of all the columns here does not make sense. Later we will use this and it will all make sense.

We can calculate the daily return as follows.

```
data/data.shift()
```

Resulting in the following output (or first few lines).

```
High Low Open Close Volume Adj Close
Date
2020-01-02 NaN NaN NaN NaN NaN NaN
2020-01-03 0.999933 1.004438 1.003072 0.990278 1.080029 0.990278
2020-01-06 0.997937 0.987352 0.988693 1.007968 0.809082 1.007968
2020-01-07 1.003134 1.016157 1.020593 0.995297 0.919626 0.995297
2020-01-08 1.011765 0.998924 0.991062 1.016086 1.213160 1.016086
```

Let’s investigate that a bit. Recall the data (you can get the first 5 lines: **data.head()**)

```
High Low Open Close Volume Adj Close
Date
2020-01-02 75.150002 73.797501 74.059998 75.087502 135480400.0 74.333511
2020-01-03 75.144997 74.125000 74.287498 74.357498 146322800.0 73.610840
2020-01-06 74.989998 73.187500 73.447502 74.949997 118387200.0 74.197395
2020-01-07 75.224998 74.370003 74.959999 74.597504 108872000.0 73.848442
2020-01-08 76.110001 74.290001 74.290001 75.797501 132079200.0 75.036385
```

Notice the the calculation.

```
75.144997/75.150002
```

Gives.

```
0.9999333998687053
```

Wait. Hence the second row of H**i**gh divided by the first gives the same value of the second row of **data/data.shift()**.

This is no coincidence. The line takes each entry in **data** and divides it with the corresponding entry in **data.shift()**, and it happens that **data.shift()** is shifted one forward by date. Hence, it will divide by the previous row.

Now we understand that, let’s get back to the logarithm. Because, we love log returns. Why? Let’s see this example.

```
np.sum(np.log(data/data.shift()))
```

Giving.

```
High 0.502488
Low 0.507521
Open 0.515809
Close 0.492561
Volume -1.278826
Adj Close 0.502653
dtype: float64
```

And the following.

```
np.log(data/data.iloc[0]).tail(1)
```

Giving the following.

```
High Low Open Close Volume Adj Close
Date
2021-03-17 0.502488 0.507521 0.515809 0.492561 -1.278826 0.502653
```

Now why are we so excited about that?

Well, because we can sum the log daily returns and get the full return. This is really handy when we want to calculate returns of changing portfolios or similar.

We do not care where the log returns comes from. If our money was invested one day in one portfolio, we get the log return from that. The next day our money is invested in another portfolio. Then we get the log return from that. The sum of those two log returns give the full return.

That’s the magic.

We also cover how to reshape data in the video lecture.

Then we consider how to calculate with portfolio and get the return.

This requires us to read data from multiple tickers to create a portfolio.

```
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo(tickers, start)
```

This gives data in the following format.

```
Attributes Adj Close ... Volume
Symbols AAPL MSFT ... TWTR IBM
Date ...
2020-01-02 74.333511 158.571075 ... 10721100.0 3148600.0
2020-01-03 73.610840 156.596588 ... 14429500.0 2373700.0
2020-01-06 74.197395 157.001373 ... 12582500.0 2425500.0
2020-01-07 73.848442 155.569855 ... 13712900.0 3090800.0
2020-01-08 75.036385 158.047836 ... 14632400.0 4346000.0
```

Where the column has two layers of names. First, the attributes then the second layer of the tickers.

If we only want work with the Adj Close values, which is often the case, we can access them as follows.

```
data = data['Adj Close']
```

Giving data in the following format.

```
Symbols AAPL MSFT TWTR IBM
Date
2020-01-02 74.333511 158.571075 32.299999 126.975204
2020-01-03 73.610840 156.596588 31.520000 125.962540
2020-01-06 74.197395 157.001373 31.639999 125.737526
2020-01-07 73.848442 155.569855 32.540001 125.821907
2020-01-08 75.036385 158.047836 33.049999 126.872055
```

Now that is convenient.

Now consider a portfolio as follows.

```
portfolios = [.25, .15, .40, .20]
```

That is, 25%, 15%, 40%, and 20% to **AAPL**, **MSFT**, **TWTR**, and **IBM**, respectively.

Assume we have 10000 USD to invest as above.

```
(data/data.iloc[0])*portfolios*100000
```

What happened there. Well, first we normalize the data with **data/data.iloc[0]**. This was covered in the previous course.

Then we multiply with the portfolio and the amount we invest.

This result in the following.

```
Symbols AAPL MSFT TWTR IBM
Date
2020-01-02 25000.000000 15000.000000 40000.000000 20000.000000
2020-01-03 24756.949626 14813.223758 39034.057216 19840.494087
2020-01-06 24954.221177 14851.514331 39182.662708 19805.051934
2020-01-07 24836.860500 14716.100112 40297.215708 19818.342892
2020-01-08 25236.391776 14950.504296 40928.792592 19983.752826
```

As we can see the first row, this distributes the money as the portfolio is allocated. Then it shows how ti evolves.

We can get the sum of the full return as follows.

```
np.sum((data/data.iloc[0])*portfolios*100000, axis=1)
```

Where we show the summary here.

```
Date
2020-01-02 100000.000000
2020-01-03 98444.724688
2020-01-06 98793.450150
2020-01-07 99668.519212
2020-01-08 101099.441489
...
2021-03-10 162763.421409
2021-03-11 168255.248962
2021-03-12 167440.137240
2021-03-15 171199.207668
2021-03-17 169031.577658
Length: 303, dtype: float64
```

As you see, we start with 100000 USD and end with 169031 USD in this case. You might get a bit different result, as you run yours on a later day.

This is handy to explore a portfolio composition.

Actually, when we get to Monte Carlo Simulation, this will be handy. There, we will generate multiple random portfolios and calculate the return and risk for each of them, to optimize the portfolio composition.

A random portfolio can be generated as follows with **NumPy**.

```
weight = np.random.random(4)
weight /= weight.sum()
```

Notice, that we generate 4 random numbers (one for each ticker) and then we divide by the sum of them. This ensures the sum of the weights will be 1, hence, representing a portfolio.

This was the first lesson.

## Lesson 2: Volatility (Average True Range) and Risk

In this lesson we will learn about the **Volatility** of a stock and how it is one measure of **investment risk**.

### Learning objectives

- How volatility represent the risk.
- Calculation of
**Average True Range**(**ATR**) – a**volatility**and**risk**measure. - How to visualize the
**volatility**.

See the Notebook used in lecture on GitHub.

### Lesson 2

To get started, we need some historic stock prices. This can be done as follows and is covered in Lesson 1.

```
import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo("NFLX", start)
```

This reads the time series data of the historic stock prices of **Netflix** (ticker **NFLX**).

To calculate the **Average True Range** (**ATR**) we need a formula, which is given on investoperia.org.

The** Average True Range** (**ATR**) is a moving average of the **True Range** (**TR**). And the **TR** is given by the maximum of the current high (**H**) minus current low (**L**), the absolute value of current high (**H**) minus previous close (**Cp**), and the absolute value of current low (**L**) and previous close (**Cp**).

This can look intimidating at first, but don’t worry, this is where the power of **Pandas** **DataFrames** and **Series** come into the picture.

It is always a good idea to make your calculations simple.

```
high_low = data['High'] - data['Low']
high_cp = np.abs(data['High'] - data['Close'].shift())
low_cp = np.abs(data['Low'] - data['Close'].shift())
```

Here we make a **Series** for each of the values needed. Notice, that we get the previous close by using **shift()** (**data[‘Close’].shift()**).

Then a great way to get the maximum value of these is to create a DataFrame with all the values.

```
df = pd.concat([high_low, high_cp, low_cp], axis=1)
true_range = np.max(df, axis=1)
```

Now that is nice.

Then we get the ATR as the moving average of 14 days (14 days is the default).

```
average_true_range = true_range.rolling(14).mean()
```

Finally, let’s try to visualize it. Often visualization helps us understand it better.

```
import matplotlib.pyplot as plt
%matplotlib notebook
fig, ax = plt.subplots()
average_true_range.plot(ax=ax)
ax2 = data['Close'].plot(ax=ax, secondary_y=True, alpha=.3)
ax.set_ylabel("ATR")
ax2.set_ylabel("Price")
```

Resulting in the following chart.

This shows how the **Average True Range** (**ATR**) moves in relation to the stock price. The blue line is **ATR** and the orange (semitransparent is the stock price).

In periods with big changes in price, the **ATR** moves up. When the price is more staple, the **ATR** moves down.

In the next lesson we will learn about how to combine **Risk** and **Return** in one measure, **Sharpe Ratio**.

## Lesson 3: Sharpe Ratio – Combining Risk and Return

In this lesson we will combine **Risk** and **Return** into one number, the **Sharpe Ratio**.

### Learning objectives

**Risk**and**return**combined in one number.- Using standard deviation as a measure for
**risk**. **Sharpe Ratio**calculation combining**risk**and**return**.

See the Notebook used in lecture on GitHub.

### Lesson 3

The **Sharpe Ratio** is the average return earned in excess of the risk-free rate per unit of **volatility** or total risk.

The idea with **Sharpe Ratio**, is to have one number to represent both return and risk. This makes it easy to compare different weights of portfolios. We will use that in the next lesson about **Monte Carlo Simulations** for **Portfolio Optimization**.

Now that is a lot of words. How does the **Sharpe Ratio** look like.

We need the return of the portfolio and the risk free return, as well as the standard deviation of the portfolio.

- The
**return of the portfolio**we covered in lesson 1, but we will calculate it with log returns here. - It is custom for the
**risk free return**to use the**10 Year Treasury Note**, but as it has been low for long time, often 0 is used. - The
**standard deviation**is a measure of the**volatility**, and is used here to represent the**risk**. This is similar to**Average True Range**.

To get started, we need to read time series data of historic stock prices for a portfolio. This can be done as follows.

```
import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
```

Where our portfolio will consist of the tickers for Apple, Microsoft, Twitter and IBM (**AAPL**, **MSFT**, **TWTR**, **IBM**). We read the data from start 2020 from the** Yahoo! Finance API** using **Pandas Datareader.**

Finally, we only keep the **Adjusted Close** price.

Let’s assume our portfolio is balanced as follows, 25%, 15%, 40%, and 20% to **AAPL**, **MSFT**, **TWTR**, **IBM**, respectively.

Then we can calculate the daily log return of the portfolio as follows.

```
portfolio = [.25, .15, .40, .20]
log_return = np.sum(np.log(data/data.shift())*portfolio, axis=1)
```

Where we use the **np.log **to take the logarithm of the daily change, we apply the **portfolio**. Finally, we sum (**np.sum**) along the rows (**axis=1**).

For the fun, we can visualize the daily log returns as follows.

```
import matplotlib.pyplot as plt
%matplotlib notebook
fig, ax = plt.subplots()
log_return.hist(bins=50, ax=ax)
```

Resulting in this.

This gives an impression of how volatile the portfolio is. The more data is centered around 0.0, the less volatile and risky.

The **Sharpe Ratio** can be calculate directly as follows.

```
sharpe_ratio = log_return.mean()/log_return.std()
```

This gives a daily **Sharpe Ratio**, where we have the return to be the mean value. That is, the average return of the investment. And divided by the standard deviation.

The greater is the **standard deviation** the greater the magnitude of the **deviation** from the **mean** value can be expected.

To get an annualized Sharpe Ratio.

```
asr = sharpe_ratio*252**.5
```

This is the measure we will use in the next lesson, where we will optimize the portfolio using Monte Carlo Simulation.

## Lesson 4: Monte Carlo Simulation for Portfolio Optimization

In lesson we will learn about **Monte Carlo Simulation.** First an introduction to the concept and then how to use **Sharpe Ratio** to find the optimal **portfolio** with **Monte Carlo Simulation**.

### Learning objectives

- The principles behind
**Monte Carlo Simulation** - Applying
**Monte Carlo Simulation**using**Sharpe Ratio**to get the optimal**portfolio** - Create a visual
**Efficient Frontier**based on**Sharpe Ratio**

See the Notebook used in lecture on GitHub.

### Lesson 4

**Monte Carlo Simulation** is a great tool to master. It can be used to simulate risk and uncertainty that can affect the outcome of different decision options.

Simply said, if there are too many variables affecting the outcome, then it can simulate them and find the optimal based on the values.

Here we will first use it for simple example, which we can precisely calculate. This is only to get an idea of what Monte Carlo Simulations can do for us.

The game we play.

- You roll two dice.
- When you roll 7, then you gain 5 dollars.
- If you roll anything else than 7, you lose 1 dollar.

How can we simulate this game?

Well, the roll of two dice can be simulated with **NumPy** as follows.

```
import numpy as np
def roll_dice():
return np.sum(np.random.randint(1, 7, 2))
```

Where are roll is simulated with a call to the **roll_dice()**. It simply uses the **np.random.randint(1, 7, 2)**, which returns an array of length 2 with 2 integers in the range 1 to 7 (where 7 is not included, but 1 is). The **np.sum(…) **sums the two integers into the sum of the two simulated dice.

Now to the Monte Carlo Simulation.

This is simply to make a trial run and then see if it is a good game or not.

```
def monte_carlo_simulation(runs=1000):
results = np.zeros(2)
for _ in range(runs):
if roll_dice() == 7:
results[0] += 1
else:
results[1] += 1
return results
```

This is done by keeping track of the how many games I win and lose.

A run could look like this.

```
monte_carlo_simulation()
```

It could return **array([176., 824.])**, which would result in my win of 176*5 = 880 USD and lose of 824 USD. A total gain of 56 USD.

Each run will most likely give different conclusions.

A way to get a more precise picture is to make more runs. Here, we will try to record a series of runs and visualize them.

```
results = np.zeros(1000)
for i in range(1000):
results[i] = monte_carlo_simulation()[0]
import matplotlib.pyplot as plt
%matplotlib notebook
fig, ax = plt.subplots()
ax.hist(results, bins=15)
```

Resulting in this figure.

This gives an idea of how a game of 1000 rolls returns and how volatile it is. See, if the game was less volatile, it would center around one place.

For these particular runs we have that **results.mean()*5** gives the average return of **833.34 USD** (notice, you will not get the exact same number due to the randomness involved).

The average loss will be **1000 – results.mean()** **= 833.332 USD**.

This looks like a pretty even game.

Can we calculate this exactly?

Yes. The reason is, that this is a simple situation are simulating. When we have more variable (as we will have in a moment with portfolio simulation) it will not be the case.

A nice way to visualize it is as follows.

```
d1 = np.arange(1, 7)
d2 = np.arange(1, 7)
mat = np.add.outer(d1, d2)
```

Where the matrix **mat** looks as follows.

```
array([[ 2, 3, 4, 5, 6, 7],
[ 3, 4, 5, 6, 7, 8],
[ 4, 5, 6, 7, 8, 9],
[ 5, 6, 7, 8, 9, 10],
[ 6, 7, 8, 9, 10, 11],
[ 7, 8, 9, 10, 11, 12]])
```

The exact probability for rolling 7 is.

```
mat[mat == 7].size/mat.size
```

Where we count how many occurrences of 7 divided by the number of possibilities. This gives 0.16666666666666667 or 1/5.

Hence, it seems to be a fair game with no advantage. This is the same we concluded with the Monte Carlo Simulation.

Now we have some understanding of Monte Carlo Simulation, we are ready to use it for portfolio optimization.

To do that, we need to read some time series of historic stock prices.

```
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
```

To use it with Sharpe Ratio, we will calculate the log returns.

```
log_returns = np.log(data/data.shift())
```

The Monte Carlo Simulations can be done as follows.

```
# Monte Carlo Simulation
n = 5000
weights = np.zeros((n, 4))
exp_rtns = np.zeros(n)
exp_vols = np.zeros(n)
sharpe_ratios = np.zeros(n)
for i in range(n):
weight = np.random.random(4)
weight /= weight.sum()
weights[i] = weight
exp_rtns[i] = np.sum(log_returns.mean()*weight)*252
exp_vols[i] = np.sqrt(np.dot(weight.T, np.dot(log_returns.cov()*252, weight)))
sharpe_ratios[i] = exp_rtns[i] / exp_vols[i]
```

The code will run 5000 experiments. We will keep all the data from each run. The weights of the portfolios (**weights**), the expected return (**exp_rtns**), the expected volatility (**exp_vols**) and the Sharpe Ratio (**sharpe_ratios**).

Then we iterate over the range.

First we create a random portfolio in **weight** (notice it will have the sum 1). Then we calculate the expected annual return. The expected volatility is calculated a bit different than we learned in the lesson about Sharpe Ratio. This is only to make it perform faster.

Finally, the Sharpe Ratio is calculated.

In this specific run (you might get different values) we get that the maximum Sharpe Ratio is, given by **sharpe_ratios.max(), **1.1398396630767385.

To get the optimal weight (portfolio), call **weights[sharpe_ratios.argmax()]**. In this specific run, **array([4.57478167e-01, 6.75247425e-02, 4.74612301e-01, 3.84789577e-04])**. This concludes to hold 45.7% to **AAPL**, 6.7% to **MSFT**, 47.5% to **TWTR**, and 0,03% to **IBM** is optimal.

This can be visualized as follows in an **Efficient Frontier**.

```
import matplotlib.pyplot as plt
%matplotlib notebook
fig, ax = plt.subplots()
ax.scatter(exp_vols, exp_rtns, c=sharpe_ratios)
ax.scatter(exp_vols[sharpe_ratios.argmax()], exp_rtns[sharpe_ratios.argmax()], c='r')
ax.set_xlabel('Expected Volatility')
ax.set_ylabel('Expected Return')
```

Resulting in this chart.

## Lesson 5: Correlation Calculations

In this lesson we will learn about **correlation** of assets, calculations of **correlation**, and **risk** and **coherence**.

### Learning objectives

- What is
**correlation**and how to use it - Calculate
**correlation** - Find
**negatively correlated**assets

See the Notebook used in lecture on GitHub.

### Lesson 5

**Correlation** is a statistic that measures the degree to which two variables move in relation to each other. **Correlation** measures association, but doesn’t show if x causes y or vice versa.

The correlation between two stocks is a number form -1 to 1 (both inclusive).

- A positive correlation means, when stock x goes up, we expect stock y to go up, and opposite.
- A negative correlation means, when stock x goes up, we expect stock y to go down, and opposite.
- A zero correlation, we cannot say anything in relation to each other.

The formula for calculating the correlation is quite a mouthful.

Luckily, the **DataFrames** can calculate it for us. Hence, we do not need to master how to do it.

Let’s get started. First, we need to load some time series of historic stock prices.

```
import pandas as pd
import pandas_datareader as pdr
import datetime as dt
import numpy as np
tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT']
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
log_returns = np.log(data/data.shift())
```

Where we also calculate the log returns.

The correlation can be calculated as follows.

```
log_returns.corr()
```

That was easy, right? Remember we do it on the log returns to keep it on the same range.

```
Symbols AAPL TWTR IBM MSFT
Symbols
AAPL 1.000000 0.531973 0.518204 0.829547
TWTR 0.531973 1.000000 0.386493 0.563909
IBM 0.518204 0.386493 1.000000 0.583205
MSFT 0.829547 0.563909 0.583205 1.000000
```

We identify, that the correlation on the diagonal is 1.0. This is obvious, since the diagonal shows the correlation between itself (AAPL and AAPL, and so forth).

Other than that, we can conclude that AAPL and MSFT are correlated the most.

Let’s add the S&P 500 to our **DataFrame**.

```
sp500 = pdr.get_data_yahoo("^GSPC", start)
log_returns['SP500'] = np.log(sp500['Adj Close']/sp500['Adj Close'].shift())
log_returns.corr()
```

Resulting in this.

Where we see that AAPL and MSFT are mostly correlated to S&P 500 index. This is not surprising, as they are a big part of the weight of the market cap in the index.

```
def test_correlation(ticker):
df = pdr.get_data_yahoo(ticker, start)
lr = log_returns.copy()
lr[ticker] = np.log(df['Adj Close']/df['Adj Close'].shift())
return lr.corr()
```

This can help us find assets with a negative correlation.

Why do we wan that? Well, to minimize the risk. Read my eBook on the subject if you want to learn more about that.

Now, let’s test.

```
test_correlation("TLT")
```

Resulting in this following.

The negative correlation we are looking for.

This can be visualized to get a better understanding as follows.

```
import matplotlib.pyplot as plt
%matplotlib notebook
def visualize_correlation(ticker1, ticker2):
df = pdr.get_data_yahoo([ticker1, ticker2], start)
df = df['Adj Close']
df = df/df.iloc[0]
fig, ax = plt.subplots()
df.plot(ax=ax)
```

With **visualize_correlation(“AAPL”, “TLT”)** we get.

Where we see, when AAPL goes down, the TLT goes up.

And if we look at **visualize_correlation(“^GSPC”, “TLT”)** (the S&P 500 index and TLT).

## Lesson 6: Linear Regression Calculations

In this lesson we will learn about **Linear Regression, **difference from **Correlation** and how to visualize **Linear Regression**.

### Learning objectives

- Understand the difference between
**Linear Regression**and**Correlation**. - Understand the difference between
**true random**and**correlated**variables **Visualize linear regression.**

See the Notebook used in lecture on GitHub.

### Lesson 6

Let’s first see what the similarities and difference between Linear Regression and Correlation is.

**Similarities**.

- Quantify the direction and strength of the relationship between two variables, here we look at stock prices.

**Differences**.

- Correlation is a single statistic. It is just a number between -1 and 1 (both inclusive).
- Linear regression produces an equation.

A great way to learn about relationships between variables is to compare it to random variables.

Let’s start by doing that.

```
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib notebook
X = np.random.randn(5000)
Y = np.random.randn(5000)
fig, ax = plt.subplots()
ax.scatter(X, Y, alpha=.2)
```

Giving the following scatter chart.

Which shows the how two non-correlated variables look like.

To compare that to two correlated, we need some data.

```
tickers = ['AAPL', 'TWTR', 'IBM', 'MSFT', '^GSPC']
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo(tickers, start)
data = data['Adj Close']
log_returns = np.log(data/data.shift())
```

Let’s make a function to calculate the Liner Regression and visualize it.

```
def linear_regression(ticker_a, ticker_b):
X = log_returns[ticker_a].iloc[1:].to_numpy().reshape(-1, 1)
Y = log_returns[ticker_b].iloc[1:].to_numpy().reshape(-1, 1)
lin_regr = LinearRegression()
lin_regr.fit(X, Y)
Y_pred = lin_regr.predict(X)
alpha = lin_regr.intercept_[0]
beta = lin_regr.coef_[0, 0]
fig, ax = plt.subplots()
ax.set_title("Alpha: " + str(round(alpha, 5)) + ", Beta: " + str(round(beta, 3)))
ax.scatter(X, Y)
ax.plot(X, Y_pred, c='r')
```

The function takes the two tickers and get’s the log returns in **NumPy** arrays. They are reshaped to fit the required format.

The the Linear Regression model (**LinearRegression**) is used and applied to predict values. The alpha and beta are the liner variables. Finally, we scatter plot all the points and a prediction line.

Let’s try** linear_regression(“AAPL”, “^GSPC”)**.

Where we see the red line as the prediction line.

Other examples **linear_regression(“AAPL”, “MSFT”)**

And **linear_regression(“AAPL”, “TWTR”)**.

Where it visually shows that **AAPL** and **TWTR** are not as closely correlated as the other examples.

## Lesson 7: Beta and S&P 500

In this lesson we will learn about market **Beta** with S&P 500 index, how to calculate it, and comparison of calculations from last lesson.

### Learning Objectives

- Understand what market
**Beta**tells you. - How to calculate the market (S&P 500)
**Beta**. - See how
**Beta**is related with**Linear Regression**.

See the Notebook used in lecture on GitHub.

### Lesson 7

**Beta** is a measure of a stock’s volatility in relation to the overall market (S&P 500). The S&P 500 index has **Beta** 1.

High-beta stocks are supposed to be riskier but provide higher potential return. While, low-beta stocks pose less risk but also lower returns.

**Interpretation**

**Beta**above 1: stock is more**volatile**than the market, but expects higher return.**Beta**below 1: stock with lower**volatility**, and expects less return.

The formula for **Beta** is **Covariance** divided by **variance**.

This sound more scary than it is.

The Beta on financial pages, like **Yahoo! Finance**, are calculated on the monthly price.

Let’s make an example here.

```
import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
from sklearn.linear_model import LinearRegression
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM', '^GSPC']
start = dt.datetime(2015, 12, 1)
end = dt.datetime(2021, 1, 1)
data = pdr.get_data_yahoo(tickers, start, end, interval="m")
data = data['Adj Close']
log_returns = np.log(data/data.shift())
```

Where we notice that we read data on **interval=”m”**, which gives the monthly data.

Then the **Beta** is calculated as follows.

```
cov = log_returns.cov()
var = log_returns['^GSPC'].var()
cov.loc['AAPL', '^GSPC']/var
```

For Apple, it was 1.25.

If you wonder if it is related to the **Beta** value from **Linear** **Regression**. Let’s check it out.

```
X = log_returns['^GSPC'].iloc[1:].to_numpy().reshape(-1, 1)
Y = log_returns['AAPL'].iloc[1:].to_numpy().reshape(-1, 1)
lin_regr = LinearRegression()
lin_regr.fit(X, Y)
lin_regr.coef_[0, 0]
```

Also giving 1.25. Hence, it is the same calculation behind it.

## Lesson 8: Capital Asset Pricing Model (CAPM)

In this lesson we will learn about the **CAPM**.

### Learning objectives

- Understand the
**CAPM**(Capital Asset Pricing Model). **Beta**and**CAPM**calculations.**Expected return**of an investment.

See the Notebook used in lecture on GitHub.

### Lesson 8

The CA**P**M calculates the relationship between systematic risk and expected return. There are several assumptions behind the **CAPM** formula that have been shown not to hold in reality. But still, the CAPM formula is still widely used.

The formula is as follows.

Let’s get some data and calculate it.

```
import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM', '^GSPC']
start = dt.datetime(2015, 12, 1)
end = dt.datetime(2021, 1, 1)
data = pdr.get_data_yahoo(tickers, start, end, interval="m")
data = data['Adj Close']
log_returns = np.log(data/data.shift())
```

Again, when we look at the formula, the risk free return is often set to 0. Otherwise, the 10 years treasury note is used. Here, we use 1.38%. You can update it for more up to date value with the link.

```
cov = log_returns.cov()
var = log_returns['^GSPC'].var()
beta = cov.loc['AAPL', '^GSPC']/var
risk_free_return = 0.0138
market_return = .105
expected_return = risk_free_return + beta*(market_return - risk_free_return)
```

Notice, you can calculate it all simultaneously.

```
cov = log_returns.cov()
var = log_returns['^GSPC'].var()
beta = cov.loc['^GSPC']/var
risk_free_return = 0.0138
market_return = .105
expected_return = risk_free_return + beta*(market_return - risk_free_return)
```

## What’s next?

Congratulations and thank you for taking the time to take this course.

How can you help me?

This is free content and has taken me time to create. A great way to support me for free is to share this with friends and colleagues that could benefit from it. Also, share it on social media with your recommendation.

If you like to learn more about investing and financial analysis then check out my course on Udemy.