## What will we cover?

In this part we will get familiar with **NumPy**. We will assume familiarity with the **Pandas** library. If you are new to **Pandas** we will suggest you start with **this FREE 2h course**. This part will look at how **Pandas** and **NumPy** is connected.

In this tutorial we will cover the following.

- Refresher of working with
**Pandas**and**Pandas Datareader**to use them to read historic stock prices. - How
**Pandas****DataFrame**and**NumPy**arrays are related and different. - Calculations of
**return**of a**portfolio**, which is a primary evaluation factor of an investment.

## Step 1: Get some data with Pandas Datareader

First, we need some historic time series stock prices. This can be easily done with **Pandas Datareader**.

```
import numpy as np
import pandas_datareader as pdr
import datetime as dt
import pandas as pd
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo("AAPL", start)
```

This will read historic stock prices from **Apple** (ticker **AAPL**) starting from 2020 and up until today. The data is in a **DataFrame** (**Pandas** main data structure).

It is a good habit to verify that the data is as expected to avoid surprises later in the process. That can be done by calling **head()** on the **DataFrame** **data**, which will show the first 5 lines.

```
data.head()
```

Resulting in.

```
High Low Open Close Volume Adj Close
Date
2020-01-02 75.150002 73.797501 74.059998 75.087502 135480400.0 74.333511
2020-01-03 75.144997 74.125000 74.287498 74.357498 146322800.0 73.610840
2020-01-06 74.989998 73.187500 73.447502 74.949997 118387200.0 74.197395
2020-01-07 75.224998 74.370003 74.959999 74.597504 108872000.0 73.848442
2020-01-08 76.110001 74.290001 74.290001 75.797501 132079200.0 75.036385
```

Recall that the index should be a **DatetimeIndex**. This makes it possible to take advantage of being a time series.

```
data.index
```

The above gives the index.

```
DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
'2020-01-08', '2020-01-09', '2020-01-10', '2020-01-13',
'2020-01-14', '2020-01-15',
...
'2021-03-03', '2021-03-04', '2021-03-05', '2021-03-08',
'2021-03-09', '2021-03-10', '2021-03-11', '2021-03-12',
'2021-03-15', '2021-03-17'],
dtype='datetime64[ns]', name='Date', length=303, freq=None)
```

To remind ourselves further, we recall that each column in a **DataFrame** has a datatype.

```
data.dtypes
```

Shown below here.

```
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
dtype: object
```

## Step 2: Investigate how NumPy is different from DataFrames (pandas)

The next step in our journey is to see how **NumPy** is different from **Pandas** **DataFrames**.

We can get the **DataFrame** as a **NumPy** array as follows.

```
arr = data.to_numpy()
```

The shape of a NumPy array gives the dimensions.

```
(303, 6)
```

Please notice, that you might get more rows than 303, as you run this later than we do here in the tutorial. There will be a row for each day open on the stock exchange market since beginning of 2020.

But you should get 6 columns, as there are 6 columns in our **DataFrame**, where the **NumPy** array comes from.

The first row of data can be accessed as follows.

```
arr[0]
```

Which gives the the data of the first row, as we know it from the **DataFrame**.

```
[7.51500015e+01 7.37975006e+01 7.40599976e+01 7.50875015e+01
1.35480400e+08 7.43335114e+01]
```

Notice the scientific notation. Other than that, you can see the figures are the same.

Now to an interesting difference from DataFrames. The NumPy array only has one datatype. That means, that all columns have the same datatype. The full array has the same datatype.

```
arr.dtype
```

Resulting in the following output.

```
dtype('float64')
```

To access the top 10 entries of the first column in our NumPy array (the one representing the High column), we can use the following notation.

```
small = arr[:10, 0].copy()
small
```

Which will output a one-dimensional array of size 10, containing the 10 first values of column 0.

```
array([75.15000153, 75.14499664, 74.98999786, 75.22499847, 76.11000061,
77.60749817, 78.16750336, 79.26750183, 79.39250183, 78.875 ])
```

## Step 3: NumPy functionality

Some nice functionality to master.

```
np.max(small)
small.max()
small.argmax()
```

Where the first two return the maximum value of the array, **small**. The **argmax() **returns the index of the maximum value.

The **NumPy** functionality works well on **DataFrames**, which comes in handy when working with financial data.

We can get the logarithm of values in a NumPy array as follows.

```
np.log(small)
```

Similarly, we can apply the logarithm on all entries in a DataFrame as follows.

```
np.log(data)
```

This is magic.

```
High Low Open Close Volume Adj Close
Date
2020-01-02 4.319486 4.301325 4.304876 4.318654 18.724338 4.308562
2020-01-03 4.319420 4.305753 4.307943 4.308885 18.801326 4.298792
2020-01-06 4.317355 4.293025 4.296571 4.316821 18.589471 4.306729
2020-01-07 4.320484 4.309053 4.316955 4.312107 18.505683 4.302015
2020-01-08 4.332180 4.307976 4.307976 4.328065 18.698912 4.317973
```

While the logarithm of all the columns here does not make sense. Later we will use this and it will all make sense.

## Step 4: Calculate the daily return

We can calculate the daily return as follows.

```
data/data.shift()
```

Resulting in the following output (or first few lines).

```
High Low Open Close Volume Adj Close
Date
2020-01-02 NaN NaN NaN NaN NaN NaN
2020-01-03 0.999933 1.004438 1.003072 0.990278 1.080029 0.990278
2020-01-06 0.997937 0.987352 0.988693 1.007968 0.809082 1.007968
2020-01-07 1.003134 1.016157 1.020593 0.995297 0.919626 0.995297
2020-01-08 1.011765 0.998924 0.991062 1.016086 1.213160 1.016086
```

Let’s investigate that a bit. Recall the data (you can get the first 5 lines: **data.head()**)

```
High Low Open Close Volume Adj Close
Date
2020-01-02 75.150002 73.797501 74.059998 75.087502 135480400.0 74.333511
2020-01-03 75.144997 74.125000 74.287498 74.357498 146322800.0 73.610840
2020-01-06 74.989998 73.187500 73.447502 74.949997 118387200.0 74.197395
2020-01-07 75.224998 74.370003 74.959999 74.597504 108872000.0 73.848442
2020-01-08 76.110001 74.290001 74.290001 75.797501 132079200.0 75.036385
```

Notice the the calculation.

```
75.144997/75.150002
```

Gives.

```
0.9999333998687053
```

Wait. Hence the second row of H**i**gh divided by the first gives the same value of the second row of **data/data.shift()**.

This is no coincidence. The line takes each entry in **data** and divides it with the corresponding entry in **data.shift()**, and it happens that **data.shift()** is shifted one forward by date. Hence, it will divide by the previous row.

Now we understand that, let’s get back to the logarithm. Because, we love log returns. Why? Let’s see this example.

```
np.sum(np.log(data/data.shift()))
```

Giving.

```
High 0.502488
Low 0.507521
Open 0.515809
Close 0.492561
Volume -1.278826
Adj Close 0.502653
dtype: float64
```

And the following.

```
np.log(data/data.iloc[0]).tail(1)
```

Giving the following.

```
High Low Open Close Volume Adj Close
Date
2021-03-17 0.502488 0.507521 0.515809 0.492561 -1.278826 0.502653
```

Now why are we so excited about that?

Well, because we can sum the log daily returns and get the full return. This is really handy when we want to calculate returns of changing portfolios or similar.

We do not care where the log returns comes from. If our money was invested one day in one portfolio, we get the log return from that. The next day our money is invested in another portfolio. Then we get the log return from that. The sum of those two log returns give the full return.

That’s the magic.

## Step 5: Reading data from multiple tickers

We also cover how to reshape data in the video lecture.

Then we consider how to calculate with portfolio and get the return.

This requires us to read data from multiple tickers to create a portfolio.

```
tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
start = dt.datetime(2020, 1, 1)
data = pdr.get_data_yahoo(tickers, start)
```

This gives data in the following format.

```
Attributes Adj Close ... Volume
Symbols AAPL MSFT ... TWTR IBM
Date ...
2020-01-02 74.333511 158.571075 ... 10721100.0 3148600.0
2020-01-03 73.610840 156.596588 ... 14429500.0 2373700.0
2020-01-06 74.197395 157.001373 ... 12582500.0 2425500.0
2020-01-07 73.848442 155.569855 ... 13712900.0 3090800.0
2020-01-08 75.036385 158.047836 ... 14632400.0 4346000.0
```

Where the column has two layers of names. First, the attributes then the second layer of the tickers.

If we only want work with the Adj Close values, which is often the case, we can access them as follows.

```
data = data['Adj Close']
```

Giving data in the following format.

```
Symbols AAPL MSFT TWTR IBM
Date
2020-01-02 74.333511 158.571075 32.299999 126.975204
2020-01-03 73.610840 156.596588 31.520000 125.962540
2020-01-06 74.197395 157.001373 31.639999 125.737526
2020-01-07 73.848442 155.569855 32.540001 125.821907
2020-01-08 75.036385 158.047836 33.049999 126.872055
```

Now that is convenient.

## Step 6: Calculate a portfolio holdings

Now consider a portfolio as follows.

```
portfolios = [.25, .15, .40, .20]
```

That is, 25%, 15%, 40%, and 20% to **AAPL**, **MSFT**, **TWTR**, and **IBM**, respectively.

Assume we have 10000 USD to invest as above.

```
(data/data.iloc[0])*portfolios*100000
```

What happened there. Well, first we normalize the data with **data/data.iloc[0]**. This was covered in the previous course.

Then we multiply with the portfolio and the amount we invest.

This result in the following.

```
Symbols AAPL MSFT TWTR IBM
Date
2020-01-02 25000.000000 15000.000000 40000.000000 20000.000000
2020-01-03 24756.949626 14813.223758 39034.057216 19840.494087
2020-01-06 24954.221177 14851.514331 39182.662708 19805.051934
2020-01-07 24836.860500 14716.100112 40297.215708 19818.342892
2020-01-08 25236.391776 14950.504296 40928.792592 19983.752826
```

As we can see the first row, this distributes the money as the portfolio is allocated. Then it shows how ti evolves.

We can get the sum of the full return as follows.

```
np.sum((data/data.iloc[0])*portfolios*100000, axis=1)
```

Where we show the summary here.

```
Date
2020-01-02 100000.000000
2020-01-03 98444.724688
2020-01-06 98793.450150
2020-01-07 99668.519212
2020-01-08 101099.441489
...
2021-03-10 162763.421409
2021-03-11 168255.248962
2021-03-12 167440.137240
2021-03-15 171199.207668
2021-03-17 169031.577658
Length: 303, dtype: float64
```

As you see, we start with 100000 USD and end with 169031 USD in this case. You might get a bit different result, as you run yours on a later day.

This is handy to explore a portfolio composition.

Actually, when we get to Monte Carlo Simulation, this will be handy. There, we will generate multiple random portfolios and calculate the return and risk for each of them, to optimize the portfolio composition.

A random portfolio can be generated as follows with **NumPy**.

```
weight = np.random.random(4)
weight /= weight.sum()
```

Notice, that we generate 4 random numbers (one for each ticker) and then we divide by the sum of them. This ensures the sum of the weights will be 1, hence, representing a portfolio.

This was the first lesson.

## Want more?

Check out the full 2.5 course with prepared JuPyter Notebooks to follow along.