Get started with Pandas and NumPy for Finance for Risk and Return

What will we cover?

In this part we will get familiar with NumPy. We will assume familiarity with the Pandas library. If you are new to Pandas we will suggest you start with this FREE 2h course. This part will look at how Pandas and NumPy is connected.

In this tutorial we will cover the following.

  • Refresher of working with Pandas and Pandas Datareader to use them to read historic stock prices.
  • How Pandas DataFrame and NumPy arrays are related and different.
  • Calculations of return of a portfolio, which is a primary evaluation factor of an investment.
  • Step 1: Get some data with Pandas Datareader

    First, we need some historic time series stock prices. This can be easily done with Pandas Datareader.

    import numpy as np
    import pandas_datareader as pdr
    import datetime as dt
    import pandas as pd
    
    start = dt.datetime(2020, 1, 1)
    data = pdr.get_data_yahoo("AAPL", start)
    

    This will read historic stock prices from Apple (ticker AAPL) starting from 2020 and up until today. The data is in a DataFrame (Pandas main data structure).

    It is a good habit to verify that the data is as expected to avoid surprises later in the process. That can be done by calling head() on the DataFrame data, which will show the first 5 lines.

    data.head()
    

    Resulting in.

                     High        Low       Open      Close       Volume  Adj Close
    Date                                                                          
    2020-01-02  75.150002  73.797501  74.059998  75.087502  135480400.0  74.333511
    2020-01-03  75.144997  74.125000  74.287498  74.357498  146322800.0  73.610840
    2020-01-06  74.989998  73.187500  73.447502  74.949997  118387200.0  74.197395
    2020-01-07  75.224998  74.370003  74.959999  74.597504  108872000.0  73.848442
    2020-01-08  76.110001  74.290001  74.290001  75.797501  132079200.0  75.036385
    

    Recall that the index should be a DatetimeIndex. This makes it possible to take advantage of being a time series.

    data.index
    

    The above gives the index.

    DatetimeIndex(['2020-01-02', '2020-01-03', '2020-01-06', '2020-01-07',
                   '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-13',
                   '2020-01-14', '2020-01-15',
                   ...
                   '2021-03-03', '2021-03-04', '2021-03-05', '2021-03-08',
                   '2021-03-09', '2021-03-10', '2021-03-11', '2021-03-12',
                   '2021-03-15', '2021-03-17'],
                  dtype='datetime64[ns]', name='Date', length=303, freq=None)
    

    To remind ourselves further, we recall that each column in a DataFrame has a datatype.

    data.dtypes
    

    Shown below here.

    High         float64
    Low          float64
    Open         float64
    Close        float64
    Volume       float64
    Adj Close    float64
    dtype: object
    

    Step 2: Investigate how NumPy is different from DataFrames (pandas)

    The next step in our journey is to see how NumPy is different from Pandas DataFrames.

    We can get the DataFrame as a NumPy array as follows.

    arr = data.to_numpy()
    

    The shape of a NumPy array gives the dimensions.

    (303, 6)
    

    Please notice, that you might get more rows than 303, as you run this later than we do here in the tutorial. There will be a row for each day open on the stock exchange market since beginning of 2020.

    But you should get 6 columns, as there are 6 columns in our DataFrame, where the NumPy array comes from.

    The first row of data can be accessed as follows.

    arr[0]
    

    Which gives the the data of the first row, as we know it from the DataFrame.

    [7.51500015e+01 7.37975006e+01 7.40599976e+01 7.50875015e+01
     1.35480400e+08 7.43335114e+01]
    

    Notice the scientific notation. Other than that, you can see the figures are the same.

    Now to an interesting difference from DataFrames. The NumPy array only has one datatype. That means, that all columns have the same datatype. The full array has the same datatype.

    arr.dtype
    

    Resulting in the following output.

    dtype('float64')
    

    To access the top 10 entries of the first column in our NumPy array (the one representing the High column), we can use the following notation.

    small = arr[:10, 0].copy()
    small
    

    Which will output a one-dimensional array of size 10, containing the 10 first values of column 0.

    array([75.15000153, 75.14499664, 74.98999786, 75.22499847, 76.11000061,
           77.60749817, 78.16750336, 79.26750183, 79.39250183, 78.875     ])
    

    Step 3: NumPy functionality

    Some nice functionality to master.

    np.max(small)
    small.max()
    small.argmax()
    

    Where the first two return the maximum value of the array, small. The argmax() returns the index of the maximum value.

    The NumPy functionality works well on DataFrames, which comes in handy when working with financial data.

    We can get the logarithm of values in a NumPy array as follows.

    np.log(small)
    

    Similarly, we can apply the logarithm on all entries in a DataFrame as follows.

    np.log(data)
    

    This is magic.

                    High       Low      Open     Close     Volume  Adj Close
    Date                                                                    
    2020-01-02  4.319486  4.301325  4.304876  4.318654  18.724338   4.308562
    2020-01-03  4.319420  4.305753  4.307943  4.308885  18.801326   4.298792
    2020-01-06  4.317355  4.293025  4.296571  4.316821  18.589471   4.306729
    2020-01-07  4.320484  4.309053  4.316955  4.312107  18.505683   4.302015
    2020-01-08  4.332180  4.307976  4.307976  4.328065  18.698912   4.317973
    

    While the logarithm of all the columns here does not make sense. Later we will use this and it will all make sense.

    Step 4: Calculate the daily return

    We can calculate the daily return as follows.

    data/data.shift()
    

    Resulting in the following output (or first few lines).

                    High       Low      Open     Close    Volume  Adj Close
    Date                                                                   
    2020-01-02       NaN       NaN       NaN       NaN       NaN        NaN
    2020-01-03  0.999933  1.004438  1.003072  0.990278  1.080029   0.990278
    2020-01-06  0.997937  0.987352  0.988693  1.007968  0.809082   1.007968
    2020-01-07  1.003134  1.016157  1.020593  0.995297  0.919626   0.995297
    2020-01-08  1.011765  0.998924  0.991062  1.016086  1.213160   1.016086
    

    Let’s investigate that a bit. Recall the data (you can get the first 5 lines: data.head())

                     High        Low       Open      Close       Volume  Adj Close
    Date                                                                          
    2020-01-02  75.150002  73.797501  74.059998  75.087502  135480400.0  74.333511
    2020-01-03  75.144997  74.125000  74.287498  74.357498  146322800.0  73.610840
    2020-01-06  74.989998  73.187500  73.447502  74.949997  118387200.0  74.197395
    2020-01-07  75.224998  74.370003  74.959999  74.597504  108872000.0  73.848442
    2020-01-08  76.110001  74.290001  74.290001  75.797501  132079200.0  75.036385
    

    Notice the the calculation.

    75.144997/75.150002
    

    Gives.

    0.9999333998687053
    

    Wait. Hence the second row of High divided by the first gives the same value of the second row of data/data.shift().

    This is no coincidence. The line takes each entry in data and divides it with the corresponding entry in data.shift(), and it happens that data.shift() is shifted one forward by date. Hence, it will divide by the previous row.

    Now we understand that, let’s get back to the logarithm. Because, we love log returns. Why? Let’s see this example.

    np.sum(np.log(data/data.shift()))
    

    Giving.

    High         0.502488
    Low          0.507521
    Open         0.515809
    Close        0.492561
    Volume      -1.278826
    Adj Close    0.502653
    dtype: float64
    

    And the following.

    np.log(data/data.iloc[0]).tail(1)
    

    Giving the following.

                    High       Low      Open     Close    Volume  Adj Close
    Date                                                                   
    2021-03-17  0.502488  0.507521  0.515809  0.492561 -1.278826   0.502653
    

    Now why are we so excited about that?

    Well, because we can sum the log daily returns and get the full return. This is really handy when we want to calculate returns of changing portfolios or similar.

    We do not care where the log returns comes from. If our money was invested one day in one portfolio, we get the log return from that. The next day our money is invested in another portfolio. Then we get the log return from that. The sum of those two log returns give the full return.

    That’s the magic.

    Step 5: Reading data from multiple tickers

    We also cover how to reshape data in the video lecture.

    Then we consider how to calculate with portfolio and get the return.

    This requires us to read data from multiple tickers to create a portfolio.

    tickers = ['AAPL', 'MSFT', 'TWTR', 'IBM']
    start = dt.datetime(2020, 1, 1)
    
    data = pdr.get_data_yahoo(tickers, start)
    

    This gives data in the following format.

    Attributes   Adj Close              ...      Volume           
    Symbols           AAPL        MSFT  ...        TWTR        IBM
    Date                                ...                       
    2020-01-02   74.333511  158.571075  ...  10721100.0  3148600.0
    2020-01-03   73.610840  156.596588  ...  14429500.0  2373700.0
    2020-01-06   74.197395  157.001373  ...  12582500.0  2425500.0
    2020-01-07   73.848442  155.569855  ...  13712900.0  3090800.0
    2020-01-08   75.036385  158.047836  ...  14632400.0  4346000.0
    

    Where the column has two layers of names. First, the attributes then the second layer of the tickers.

    If we only want work with the Adj Close values, which is often the case, we can access them as follows.

    data = data['Adj Close']
    

    Giving data in the following format.

    Symbols           AAPL        MSFT       TWTR         IBM
    Date                                                     
    2020-01-02   74.333511  158.571075  32.299999  126.975204
    2020-01-03   73.610840  156.596588  31.520000  125.962540
    2020-01-06   74.197395  157.001373  31.639999  125.737526
    2020-01-07   73.848442  155.569855  32.540001  125.821907
    2020-01-08   75.036385  158.047836  33.049999  126.872055
    

    Now that is convenient.

    Step 6: Calculate a portfolio holdings

    Now consider a portfolio as follows.

    portfolios = [.25, .15, .40, .20]
    

    That is, 25%, 15%, 40%, and 20% to AAPL, MSFT, TWTR, and IBM, respectively.

    Assume we have 10000 USD to invest as above.

    (data/data.iloc[0])*portfolios*100000
    

    What happened there. Well, first we normalize the data with data/data.iloc[0]. This was covered in the previous course.

    Then we multiply with the portfolio and the amount we invest.

    This result in the following.

    Symbols             AAPL          MSFT          TWTR           IBM
    Date                                                              
    2020-01-02  25000.000000  15000.000000  40000.000000  20000.000000
    2020-01-03  24756.949626  14813.223758  39034.057216  19840.494087
    2020-01-06  24954.221177  14851.514331  39182.662708  19805.051934
    2020-01-07  24836.860500  14716.100112  40297.215708  19818.342892
    2020-01-08  25236.391776  14950.504296  40928.792592  19983.752826
    

    As we can see the first row, this distributes the money as the portfolio is allocated. Then it shows how ti evolves.

    We can get the sum of the full return as follows.

    np.sum((data/data.iloc[0])*portfolios*100000, axis=1)
    

    Where we show the summary here.

    Date
    2020-01-02    100000.000000
    2020-01-03     98444.724688
    2020-01-06     98793.450150
    2020-01-07     99668.519212
    2020-01-08    101099.441489
                      ...      
    2021-03-10    162763.421409
    2021-03-11    168255.248962
    2021-03-12    167440.137240
    2021-03-15    171199.207668
    2021-03-17    169031.577658
    Length: 303, dtype: float64
    

    As you see, we start with 100000 USD and end with 169031 USD in this case. You might get a bit different result, as you run yours on a later day.

    This is handy to explore a portfolio composition.

    Actually, when we get to Monte Carlo Simulation, this will be handy. There, we will generate multiple random portfolios and calculate the return and risk for each of them, to optimize the portfolio composition.

    A random portfolio can be generated as follows with NumPy.

    weight = np.random.random(4)
    weight /= weight.sum()
    

    Notice, that we generate 4 random numbers (one for each ticker) and then we divide by the sum of them. This ensures the sum of the weights will be 1, hence, representing a portfolio.

    This was the first lesson.

    Want more?

    Check out the full 2.5 course with prepared JuPyter Notebooks to follow along.

    Leave a Reply