Pandas: Calculate the Stochastic Oscillator Indicator for Stocks

What is the Stochastic Oscillator Indicator for a stock?

A stochastic oscillator is a momentum indicator comparing a particular closing price of a security to a range of its prices over a certain period of time.

https://www.investopedia.com/terms/s/stochasticoscillator.asp

The stochastic oscillator is an indicator for the speed and momentum of the price. The indicator changes direction before the price does and is therefore a leading indicator.

Step 1: Get stock data to do the calculations on

In this tutorial we will use the Apple stock as an example; it has the ticker AAPL. You can change to any other stock of interest by changing the ticker below. To find the ticker of your favorite company/stock, you can use the Yahoo! Finance ticker lookup.

To get some time series of stock data we will use the Pandas-datareader library to collect it from Yahoo! Finance.

import pandas_datareader as pdr
import datetime as dt

ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())
print(ticker)

Where we only focus on data from 2020 until today.

                  High         Low  ...      Volume   Adj Close
Date                                ...                        
2020-01-02  300.600006  295.190002  ...  33870100.0  298.292145
2020-01-03  300.579987  296.500000  ...  36580700.0  295.392120
2020-01-06  299.959991  292.750000  ...  29596800.0  297.745880
2020-01-07  300.899994  297.480011  ...  27218000.0  296.345581
2020-01-08  304.440002  297.160004  ...  33019800.0  301.112640
...                ...         ...  ...         ...         ...
2020-08-05  441.570007  435.589996  ...  30498000.0  439.457642
2020-08-06  457.649994  439.190002  ...  50607200.0  454.790009
2020-08-07  454.700012  441.170013  ...  49453300.0  444.450012
2020-08-10  455.100006  440.000000  ...  53100900.0  450.910004
2020-08-11  449.929993  436.429993  ...  46871100.0  437.500000
[154 rows x 6 columns]

The output does not show all the columns, which are: High, Low, Open, Close, Volume, and Adj Close.

Step 2: Understand the calculation of Stochastic Oscillator Indicator

The Stochastic Oscillator Indicator consists of two values calculated as follows.

%K = 100 × (Last Close – Lowest Low) / (Highest High – Lowest Low)

%D = Simple Moving Average of %K

%K looks at the lowest low and highest high within a window of a given number of days. The original definition sets the window to 14 days, and that is the common default, but it can be changed; I have seen others use 20 days, as the stock market is open about 20 days per month. The simple moving average for %D was originally set to 3 days.

The values are expressed as percentages (multiplied by 100), so the indicators are in the range of 0% to 100%.

The idea is that readings above 80% are considered to be in the overbought range, while readings below 20% are considered oversold.
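To make the formula concrete, here is a minimal sketch computing %K by hand on made-up prices, using a 3-day window instead of the usual 14 so the arithmetic is easy to follow:

```python
# Toy data (hypothetical prices) with a 3-day window instead of the usual 14
highs  = [10, 12, 11, 13]
lows   = [ 8,  9,  9, 10]
closes = [ 9, 11, 10, 12]
n = 3

k_values = []
for i in range(n - 1, len(closes)):
    lowest_low   = min(lows[i - n + 1 : i + 1])    # lowest low in the window
    highest_high = max(highs[i - n + 1 : i + 1])   # highest high in the window
    k = 100 * (closes[i] - lowest_low) / (highest_high - lowest_low)
    k_values.append(k)

print(k_values)  # [50.0, 75.0]
```

For the first window, the close of 10 sits halfway between the lowest low of 8 and the highest high of 12, giving %K = 50.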

Step 3: Calculate the Stochastic Oscillator Indicator

With the above description, the calculation is straightforward.

import pandas_datareader as pdr
import datetime as dt

ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())
ticker['14-high'] = ticker['High'].rolling(14).max()
ticker['14-low'] = ticker['Low'].rolling(14).min()
ticker['%K'] = (ticker['Close'] - ticker['14-low'])*100/(ticker['14-high'] - ticker['14-low'])
ticker['%D'] = ticker['%K'].rolling(3).mean()
print(ticker)

Resulting in the following output.

                  High         Low  ...         %K         %D
Date                                ...                      
2020-01-02  300.600006  295.190002  ...        NaN        NaN
2020-01-03  300.579987  296.500000  ...        NaN        NaN
2020-01-06  299.959991  292.750000  ...        NaN        NaN
2020-01-07  300.899994  297.480011  ...        NaN        NaN
2020-01-08  304.440002  297.160004  ...        NaN        NaN
...                ...         ...  ...        ...        ...
2020-08-05  441.570007  435.589996  ...  92.997680  90.741373
2020-08-06  457.649994  439.190002  ...  97.981589  94.069899
2020-08-07  454.700012  441.170013  ...  86.939764  92.639677
2020-08-10  455.100006  440.000000  ...  93.331365  92.750906
2020-08-11  449.929993  436.429993  ...  80.063330  86.778153
[154 rows x 10 columns]

Please notice that not all columns are shown here. Also, see that %K and %D are not available for the first days, since 14 days of data are needed before they can be calculated.
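If you want to reuse the calculation, it can be wrapped in a small helper; this is just a sketch of the same steps as above (the function name and parameters are my own choice):

```python
import pandas as pd

def stochastic(df, k_window=14, d_window=3):
    """Return the %K and %D series for a DataFrame with High/Low/Close columns."""
    low = df['Low'].rolling(k_window).min()     # lowest low in the window
    high = df['High'].rolling(k_window).max()   # highest high in the window
    k = (df['Close'] - low) * 100 / (high - low)
    d = k.rolling(d_window).mean()              # simple moving average of %K
    return k, d
```

With this, `ticker['%K'], ticker['%D'] = stochastic(ticker)` reproduces the columns computed above.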

Step 4: Plotting the data on a graph

We will combine two graphs in one. This is easily done with the Pandas DataFrame plot function, where the argument secondary_y plots a series against a secondary y-axis.

The two lines %K and %D are both on the same 0-100 scale, while the stock prices are on a different scale depending on the specific stock.

To keep things simple, we also want to plot horizontal lines marking the 80% high and 20% low levels. This can be done with axhline on the Axes object that plot returns.

The full code results in the following.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt

ticker = pdr.get_data_yahoo("AAPL", dt.datetime(2020, 1, 1), dt.datetime.now())
ticker['14-high'] = ticker['High'].rolling(14).max()
ticker['14-low'] = ticker['Low'].rolling(14).min()
ticker['%K'] = (ticker['Close'] - ticker['14-low'])*100/(ticker['14-high'] - ticker['14-low'])
ticker['%D'] = ticker['%K'].rolling(3).mean()
ax = ticker[['%K', '%D']].plot()
ticker['Adj Close'].plot(ax=ax, secondary_y=True)
ax.axhline(20, linestyle='--', color="r")
ax.axhline(80, linestyle="--", color="r")
plt.show()

Resulting in the following graph.

Apple with Stochastic Oscillator Indicator %K and %D

Step 5: Interpreting the signals

First, a word of warning. Most advise against relying on any single indicator alone as a buy-sell signal, and this also holds for the Stochastic Oscillator. As the name suggests, it is only an indicator, not a predictor.

The indicator signals buy or sell when the two lines cross each other. When %K crosses above %D it signals buy, and when it crosses below, it signals sell.

Looking at the graph, the indicator generates a lot of signals (one every time the two lines cross). This is a good reason to have other indicators to rely on.

A common misconception is that the crossings should only be acted on in the regions below the 20% low or above the 80% high. But the indicator can often stay in those regions for quite some time. Selling as soon as we reached the 80% high in this case would have meant missing a big gain.
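As a sketch of how such crossover signals could be detected programmatically, here is a minimal example using hypothetical %K and %D values (the lists are made up for illustration):

```python
# Hypothetical %K and %D values for illustration
k = [15, 18, 25, 30, 28, 22]
d = [20, 19, 21, 26, 29, 27]

signals = []
for i in range(1, len(k)):
    prev_diff = k[i - 1] - d[i - 1]
    curr_diff = k[i] - d[i]
    if prev_diff <= 0 < curr_diff:
        signals.append((i, "buy"))   # %K crossed above %D
    elif prev_diff >= 0 > curr_diff:
        signals.append((i, "sell"))  # %K crossed below %D

print(signals)  # [(2, 'buy'), (4, 'sell')]
```

On a real DataFrame the same idea can be expressed with a sign change of `ticker['%K'] - ticker['%D']`.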

12% Investment Solution

Would you like to get 12% in return of your investments?

D. A. Carter promises and shows how his simple investment strategy will deliver that in the book The 12% Solution. The book shows how to test this statement by using backtesting.

Did Carter find a strategy that will consistently beat the market?

Actually, it is not that hard to use Python to validate his calculations. But we can do better than that. If you want to work smarter than traditional investors then continue to read here.

Pandas: Calculate the Moving Average Convergence Divergence (MACD) for a Stock

What is the Moving Average Convergence Divergence (MACD) Indicator?

Moving Average Convergence Divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security’s price. The MACD is calculated by subtracting the 26-period Exponential Moving Average (EMA) from the 12-period EMA.

https://www.investopedia.com/terms/m/macd.asp

That is easy to understand, right?

The good news is that it is easy to calculate using the Pandas DataFrames.

Well, the MACD is a technical indicator that helps to understand whether the market is bullish or bearish. That is, it can help the investor decide whether to buy or sell the stock.

Step 1: Get the historic time series stock price data

A great source of historic stock price data is the Pandas-datareader library, which we will use to collect it.

import pandas_datareader as pdr
import datetime as dt
start = dt.datetime(2020, 1, 1)
end = dt.datetime.now()
ticker = pdr.get_data_yahoo("AAPL", start, end)['Adj Close']
print(ticker)

Here we collect it for Apple (ticker AAPL) from the beginning of the year 2020.

Date
2020-01-02    298.292145
2020-01-03    295.392120
2020-01-06    297.745880
2020-01-07    296.345581
2020-01-08    301.112640
                 ...    
2020-08-05    439.457642
2020-08-06    454.790009
2020-08-07    444.450012
2020-08-10    450.910004
2020-08-11    444.095001
Name: Adj Close, Length: 154, dtype: float64

Note that we only keep the Adjusted Close (Adj Close) column to make our calculations.

The Adjusted Close is adjusted for stock splits, dividend payouts, and other corporate actions that affect the price (read more on Investopedia).

Step 2: Make the MACD calculations

The formula for MACD = 12-Period EMA − 26-Period EMA (source)

As the description says, we need the Exponential Moving Averages (EMA) for 12-day and 26-day windows.

Luckily, the Pandas DataFrame provides the function ewm(), which together with mean() can calculate Exponential Moving Averages.

exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2

But more is needed. We also need the signal line, which is defined as follows.

A nine-day EMA of the MACD called the “signal line,” is then plotted on top of the MACD line, which can function as a trigger for buy and sell signals.

https://www.investopedia.com/terms/m/macd.asp

Hence, we end up with the following.

exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
exp3 = macd.ewm(span=9, adjust=False).mean()
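If you are curious what ewm(span=n, adjust=False).mean() actually computes, it is the recursive EMA with alpha = 2 / (n + 1): each value is a weighted blend of the new observation and the previous EMA. A minimal pure-Python sketch of that recursion:

```python
# Manual EMA matching pandas' ewm(span=n, adjust=False).mean():
# alpha = 2 / (n + 1); ema[t] = alpha * x[t] + (1 - alpha) * ema[t-1]
def ema(values, span):
    alpha = 2 / (span + 1)
    result = [values[0]]  # the first EMA value is the first observation
    for x in values[1:]:
        result.append(alpha * x + (1 - alpha) * result[-1])
    return result

print(ema([1, 2, 3], span=3))  # alpha = 0.5 -> [1, 1.5, 2.25]
```

This is only for intuition; in practice the pandas ewm() call above is the way to go.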

Step 3: Plot the data

We need two y-scales for the plot: one for the MACD and its 9-day EMA (the signal line), and one for the actual stock price.

Luckily the Pandas plot method supports having two y-axis.

macd.plot(label='AAPL MACD', color='g')
ax = exp3.plot(label='Signal Line', color='r')
ticker.plot(ax=ax, secondary_y=True, label='AAPL')

As you see, the first two calls to plot use the same axis (the left side), while the final one, on ticker, uses secondary_y (the right-side axis).

Then we need to set up the labels, legends, and axis names.

ax.set_ylabel('MACD')
ax.right_ax.set_ylabel('Price $')
ax.set_xlabel('Date')
lines = ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, [l.get_label() for l in lines], loc='upper left')

The variable lines collects the lines plotted on both y-axes, which are then used to build the legend. This is needed because otherwise only the last legend would be visible.

All together it becomes.

import pandas_datareader as pdr
import datetime as dt
import matplotlib.pyplot as plt
start = dt.datetime(2020, 1, 1)
end = dt.datetime.now()
ticker = pdr.get_data_yahoo("AAPL", start, end)['Adj Close']
exp1 = ticker.ewm(span=12, adjust=False).mean()
exp2 = ticker.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
exp3 = macd.ewm(span=9, adjust=False).mean()
macd.plot(label='AAPL MACD', color='g')
ax = exp3.plot(label='Signal Line', color='r')
ticker.plot(ax=ax, secondary_y=True, label='AAPL')
ax.set_ylabel('MACD')
ax.right_ax.set_ylabel('Price $')
ax.set_xlabel('Date')
lines = ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, [l.get_label() for l in lines], loc='upper left')
plt.show()

Resulting in the graph.

MACD and signal line plotted with the AAPL stock price

When the signal line (red) crosses the MACD line (green), it is time to sell if the green line is below and buy if it is above.

Notice that this is done on historical data, and there is no guarantee it will work in the future. While the results look pretty promising, it is not wise to base your investments on one indicator alone.


Want to learn more?

Check out my FREE online video course with Python for Finance.

Learn Python for Financial Data Analysis with Pandas (Python library) in this free 2-hour, 8-lesson online course.

FREE online course

Also, check this FREE course out.

Learn Python for Finance with Risk and Return with Pandas and NumPy (Python libraries) in this free 2.5-hour, 8-lesson online course.

Want to skill up on Python programming?

Learn Python – Python for Beginners – is an 8+ hour full video course for beginners to master Python.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE eBook with all the learnings from the lessons.

Pandas Correlation Methods Explained: Pearson, Kendall, and Spearman

What will we cover in this tutorial?

In this tutorial, we will use a live example to investigate and understand the differences between the 3 methods of calculating correlation with the Pandas DataFrame corr() function.

The purpose of this tutorial is to get a better understanding of these correlations, while working on real data.

Step 1: Getting some data to play with

The data we want to investigate for correlations is the US GDP and the S&P 500, Gold, and Oil prices. We will only focus on the recent period (2000 to 2020), as the prices for Gold and Oil are not available further back on Yahoo! Finance. We will get the US GDP from the World Bank and the rest from Yahoo! Finance.

We will be using Pandas-datareader to retrieve the data. For a more in-depth introduction to how to use it, we refer you to this tutorial.

import pandas_datareader as pdf
import pandas as pd
import datetime as dt
from pandas_datareader import wb

start = dt.datetime(2000, 1, 1)
end = dt.datetime.now()
tickers = pdf.get_data_yahoo(["^GSPC", "GC=F", "CL=F"], start, end)['Adj Close']
gdp = wb.download(indicator='NY.GDP.MKTP.CD', country='US', start=2000, end=2019)
gdp = gdp.reset_index(1).set_index('year')
gdp.index = pd.to_datetime(gdp.index, format="%Y")
data = gdp.join(tickers, how='outer')
data = data.interpolate(method='linear')
data = data.dropna()
data.columns = ["US GDP", "S&P 500", "Gold", "Oil"]
print(data)

Resulting in the following output.

                  US GDP      S&P 500         Gold        Oil
2000-08-30  1.047113e+13  1502.589966   273.899994  33.400002
2000-08-31  1.047243e+13  1517.680054   278.299988  33.099998
2000-09-01  1.047373e+13  1520.770020   277.000000  33.380001
2000-09-05  1.047503e+13  1507.079956   275.799988  33.799999
2000-09-06  1.047634e+13  1492.250000   274.200012  34.950001
...                  ...          ...          ...        ...
2020-08-05  2.142770e+13  3327.770020  2031.099976  42.189999
2020-08-06  2.142770e+13  3349.159912  2051.500000  41.950001
2020-08-07  2.142770e+13  3351.280029  2046.099976  41.599998
2020-08-09  2.142770e+13  3351.280029  2037.099976  41.590000
2020-08-10  2.142770e+13  3351.280029  2043.900024  41.889999

Where we see the data we want to investigate for correlations.
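The key trick in the script above is the outer join followed by linear interpolation, which spreads the sparse yearly GDP values across the daily price dates. A toy example with made-up numbers shows the idea:

```python
import pandas as pd

# Toy example (made-up numbers): a sparse series joined onto a denser one
gdp = pd.DataFrame({'gdp': [100.0, 120.0]}, index=[1, 3])
prices = pd.DataFrame({'price': [10.0, 11.0, 12.0]}, index=[1, 2, 3])

data = gdp.join(prices, how='outer')       # gdp is NaN at index 2
data = data.interpolate(method='linear')   # fills gdp at index 2 with 110.0
print(data)
```

The outer join keeps every index label from both frames, and interpolate() then fills the resulting gaps linearly between the known values.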

Step 2: Investigate Pearson correlation coefficients

The corr() function on DataFrames calculates the pairwise correlation between columns and returns a correlation matrix.

The default method is the Pearson correlation coefficient. As we will see in this tutorial, correlations can be calculated in different ways. Pearson measures how well the relationship between the variables fits a straight line.

The best way to understand that is by using an example.

Let’s first calculate the correlation matrix using the Pearson method and then try to visualize it to understand it better. You can get the correlation matrix simply by calling corr() on the DataFrame.

print(data.corr())

As it is the default method, you do not need to set it to pearson explicitly. The output will be:

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.897376  0.817294  0.237426
S&P 500  0.897376  1.000000  0.581576 -0.015951
Gold     0.817294  0.581576  1.000000  0.534163
Oil      0.237426 -0.015951  0.534163  1.000000

A few words on the correlation matrix. The output of the correlation function is a number from -1 to 1. Some high-level interpretations of the output:

  • -1: A full negative correlation: if one variable goes up, the other goes down, and they are fully correlated.
  • 0: No correlation at all: the two variables are independent, and if one goes up, you cannot predict what will happen to the other.
  • 1: A full positive correlation: if one variable goes up, so does the other.

Numbers in between indicate how strongly the two variables are dependent.

Looking at the above output, you see that US GDP fully correlates with US GDP. This is obvious, as it is the same variable. Next, we have a 0.897376 correlation between US GDP and the S&P 500 stock market index, which tells us there is a high correlation.

Now to be a bit more specific. This correlation is linear.

That means it can be fitted well with a straight line. Let’s try to visualize that.

import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['S&P 500'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['S&P 500'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()

Resulting in the following fit.

Also, let’s investigate something that does not fit well, the US GDP with Oil prices.

import matplotlib.pyplot as plt
import numpy as np

# Pearson fit (default method)
fit = np.polyfit(x=data['US GDP'], y=data['Oil'], deg=1)
line_fit = np.poly1d(fit)
plt.plot(data['US GDP'], line_fit(data['US GDP']))
plt.scatter(x=data['US GDP'], y=data['Oil'], color='red', alpha=0.1)
plt.title("Pearson correlation")
plt.show()

As you can see visually, this does not fit a straight line as well as the previous example. The closer the markers are to a fitted straight line, the higher the Pearson correlation score. This is independent of the slope of the line, except for its sign: a positive slope gives positive values and a negative slope gives negative values.

A few notes of caution about the Pearson correlation coefficient. The requirement that the variables be normally distributed is controversial and outside the scope of this tutorial. That said, be careful drawing conclusions from the result. It might be an indicator, but do not conclude that a linear correlation exists (or not) based on this number alone.
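For intuition, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal pure-Python sketch (in practice you would of course use the pandas corr() call shown above):

```python
import math

def pearson(x, y):
    # r = covariance(x, y) / (std(x) * std(y))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0: a perfect linear relation
```

Note how the formula is blind to the magnitude of the slope: doubling y would not change the result, only the sign of the slope matters.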

Step 3: Investigating the Kendall rank correlation coefficients

The Kendall rank correlation coefficient does not assume a normal distribution of the variables and is looking for a monotonic relationship between two variables.

Two variables are monotonically correlated if a greater value of one variable always comes with a greater value of the other. If the variables are negatively monotonically correlated, the opposite holds.
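Kendall's tau works by counting concordant versus discordant pairs: a pair of observations is concordant if both variables move in the same direction between them. A minimal sketch of the tau-a variant, which ignores ties (pandas uses a tie-adjusted variant, so results can differ slightly on tied data):

```python
from itertools import combinations

def kendall_tau(x, y):
    # tau-a: (concordant - discordant) / total number of pairs
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # both variables move the same way
        elif s < 0:
            discordant += 1   # the variables move in opposite directions
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))  # ~0.667: 5 of 6 pairs concordant
```

Checking every pair is why Kendall is O(n^2) to compute naively, as noted in Step 5 below.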

The correlation can be calculated as follows.

print(data.corr(method="kendall"))

Resulting in the following output.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.703141  0.685002  0.249430
S&P 500  0.703141  1.000000  0.426406  0.122434
Gold     0.685002  0.426406  1.000000  0.413298
Oil      0.249430  0.122434  0.413298  1.000000

Interestingly, this shows that the Pearson correlation coefficient of US GDP and the S&P 500 is higher than the Kendall rank correlation.

As a rule of thumb, a correlation between -0.8 and 0.8 is often considered not strong. By that measure, US GDP and the S&P 500 seem to have a linear correlation but not a strong monotonic correlation.

Remember that these are two different measures and cannot be compared directly. As they measure different aspects, the difference is not surprising. The Pearson method can be thought of as how close the points are to a fitted line, while the Kendall method asks: if one variable grows, does the other grow too? As you can see on the graph, that is not always the case here; there are many instances where it does not happen.

Step 4: Investigating the Spearman rank correlation

Spearman is closely related to Kendall and also measures whether the variables are monotonically correlated.
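Spearman's coefficient is, in effect, the Pearson correlation computed on the ranks of the values rather than the values themselves. A minimal sketch for tie-free data, using the classic closed-form formula (pandas handles ties for you):

```python
def ranks(values):
    # Rank of each value (1 = smallest); assumes no ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # With no ties: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    # where d is the difference between the ranks of each pair
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [1, 8, 27, 64]))  # 1.0: monotonic, though not linear
```

Because only the ranks matter, a perfectly monotonic but curved relationship (like y = x^3 here) still scores a full 1.0, where Pearson would not.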

The Spearman rank correlation can be computed by the following.

print(data.corr(method="spearman"))

And results in the following output.

           US GDP   S&P 500      Gold       Oil
US GDP   1.000000  0.846197  0.837650  0.317295
S&P 500  0.846197  1.000000  0.609104  0.178937
Gold     0.837650  0.609104  1.000000  0.558569
Oil      0.317295  0.178937  0.558569  1.000000

This is actually a bit more optimistic about the monotonic correlation between US GDP and the S&P 500.

Can we then conclude that when US GDP goes up, the S&P 500 goes up? Good question. The short answer is no. An example might make this clearer. In summer, ice cream sales go up. But sunglass sales also go up in summer. Does higher ice cream sales imply higher sunglass sales? Not really. It is a third factor, more sun, that affects both.

The same can be true for correlations you find in data. Just treat them as an indication that the variables somehow might be connected (or not, if the value is close to 0).

Step 5: When to use what?

This is a good question.

  • The Pearson correlation coefficient is in general considered the stronger measure, as it makes stronger assumptions about the data. On the negative side, it only captures linear dependence (a fit to a straight line), and in theory it requires the variables to be normally distributed. It is very fragile to outliers (single points far away from the norm).
  • The Kendall rank correlation coefficient should be more efficient on smaller data sets. It measures the monotonic relationship between two variables, and it is a bit slower to calculate, O(n^2). It does not require the variables to be normally distributed.
  • The Spearman rank correlation coefficient also measures the monotonic relationship between two variables. It is faster, O(n log(n)), and it often gives a slightly higher value than Kendall's. It also does not require the variables to be normally distributed.