This is part of the FREE online course on my page. No signup required, and 2 hours of free video content with code and Jupyter Notebooks available on GitHub.
In this lesson we will learn how to add new columns calculated from values in other columns in our DataFrame. This is similar to calculating on data from different columns in Excel.
Then we will demonstrate some useful functions when working with financial data.
Finally, we will show how to remove (or drop) columns from our DataFrame.
Step 1: Load the data.
As usual we need to load the data into our DataFrame. You can get the CSV file from here.
import pandas as pd
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)
It is always a good habit to inspect the data with data.head() (see the video lesson or the Notebook link below the video for the expected output).
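To run the check in your notebook:
data.head()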
Step 2: Create new columns in a DataFrame (Pandas)
To create a new column in our dataset, simply write as follows.
data['Daily chg'] = data['Close'] - data['Open']
The above statement will create a new column named Daily chg containing the difference between the Close and Open columns.
Similarly, you can create a column with the normalized data as follows.
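A minimal sketch of such a column, assuming we normalize by the first available Close price (as in the normalization step later in this tutorial); the column name Normalized is my own choice:
data['Normalized'] = data['Close'] / data['Close'].iloc[0]
To remove (drop) columns we no longer need, we call the DataFrame's drop(…) method. Given the arguments explained below, the call looks like this:
data.drop(labels=['High', 'Low', 'Adj Close', 'Volume'], axis=1, inplace=True)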
labels=['High', 'Low', 'Adj Close', 'Volume'] sets the labels of the columns we want to remove.
axis=1 sets the axis of the labels. The default is 0, which looks for the labels on the index, while axis 1 refers to the column names.
inplace=True actually removes the columns from the DataFrame we are working on. Otherwise drop returns a new DataFrame without the columns.
In the first lesson we learnt how to load data into a DataFrame. In this lesson we will learn how to make calculations on the columns. Each column in a DataFrame is represented by a different data type, called a Series, and can be easily accessed. It is also easy to calculate new Series of data. This is similar to calculating new columns of data in an Excel sheet.
Step 1: Load the data.
import pandas as pd
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)
Step 2: Explore the data and data type
In the video we explore the data to ensure it is correct. You can do that by using data.head().
Then we investigate the data types of the columns of the DataFrame data.
data.dtypes
Which results in the following.
Open float64
High float64
Low float64
Close float64
Adj Close float64
Volume int64
dtype: object
This shows that each column has one data type. Here Open is float64. This is one difference from Excel sheets, where each individual cell has a data type. The advantage of restricting each column to a single data type is speed.
The data type of data is DataFrame.
type(data)
The built-in function type(…) gives you the type of an object. It is handy to use when exploring data.
pandas.core.frame.DataFrame
Notice that it is given by the long name pandas.core.frame.DataFrame; this reflects the module structure of the Pandas library.
The data type of a column in a DataFrame can be found as follows.
type(data['Close'])
Here data['Close'] gives access to the column Close in the DataFrame data.
pandas.core.series.Series
Here we see that a column is represented as a Series. A Series is similar to a DataFrame in that it has an index. E.g., the Series data['Close'] has the same index as the DataFrame data. This is handy when you need to work with the data, as you will see in a moment.
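A quick check (not from the original lesson) that the two indexes are identical:
# The Series keeps the same date index as the DataFrame it came from
data['Close'].index.equals(data.index)  # True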
Step 3: Calculating with Series
To keep it simple, we will start with the daily difference between the opening and closing prices.
daily_chg = data['Open'] - data['Close']
This calculates a Series daily_chg with the opening price minus the closing price.
Please explore the data in daily_chg and compare it with the data in data.
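The lesson then turns this into a daily percentage change. The exact code is not shown here, but a minimal sketch, reusing the daily_chg Series from above (the name daily_pct_chg is my own), could be:
# Daily change as a percentage of the opening price (same-row data only)
daily_pct_chg = daily_chg / data['Open'] * 100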
Here we calculate the daily percentage change. In the calculations above we have limited ourselves to only using data from the same rows (same dates). Later we will learn how to do it with data from the previous day (the row above).
Step 4: Normalize stock data
Now we will normalize the data by using iloc, which we learned about in the previous lesson.
norm = data['Close']/data['Close'].iloc[0]
The above statement calculates a Series norm where the Close price is normalized by dividing by the first available Close price, accessed using iloc[0].
As a result, norm.iloc[0] will be 1.0000, while norm.iloc[-1] shows the return of this particular stock if it was bought on day 1 (index 0) and sold on the day of the last index (index -1), which in the case of the video is 1.839521.
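If you want to inspect those two values yourself, a small sketch (the exact numbers depend on the date range of your CSV file; 1.839521 is from the video's data):
print(norm.iloc[0])   # always 1.0, since we divide by the first Close price
print(norm.iloc[-1])  # growth factor from the first to the last date in the data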
In this tutorial we will become familiar with working with DataFrames, the primary data structure in Pandas.
We will learn how to read historical stock price data from Yahoo! Finance and load it into a DataFrame. This will be done by exporting a CSV file from Yahoo! Finance and loading the data. Later we will learn how to read the data directly from the Yahoo! Finance API.
A DataFrame is similar to an Excel sheet: it can contain data in a similar way, as we will see in this lesson.
Then we will learn how to use the index of dates. This will be necessary when we make calculations later on.
The first part of the tutorial will give the foundation of what you need to know about DataFrames for financial analysis.
Step 1: Read the stock prices from Yahoo! Finance as CSV
In this first lesson we will download historical stock prices from Yahoo! Finance as CSV file and import them into our Jupyter notebook environment in a DataFrame.
If you are new to CSV files and DataFrames, don't worry: that is what we will cover here.
Let's start by going to Yahoo! Finance and downloading the CSV file. In this course we have used Apple, but feel free to make similar calculations on a stock of your choice.
Go to Yahoo! Finance, search for AAPL (the ticker for Apple), go to Historical Data, and download the CSV data file.
The CSV data file will contain Comma Separated Values (CSV) similar to this.
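An illustrative snippet of such a file (the header matches the column names described next; the data rows are copied from the 2021 output shown later in this lesson):
Date,Open,High,Low,Close,Adj Close,Volume
2021-01-04,133.520004,133.610001,126.760002,129.410004,129.410004,143301900
2021-01-05,128.889999,131.740005,128.429993,131.009995,131.009995,97664900
2021-01-06,127.720001,131.050003,126.379997,126.599998,126.599998,155088000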
The first line shows the column names (Date, Open, High, Low, Close, Adj Close, Volume). Then each line contains a data entry for a given day.
Step 2: Read the stock prices from CSV to Pandas DataFrame
In a Jupyter Notebook, start by importing the Pandas library. This is needed in order to load the data into a DataFrame.
import pandas as pd
data = pd.read_csv("AAPL.csv", index_col=0, parse_dates=True)
data.head()
The call to read_csv(…) does all the magic for us. It will read the CSV file AAPL.csv. The AAPL.csv file is the one you downloaded from Yahoo! Finance (or from the zip-file downloaded above) and needs to be located in the same folder you are working from in your Jupyter notebook.
The arguments in read_csv(…) are the following.
index_col=0 sets the first column of the CSV file to be the index. In this case, it is the Date column.
parse_dates=True ensures that the dates in the CSV file are interpreted as dates. This is important if you want to take advantage of the index being a datetime.
Step 3: Explore data types of columns and index
In the video lesson we explore the types of the columns and the index.
data.dtypes
data.index
This will reveal the data types and the index of the DataFrame. Notice that each column has its own data type.
Step 4: Indexing and slicing with DataFrames
We can use loc to lookup an index with a date.
data.loc['2020-01-27']
This will show the data for that specific date. If you get an error, it might be because your dataset does not contain the above date; choose another one to see something similar to this (see also the lookup shown after the output below).
Open 7.751500e+01
High 7.794250e+01
Low 7.622000e+01
Close 7.723750e+01
Adj Close 7.657619e+01
Volume 1.619400e+08
Name: 2020-01-27 00:00:00, dtype: float64
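If you are unsure which dates your file covers, you can look up a value taken directly from the index; a small sketch (not part of the original lesson):
# Look up the first date, which is guaranteed to be in your dataset
data.loc[data.index[0]]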
A more advanced option is to use an interval (or slice, as it is called). Slicing with loc on a DataFrame is done by using a starting and ending index, .loc[start:end], or an open-ended index, .loc[start:], which will take all data from start to the last entry.
data.loc['2021-01-01':]
This will give all the data starting from 2021-01-01. Notice that there is no data on January 1st, but since the index is interpreted as a datetime, Pandas can figure out the first date after it.
Open High Low Close Adj Close Volume
Date
2021-01-04 133.520004 133.610001 126.760002 129.410004 129.410004 143301900
2021-01-05 128.889999 131.740005 128.429993 131.009995 131.009995 97664900
2021-01-06 127.720001 131.050003 126.379997 126.599998 126.599998 155088000
2021-01-07 128.360001 131.630005 127.860001 130.919998 130.919998 109578200
2021-01-08 132.429993 132.630005 130.229996 132.050003 132.050003 105158200
2021-01-11 129.190002 130.169998 128.500000 128.979996 128.979996 100620900
Similarly, you can slice with an open-ended start.
data.loc[:'2020-07-01']
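You can also combine a start and an end into a closed interval; for example (the dates are taken from the output above and are only illustrative):
data.loc['2021-01-04':'2021-01-08']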
Another important way to index into DataFrames is iloc[], which looks up rows by their integer position.
data.iloc[0]
data.iloc[-1]
Here you can index from the start with positions 0, 1, 2, 3, … or from the end with -1, -2, -3, -4, …