When working with tabular data (spreadsheets, databases, etc.), pandas is the right tool. pandas makes it easy to acquire, explore, clean, process, analyze, and visualize your data.
This covers essentially the full data science workflow.
pandas has a large community with plenty of help available, which matters when you choose your main library for handling data.
It is no secret that pandas is a large tool and at times can seem complex. In fact, pandas can do (almost) everything with data: if you can do it in Excel, you can almost certainly do it in pandas, and automate it as well.
If you use Jupyter Notebook through a distribution such as Anaconda, pandas is typically already installed. If you use another setup, you can install pandas from a terminal as follows.
pip install pandas
Now let’s get started with pandas.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/aapl.csv', parse_dates=True, index_col=0)
data.head()
In a Notebook, this should display the first five rows of the DataFrame.
If you use another environment and nothing shows up, change the last line to the following.
print(data.head())
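To see what parse_dates=True and index_col=0 do without downloading anything, here is a minimal sketch that reads a small CSV from a string. The CSV text and the name sample are made up for illustration and are not the real AAPL data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file such as aapl.csv.
csv_text = """Date,Open,Close
2021-05-03,10.0,11.0
2021-05-04,11.0,10.5
2021-05-05,10.5,12.0
"""

# index_col=0 uses the first column (Date) as the index.
# parse_dates=True asks pandas to parse that index as dates.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

print(sample.head(2))               # first two rows only
print(type(sample.index).__name__)  # the index type after date parsing
```

Because the first column is parsed as dates, the resulting index supports the date-based selection used later in this lesson.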
You can get the index of the DataFrame by using .index. This will give the index column.
data.index
Also, you can get all the column names as follows.
data.columns
You can get the data types of the columns of a DataFrame as follows.
data.dtypes
You can get the number of rows of data by using len(data).
len(data)
Here it will print 472.
Also you can get the shape of data, which is the number of rows and columns.
data.shape
Which would give (472, 6).
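The same inspection steps can be tried on a tiny DataFrame built by hand. This is only a sketch: the name frame and its values are invented for illustration, so the numbers differ from the 472 rows above.

```python
import pandas as pd

# A small, made-up DataFrame with a date index (not the real AAPL data).
frame = pd.DataFrame(
    {"Open": [10.0, 11.0, 10.5], "Volume": [100, 150, 120]},
    index=pd.to_datetime(["2021-05-03", "2021-05-04", "2021-05-05"]),
)

print(frame.index)          # the row labels (here a DatetimeIndex)
print(list(frame.columns))  # the column names
print(frame.dtypes)         # one data type per column
print(len(frame))           # number of rows
print(frame.shape)          # (number of rows, number of columns)
```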
A DataFrame can be used to select (or filter) data in many ways. This is often called slicing. Below we give a few examples, which cover the most common cases.
data['Close'] : Select one column (Series)
data[['Open', 'Close']] : Select multiple columns with specific names
data.loc['2020-05-01':'2021-05-01'] : Select all rows between the dates (including 2021-05-01)
data.iloc[50:55] : Select rows 50-55 (excluding 55)
First let’s try to select one column in the DataFrame. This will return a Series, which is another data structure in pandas. A Series is essentially a labeled list of data that uses the same index as the DataFrame.
data['Close']
You can also select two or more columns by passing a list of column names (the inner square brackets []). Here is an example.
data[['Close', 'Open']]
If you want data from a range of rows, you can use .loc with index labels. Notice that both the start and the end label are included.
data.loc['2021-05-03':'2021-05-14']
In this case the index is a DatetimeIndex, so we can select all data for a given day, month, or year as follows.
data.loc['2021-05']
Sometimes we do not want to use the index labels. Then we can use integer positions with .iloc as follows. Here the end position is excluded.
data.iloc[50:55]
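The difference between label-based and position-based slicing is easy to miss, so here is a small sketch on made-up data. The name frame and the eight daily rows are assumptions for illustration only.

```python
import pandas as pd

# Eight consecutive days: 2021-04-28 through 2021-05-05 (made-up values).
dates = pd.date_range("2021-04-28", periods=8, freq="D")
frame = pd.DataFrame({"Close": range(8)}, index=dates)

# Label-based slice with .loc: both end points are included.
by_label = frame.loc["2021-05-01":"2021-05-03"]

# Partial date string with .loc: every row in May 2021.
by_month = frame.loc["2021-05"]

# Position-based slice with .iloc: the end position is excluded.
by_position = frame.iloc[2:5]

print(len(by_label))     # 3 rows: May 1, 2 and 3
print(len(by_month))     # 5 rows: May 1 through May 5
print(len(by_position))  # 3 rows: positions 2, 3 and 4
```

.loc follows the labels of the index, while .iloc follows plain Python slicing rules, which is why only the latter drops the end point.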
As with an Excel sheet, you often want to make calculations on columns of data. This can be done simply, as the following example shows. Notice that you can easily create a new column in your DataFrame.
data['Close'] - data['Open']
data['New'] = data['Open'] - data['Close']
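As a minimal sketch of the same idea on made-up numbers (the name frame and its values are assumptions for illustration), the subtraction is applied row by row and the result is stored as a new column.

```python
import pandas as pd

# Made-up prices for three days.
frame = pd.DataFrame({"Open": [10.0, 11.0, 10.5], "Close": [11.0, 10.5, 12.0]})

# Arithmetic between two columns works element by element and returns a Series.
difference = frame["Close"] - frame["Open"]

# Assigning to a column name that does not exist yet creates that column.
frame["New"] = frame["Open"] - frame["Close"]

print(difference.tolist())    # [1.0, -0.5, 1.5]
print(frame["New"].tolist())  # [-1.0, 0.5, -1.5]
```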
When you want to filter rows based on their values, first build a boolean condition and then use it to index the DataFrame, as the following example shows.
data['New'] > 0
data[data['New'] > 0]
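The two steps can be seen separately in this small sketch. The names frame, mask, and positive, and the values used, are made up for illustration.

```python
import pandas as pd

# Made-up values for the New column.
frame = pd.DataFrame({"New": [-1.0, 0.5, -1.5]})

# Step 1: the comparison returns a boolean Series, one True/False per row.
mask = frame["New"] > 0

# Step 2: indexing with that Series keeps only the rows where it is True.
positive = frame[mask]

print(mask.tolist())             # [False, True, False]
print(positive["New"].tolist())  # [0.5]
```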
A great thing about DataFrames is how easy it is to group data. The example below is contrived, but it illustrates the idea: we label each row by whether New is positive, compute the mean of every column per group, and count the rows in each group.
data['Category'] = data['New'] > 0
data.groupby('Category').mean()
data['Category'].value_counts()
(data['New'] > 0).value_counts()
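Here is a minimal sketch of grouping and counting on made-up numbers, so the result is easy to check by hand. The name frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up prices for four days.
frame = pd.DataFrame(
    {"Open": [10.0, 11.0, 10.5, 12.0], "Close": [11.0, 10.5, 12.0, 11.0]}
)
frame["New"] = frame["Open"] - frame["Close"]  # -1.0, 0.5, -1.5, 1.0
frame["Category"] = frame["New"] > 0           # False, True, False, True

# One row per category (False first, then True), with the mean of each column.
means = frame.groupby("Category").mean()

# Number of rows in each category.
counts = frame["Category"].value_counts()

print(means)
print(counts)
```

Here the mean of Open is 10.25 for the False group and 11.5 for the True group, and each group contains two rows.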
This is one lesson of a 15-part Expert Data Science Blueprint course.