Get Started with pandas for Data Science

Why use pandas?

When working with tabular data (spreadsheets, databases, etc.), pandas is a strong choice. It makes it easy to acquire, explore, clean, process, analyze, and visualize your data.

This basically covers the full Data Science Workflow.

Great community

pandas has a large, active community, so help is easy to find. That matters when you choose your main library for handling data.

It is no secret that pandas is a large tool and can seem complex at times. In practice, pandas can do (almost) everything with data: if you can do it in Excel, you can almost certainly do it in pandas, and automate it as well.

Getting started with pandas

If you use Jupyter Notebook through a distribution such as Anaconda, pandas is typically already installed. Otherwise, you can install pandas from a terminal as follows.

pip install pandas

Now let’s get started with pandas.

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/aapl.csv', parse_dates=True, index_col=0)
data.head()

This should display the first five rows of the DataFrame in your Notebook.

If you use another environment and nothing shows up, change the last line to the following.

print(data.head())
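If you want to try read_csv without a network connection, the same call also works on in-memory text. The following is a minimal sketch: the three rows and the name sample are invented for illustration and are not the contents of aapl.csv.

```python
import io

import pandas as pd

# Made-up sample rows for illustration only (not the real aapl.csv contents).
csv_text = """Date,Open,Close
2021-05-03,10.0,11.0
2021-05-04,11.0,10.5
2021-05-05,10.5,12.0
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
# index_col=0 uses the first column as the index; parse_dates=True parses it as dates.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

print(sample.head(2))      # head(n) returns the first n rows (5 if n is omitted)
print(type(sample.index))  # the parsed dates give a DatetimeIndex
```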

Index and Columns

You can get the index of the DataFrame with .index. This returns the row labels, which here are the dates from the first column of the file.

data.index

Also, you can get all the column names as follows.

data.columns

Column data type of DataFrame

You can get the data types of the columns of a DataFrame as follows.

data.dtypes
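The three attributes above can be tried together on a tiny DataFrame. This is only a sketch: the name df and all values are hypothetical, chosen so that one column holds floats and the other integers.

```python
import pandas as pd

# Hypothetical miniature DataFrame; the values are invented for illustration.
df = pd.DataFrame(
    {"Open": [10.0, 11.0, 10.5], "Volume": [100, 150, 120]},
    index=pd.to_datetime(["2021-05-03", "2021-05-04", "2021-05-05"]),
)

print(df.index)    # the row labels (a DatetimeIndex with three dates)
print(df.columns)  # the column labels: Open and Volume
print(df.dtypes)   # one data type per column: float64 for Open, int64 for Volume
```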

The Size and Shape of Data

You get the number of rows with len(data).

len(data)

Here it will print 472.

You can also get the shape of the data, which is the number of rows and columns.

data.shape

This gives (472, 6).
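As a small sketch of how len and .shape relate, the example below uses a hypothetical three-row DataFrame (the name df and the values are made up). The first element of the shape tuple is the same number that len returns.

```python
import pandas as pd

# Hypothetical data: 3 rows and 2 columns.
df = pd.DataFrame({"Open": [10.0, 11.0, 10.5], "Close": [11.0, 10.5, 12.0]})

n_rows = len(df)                       # number of rows
n_rows_from_shape, n_cols = df.shape   # shape is a (rows, columns) tuple

print(n_rows)    # 3
print(df.shape)  # (3, 2)
```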

Slicing rows and columns

A DataFrame can be used to select (or filter) data in many ways. This is often called slicing. Below we give a few examples, which cover most common cases.

  • data['Close']: Select one column (Series)
  • data[['Open', 'Close']]: Select multiple columns with specific names
  • data.loc['2020-05-01':'2021-05-01']: Select the rows between the dates (including 2021-05-01), with all columns
  • data.iloc[50:55]: Select the rows at positions 50 to 54 (55 is excluded), with all columns

First let’s try to select one column in the DataFrame. This will return a Series, which is another data structure in pandas. A Series is a one-dimensional sequence of values that uses the same index as the DataFrame.

data['Close']

You can also select two or more columns by passing a list of column names (the inner square brackets []). Here is an example.

data[['Close', 'Open']]

If you want data from a range of rows, you can slice by index label with .loc. Notice that both the start and the end label are included.

data.loc['2021-05-03':'2021-05-14']

Because the index here is a DatetimeIndex, we can also select all data for a given day, month, or year as follows.

data.loc['2021-05']

Sometimes we do not want to use the index labels. In that case we can select rows by integer position with .iloc as follows.

data.iloc[50:55]
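The difference between the inclusive label slice of .loc and the end-exclusive position slice of .iloc is easy to check on a small example. The sketch below assumes a made-up DataFrame df with six consecutive daily dates; none of the values come from the original data.

```python
import pandas as pd

# Hypothetical data: six consecutive days, 2021-04-28 through 2021-05-03.
dates = pd.date_range("2021-04-28", periods=6, freq="D")
df = pd.DataFrame(
    {"Open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "Close": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]},
    index=dates,
)

close = df["Close"]           # one column -> Series
both = df[["Open", "Close"]]  # list of columns -> DataFrame

by_label = df.loc["2021-04-29":"2021-05-01"]  # 3 rows: both end labels are included
by_month = df.loc["2021-05"]                  # 3 rows: every row dated in May 2021
by_position = df.iloc[1:3]                    # 2 rows: positions 1 and 2, position 3 excluded

print(len(by_label), len(by_month), len(by_position))  # 3 3 2
```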

Arithmetic operations

Like with an Excel sheet, you often want to make calculations on columns of data. This can be done simply, as the following example shows. Notice that you can easily create a new column in your DataFrame.

  • Calculating with columns on all rows
    • Example: data['Close'] - data['Open']
  • Creating new columns
    • Example: data['New'] = data['Open'] - data['Close']
data['New'] = data['Open'] - data['Close']
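To see that the subtraction is applied row by row, here is a minimal sketch on a hypothetical three-row DataFrame (the name df and the values are invented for illustration).

```python
import pandas as pd

# Hypothetical prices, chosen so the differences are easy to verify by hand.
df = pd.DataFrame({"Open": [10.0, 11.0, 10.5], "Close": [11.0, 10.5, 12.0]})

# The subtraction is element-wise: each row's Close is subtracted from its Open.
df["New"] = df["Open"] - df["Close"]

print(df["New"].tolist())  # [-1.0, 0.5, -1.5]
```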

Select data

When you want to filter data based on value, you can do it as the following example shows.

  • Select data based on boolean expressions
    • Example: data['New'] > 0
    • Example: data[data['New'] > 0]
data[data['New'] > 0]
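The filter works in two steps: the comparison builds a Series of True/False values, and indexing the DataFrame with that Series keeps only the rows marked True. A small sketch with made-up values (df and mask are hypothetical names):

```python
import pandas as pd

# Hypothetical data with a precomputed New column.
df = pd.DataFrame({"Open": [10.0, 11.0, 10.5], "Close": [11.0, 10.5, 12.0]})
df["New"] = df["Open"] - df["Close"]  # -1.0, 0.5, -1.5

mask = df["New"] > 0  # boolean Series, one value per row
positive = df[mask]   # keeps only the rows where mask is True

print(mask.tolist())  # [False, True, False]
print(positive)
```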

Groupby and value_counts

A great thing about DataFrames is how easy it is to group data. The example below is somewhat artificial, but it illustrates the idea.

  • Example: data['Category'] = data['New'] > 0 followed by data.groupby('Category').mean()
  • Example: data['Category'].value_counts() or (data['New'] > 0).value_counts()
data['Category'] = data['New'] > 0
data.groupby('Category').mean()
data['Category'].value_counts()
(data['New'] > 0).value_counts()
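The grouping steps above can be sketched on a small hypothetical DataFrame, where the group means and counts are simple enough to check by hand. The name df and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: after the subtraction, New is -1.0, 0.5, -1.5.
df = pd.DataFrame({"Open": [10.0, 11.0, 10.5], "Close": [11.0, 10.5, 12.0]})
df["New"] = df["Open"] - df["Close"]
df["Category"] = df["New"] > 0  # False, True, False

# One row per category, holding the mean of every other column in that group.
means = df.groupby("Category").mean()

# Number of rows in each category.
counts = df["Category"].value_counts()

print(means)   # the False group averages rows 0 and 2, e.g. Open = (10.0 + 10.5) / 2 = 10.25
print(counts)  # False appears twice, True once
```

In a Notebook only the last expression of a cell is displayed automatically, which is why this sketch uses print for both results.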

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15-part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons: cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects, and show a solution (YouTube video).
  • 30 Jupyter Notebooks: contain the full code and explanations from the lectures and projects (GitHub).
  • 15 projects: structured around the Data Science Workflow, with a solution explained at the end of each video lesson (GitHub).