Load files with pandas: CSV and Excel and Parquet files

What will we cover?

In this tutorial you will learn to load CSV, Excel and Parquet files into a pandas DataFrame.

Step 1: Load CSV file into a pandas DataFrame

A CSV file is a Comma-Separated Values file.

comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.  (wiki).

Wikipedia

A CSV file is a typical way to store tabular data in plain text. Each line will have the same number of fields.

This resembles a tabel in a Database.

Because of simplicity, the CSV files is a common exchange format. They are easy to use.

Also, see this lesson to learn about CSV files.

To use pandas to read a CSV file. See the following example.

import pandas as pd
file = 'https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/aapl.csv'
data = pd.read_csv(file, parse_dates=True, index_col='Date')
print(data.head())

You can change file to any CSV file on your storage. Here we use one from GitHub.

A few good parameters.

  • parse_dates=True: will parse dates and keep them as datetime object in the DataFrame. This is convenient if you want to take advantage of using the DataFrame with, for example, a DatetimeIndex.
  • index_coll=’Date’: You can use an integer to set the index of the column or the column name. This sets the index.
  • sep=’;’: This can set a different separator than the default.
  • Full documentation read_csv(): read a comma-separated values (csv) file into pandas DataFrame.

Step 2: Load Excel files into a pandas DataFrame

Do we need to introduce what an Excel file is?

import pandas as pd
file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.xlsx?raw=true'
data = pd.read_excel(file, index_col='Date')
print(data.head())

You can change file to point at any Excel file on your computer.

  • read_excel() Read an Excel file into a pandas DataFrame.
  • index_col=’Date’: Works the same way as for read_csv().

Step 3: Load Parquet file into a pandas DataFrame

Many do not know Parquet files. A Parquet file is a free open source format.

The advantage of Parquet file is, that it is compressed and you can filter the rows while reading from the file.

import pandas as pd
file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.parquet?raw=true'
data = pd.read_parquet(file)
print(data.head())

Notice, that here the index is set by the Parquet file. Hence, you do not need to set it.

  • read_parquet() Load a parquet object from the file path, returning a DataFrame.

Great Places to Find Data

Now you are hooked and want to find a lot of great datasets to work with.

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
  • 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).

Leave a Reply

%d bloggers like this: