What will we cover?
In this tutorial you will learn to load CSV, Excel and Parquet files into a pandas DataFrame.
Step 1: Load CSV file into a pandas DataFrame
A CSV file is a Comma-Separated Values file.
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. (wiki).
Wikipedia
A CSV file is a typical way to store tabular data in plain text. Each line will have the same number of fields.
This resembles a tabel in a Database.
Because of simplicity, the CSV files is a common exchange format. They are easy to use.
Also, see this lesson to learn about CSV files.
To use pandas to read a CSV file. See the following example.
import pandas as pd
file = 'https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/aapl.csv'
data = pd.read_csv(file, parse_dates=True, index_col='Date')
print(data.head())

You can change file to any CSV file on your storage. Here we use one from GitHub.
A few good parameters.
- parse_dates=True: will parse dates and keep them as datetime object in the DataFrame. This is convenient if you want to take advantage of using the DataFrame with, for example, a DatetimeIndex.
- index_coll=’Date’: You can use an integer to set the index of the column or the column name. This sets the index.
- sep=’;’: This can set a different separator than the default.
- Full documentation
read_csv()
: read a comma-separated values (csv) file into pandas DataFrame.
Step 2: Load Excel files into a pandas DataFrame
Do we need to introduce what an Excel file is?
- Most widely used spreadsheet.
- Learn more about Excel processing in this lecture.
import pandas as pd
file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.xlsx?raw=true'
data = pd.read_excel(file, index_col='Date')
print(data.head())




You can change file to point at any Excel file on your computer.
read_excel()
Read an Excel file into a pandas DataFrame.- index_col=’Date’: Works the same way as for read_csv().
Step 3: Load Parquet file into a pandas DataFrame
Many do not know Parquet files. A Parquet file is a free open source format.
The advantage of Parquet file is, that it is compressed and you can filter the rows while reading from the file.
import pandas as pd
file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.parquet?raw=true'
data = pd.read_parquet(file)
print(data.head())




Notice, that here the index is set by the Parquet file. Hence, you do not need to set it.
read_parquet()
Load a parquet object from the file path, returning a DataFrame.
Great Places to Find Data
Now you are hooked and want to find a lot of great datasets to work with.
- UC Irvine Machine Learning Repository!
- KD Nuggets Datasets for Data Mining, Data Science, and Machine Learning
- KD Nuggets Government, State, City, Local and Public
- KD Nuggets APIs, Hubs, Marketplaces, and Platforms
- KD Nuggets Analytics, Data Science, Data Mining Competitions
- data.gov The home of the U.S. Government’s open data
- data.gov.uk Data published by central government
- World Health Organization Explore a world of health data
- World Bank source of world data
- Kaggle is an online community of data scientists and machine learning practitioners.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
- 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
- 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).