What will we cover?
In this tutorial you will learn to load CSV, Excel and Parquet files into a pandas DataFrame.
Step 1: Load CSV file into a pandas DataFrame
A CSV file is a Comma-Separated Values file.
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. (wiki).Wikipedia
A CSV file is a typical way to store tabular data in plain text. Each line will have the same number of fields.
This resembles a tabel in a Database.
Because of simplicity, the CSV files is a common exchange format. They are easy to use.
Also, see this lesson to learn about CSV files.
To use pandas to read a CSV file. See the following example.
import pandas as pd file = 'https://raw.githubusercontent.com/LearnPythonWithRune/DataScienceWithPython/main/files/aapl.csv' data = pd.read_csv(file, parse_dates=True, index_col='Date') print(data.head())
You can change file to any CSV file on your storage. Here we use one from GitHub.
A few good parameters.
- parse_dates=True: will parse dates and keep them as datetime object in the DataFrame. This is convenient if you want to take advantage of using the DataFrame with, for example, a DatetimeIndex.
- index_coll=’Date’: You can use an integer to set the index of the column or the column name. This sets the index.
- sep=’;’: This can set a different separator than the default.
- Full documentation
read_csv(): read a comma-separated values (csv) file into pandas DataFrame.
Step 2: Load Excel files into a pandas DataFrame
Do we need to introduce what an Excel file is?
import pandas as pd file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.xlsx?raw=true' data = pd.read_excel(file, index_col='Date') print(data.head())
You can change file to point at any Excel file on your computer.
read_excel()Read an Excel file into a pandas DataFrame.
- index_col=’Date’: Works the same way as for read_csv().
Step 3: Load Parquet file into a pandas DataFrame
Many do not know Parquet files. A Parquet file is a free open source format.
The advantage of Parquet file is, that it is compressed and you can filter the rows while reading from the file.
import pandas as pd file = 'https://github.com/LearnPythonWithRune/DataScienceWithPython/blob/main/files/aapl.parquet?raw=true' data = pd.read_parquet(file) print(data.head())
Notice, that here the index is set by the Parquet file. Hence, you do not need to set it.
read_parquet()Load a parquet object from the file path, returning a DataFrame.
Great Places to Find Data
Now you are hooked and want to find a lot of great datasets to work with.
- UC Irvine Machine Learning Repository!
- KD Nuggets Datasets for Data Mining, Data Science, and Machine Learning
- data.gov The home of the U.S. Government’s open data
- data.gov.uk Data published by central government
- World Health Organization Explore a world of health data
- World Bank source of world data
- Kaggle is an online community of data scientists and machine learning practitioners.
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
- 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
- 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).