What will we cover?
In this tutorial you will learn how to connect to a SQLite Database using pandas.
- What is a SQLite database?
- A few SQLite datasets to play with.
- How to use sqlite3 connector with pandas.
- A few useful SQL statements.
Step 1: What is a SQLite database?
There are different types of databasis, but here we will only learn about the relational database model.
Simply said, a relational database is like a collection of DataFrames with rows of data over the same columns. Each column has a datatype, just like a DataFrame.
What makes a database relational, is, that there are pre-defined relationships between them. The data is organized in one or mor tables (or relations) of columns and rows, with a unique key identifying each row.
Later you will see how these relationships can be used to combine tables.
SQLite database software library that provides a relational database management system. It has a lightweight to setup, administrate, and requires low resources.
Therefore the SQLite database is poplar way to have databases on smartphone, small units, or just sharing data in projects.
Step 2: Get a SQLite dataset
To start work with a SQLite database you need a dataset to work with.
You can download the Dallas Officer-Involved Shootings SQLite database here: Download. We will use this dataset as our example to demonstrate it.
The database has three tables: incidents, officers, and subjects.
Other SQLite datasets
Step 3: Database connector
You need a connector to interact with the database.
We will use the sqlite3 is an interface for SQLite databases. No installation is needed, which also makes it nice to work with SQLite databases.
If you work with other databases there are other connectors.
- MySQL Connector for MySQL
- Psycopg for PostgreSQL
- pymssql for Microsoft MS SQL
Let’s try some code.
import sqlite3
conn = sqlite3.connect('files/dallas-ois.sqlite')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
print(cursor.fetchall())
Here we assume that the downloaded file from Step 2 is put in a folder files.
- First, we create a connection to it.
- Then make a cursor.
- Then we execute an SQL statement.
- The we fetch all and print it out.
[('incidents',), ('officers',), ('subjects',)]
You can get a description of the table in SQLite as follows.
print(cursor.execute("PRAGMA table_info(officers)").fetchall())
Notice, we combine execute and fetchall() here.
[(0, 'case_number', 'TEXT', 0, None, 0),
(1, 'race', 'TEXT', 0, None, 0),
(2, 'gender', 'TEXT', 0, None, 0),
(3, 'last_name', 'TEXT', 0, None, 0),
(4, 'first_name', 'TEXT', 0, None, 0),
(5, 'full_name', 'TEXT', 0, None, 0)]
Here we see the column names of the table and the types. All types are TEXT.
Step 4: A small introduction to SQL syntax.
When you use pandas with databases, often you are only interested in getting the tables over in DataFrames.
This, luckily, limits the number of SQL queries you need to master.
Get all data from table.
SELECT * FROM table_name
Sometimes you want to limit the number for rows you get, because the dataset might be huge and it takes time. Hence, while working with the model you want to create, you limit the number of rows from the table.
Here we limit it to only 100 first rows of the table.
SELECT * FROM table_name LIMIT 100
Sometimes you are only interested in specific data, and there is no use to extract all the data from the database. Then you can filter with a WHERE clause in your SQL syntax.
SELECT * FROM table_name WHERE column_name > 1
Step 5: Import data into a DataFrame
To read the data into a DataFrame can be done by using the connection we created.
import pandas as pd
officers = pd.read_sql('SELECT * FROM officers', conn)
Then you have all the data in the DataFrame officers.
print(officers.head())
Step 6: SQL join syntax to combine tables
It can be convenient to combine data directly from the Database.
This can be done by using JOIN syntax.
(INNER) JOIN
: returns records that have matching values in both tables
SELECT * FROM table_1 JOIN table_2 ON table_1.column_name_1=table_2.column_name_2
LEFT JOIN
: returns all records from the left table, and the matched records from the right table
SELECT * FROM table_1 LEFT JOIN table_2 ON table_1.column_name_1=table_2.column_name_2
Let’s try.
officers = pd.read_sql('SELECT * FROM officers JOIN incidents ON officers.case_number=incidents.case_number', conn)
print(officers.head())
Want to learn more?
Want to learn more about Data Science to become a successful Data Scientist?
This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.
- 15 video lessons – covers the Data Science Workflow and concepts, demonstrates everything on real data, introduce projects and shows a solution (YouTube video).
- 30 JuPyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
- 15 projects – structured with the Data Science Workflow and a solution explained in the end of video lessons (GitHub).