NumPy vs Pandas

What will we cover in this tutorial

This tutorial gives a high-level view of the differences between the NumPy and Pandas libraries in Python. We will also briefly explore the performance differences in a specific use case.

Top level differences between NumPy and Pandas

First of all, the purposes of these libraries are different.

  • NumPy is made to manage n-dimensional numerical data. Think of it when you need to handle a lot of numerical data, all of the same type, organized in rows and columns.
  • Pandas is made for tabular data. This could be data from an Excel sheet, where you have various types of data organized in rows and columns.

There are more differences.

  • NumPy consists of the data type ndarray, which is created with fixed dimensions and only one element type.
  • Pandas consists of Series and DataFrames, which are more dynamic after creation.
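To make the contrast concrete, here is a minimal sketch. The variable names and sample values are made up for illustration. It shows an ndarray holding a single element type, and a DataFrame holding a different type in each column and gaining a column after creation.

```python
import numpy as np
import pandas as pd

# An ndarray has one dtype for all elements and a fixed shape.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)  # (2, 3)
print(arr.dtype)  # a single integer dtype (the exact width depends on the platform)

# A DataFrame can hold a different type in each column.
df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})

# Columns can be added after creation.
df["count"] = [10, 20]
print(df.dtypes)
```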

Performance comparison of NumPy and Pandas

Which would you guess is faster? Pandas? Of course not. For simple arithmetic on small arrays, NumPy is far faster than Pandas.

Why?

Let us first examine it.

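import time
import numpy as np
import pandas as pd

size = 100
# Keep the total number of element operations constant across sizes.
iterations = 100000000 // size

# Time element-wise multiplication on a NumPy array.
a = np.arange(size)
start = time.time()
for _ in range(iterations):
    a2 = a * a
end = time.time()
print(end - start)

# Time the same operation on a Pandas Series.
n = pd.Series(a)
start = time.time()
for _ in range(iterations):
    n2 = n * n
end = time.time()
print(end - start)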

This results in the following comparison.

[Figure: NumPy vs Pandas]

I find it very interesting that Pandas is so slow compared to NumPy for small instances. For larger instances Pandas seems to gain the advantage for a while, but eventually NumPy still seems to be faster.

The flexibility of Pandas has a cost, which is high for small instances when performing arithmetic operations as we did in the above example.
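One way to look at this cost across sizes is the following sketch. It uses the standard library's timeit module instead of the manual timing above. The helper name time_square, the chosen sizes, and the repeat count are assumptions for illustration. The absolute numbers depend on your machine and library versions, so no output is shown.

```python
import timeit

import numpy as np
import pandas as pd


def time_square(obj, repeats=1000):
    """Return the average seconds per `obj * obj` over `repeats` runs."""
    return timeit.timeit(lambda: obj * obj, number=repeats) / repeats


results = {}
for size in (10, 1_000, 100_000):
    a = np.arange(size)
    s = pd.Series(a)
    # Store (NumPy time, Pandas time) for this size.
    results[size] = (time_square(a), time_square(s))

for size, (t_numpy, t_pandas) in results.items():
    print(f"size={size:>7}: numpy={t_numpy:.2e}s pandas={t_pandas:.2e}s")
```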

Next steps

Investigate further how NumPy and Pandas compare in performance for various functions.

Pandas and NumPy support a lot of functions in a vectorized way, which could be interesting to investigate. Do the restrictions of NumPy arrays give the underlying C/C++ code an advantage in performance?
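As a possible starting point for such an investigation, the sketch below applies a vectorized NumPy function to both a Series and its underlying ndarray. The sample values are made up. np.sqrt and Series.to_numpy are existing calls, but this is only one of many functions you could compare.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# NumPy universal functions accept a Series and return a Series.
roots = np.sqrt(s)
print(type(roots).__name__)  # Series

# to_numpy() exposes the values as an ndarray, without the index.
values = s.to_numpy()
print(np.sqrt(values))       # [1. 2. 3.]
```

Timing both variants for different sizes would show how much of the difference comes from the Pandas layer on top of the array.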

Quick NumPy Tutorial

What is NumPy?

NumPy is a scientific library that provides a multidimensional array object with fast routines for operating on it. NumPy is short for Numerical Python.

When we talk about NumPy, we often refer to the powerful ndarray, which is the multidimensional array (N-dimensional array).

A few comparisons between Python lists and ndarray.

ndarray | Python list
Has a fixed size at creation. | Is dynamic. You can add and remove elements.
All elements have the same type. | Elements have types independent of each other.
Can execute fast mathematical operations with simple syntax. | Needs loops to make operations on each element.

Table: Comparison between NumPy ndarray and Python list

Examples showing the difference: Fixed after creation

NumPy is conventionally imported with import numpy as np. To create an ndarray, you can use the array call as shown below.

import numpy as np

data = np.array([[1, 2, 3], [1, 2, 3]])
print(data)

This creates a 2-dimensional array object with 2 times 3 elements.

[[1 2 3]
 [1 2 3]]

This is a fixed-size ndarray. You cannot add elements or dimensions to it in place. Functions such as np.append return a new array instead.
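A small sketch of what "fixed size" means in practice: np.append does not grow the existing array but builds a new one. The extra row [7, 8, 9] is a made-up value for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3], [1, 2, 3]])

# np.append with axis=0 returns a new array with the extra row.
bigger = np.append(data, [[7, 8, 9]], axis=0)

print(data.shape)    # (2, 3) - the original is unchanged
print(bigger.shape)  # (3, 3)
```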

A Python list is more flexible.

my_list = []
my_list.append(2)
my_list.append(4)
my_list.remove(2)
print(my_list)

This demonstrates the flexibility and power of Python lists. It is simple to add and remove elements. The above code will result in the following output.

[4]

Examples showing the difference: One type

The element type of an ndarray is stored in dtype. The interesting thing is that every element must have the same type.

import numpy as np

data = np.random.randn(2, 3)

print(data)
print(data.dtype)

It will result in a random ndarray of type float64.

[[-0.85925182 -0.89247774 -2.40920842]
 [ 0.84647869  0.27631307 -0.80772023]]
float64

An interesting way to demonstrate that only one type can be present in an ndarray is to try to create one with a mixture of ints and floats.

import numpy as np

data = np.array([[1.0, 2, 3], [1, 2, 3]])
print(data)
print(data.dtype)

As one of the elements is a float, they are all cast to float64.

[[1. 2. 3.]
 [1. 2. 3.]]
float64

A Python list, by contrast, can hold mixed types.

my_list = [1.0, 2, 3]
print(my_list)

Here the first element is a float, while the second and third elements are ints.

[1.0, 2, 3]
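Returning to the ndarray side, the single-type rule also applies after creation. The sketch below, with made-up values, shows that a float assigned into an integer array is converted to the array's type, and uses np.result_type to ask which common type NumPy picks for ints and floats.

```python
import numpy as np

ints = np.array([1, 2, 3])

# Assigning a float into an integer array drops the fractional part.
ints[0] = 2.7
print(ints)  # [2 2 3]

# The common type for 64-bit ints and floats is float64.
print(np.result_type(np.int64, np.float64))  # float64
```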

Examples showing the difference: No loops needed

Many operations can be made directly on the ndarray.

import numpy as np

data = np.random.randn(2, 3)

print(data)
print(data*10)
print(data + data)

Which will result in the following output.

[[ 1.18303358 -2.20017954  0.46294824]
 [-0.56508587  0.0990272  -1.8431866 ]]
[[ 11.83033584 -22.00179538   4.62948243]
 [ -5.65085867   0.990272   -18.43186601]]
[[ 2.36606717 -4.40035908  0.92589649]
 [-1.13017173  0.1980544  -3.6863732 ]]

As expected, right? But it is easy to multiply and add out of the box.

The equivalent for a Python list would be the following.

my_list = [1, 2, 3]
for i in range(len(my_list)):
    my_list[i] *= 10

for i in range(len(my_list)):
    my_list[i] += my_list[i]

And it is not even the same, as the loops overwrite the old elements instead of creating new ones.
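A closer list equivalent, which leaves the original untouched as the ndarray expressions do, could use list comprehensions. The names times_ten and doubled are chosen here for illustration.

```python
my_list = [1, 2, 3]

# Build new lists instead of overwriting the old elements.
times_ten = [x * 10 for x in my_list]
doubled = [x + x for x in my_list]

print(my_list)    # [1, 2, 3]
print(times_ten)  # [10, 20, 30]
print(doubled)    # [2, 4, 6]
```

This still loops over every element in Python, which is where the speed difference shown later comes from.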

Another way to compare differences

At first glance, ndarrays might seem inflexible, with all their restrictions compared to Python lists. That is true, but the benefit is speed.

import time
import numpy as np

my_arr = np.arange(1000000)
my_list = list(range(1000000))

# Time doubling every element of the ndarray, 10 times.
start = time.time()
for _ in range(10):
    my_arr2 = my_arr * 2
end = time.time()
print(end - start)

# Time doubling every element of the list, 10 times.
start = time.time()
for _ in range(10):
    my_list2 = [x * 2 for x in my_list]
end = time.time()
print(end - start)

This resulted in the following output.

0.03456306457519531
0.9373760223388672

In this run the ndarray version was roughly 27 times faster than the list version. ndarrays are often 10-100 times faster than Python lists for this kind of operation, which makes a considerable impact on scientific calculations.