What will we cover in this tutorial
Top level differences between NumPy and Pandas
First of all, the two libraries serve different purposes.
- NumPy is made to manage n-dimensional numerical data. Think of it when you need to handle a lot of numerical data of a single type, for example organized in rows and columns.
- Pandas is made for tabular data. This could be data from an Excel sheet, where you have various types of data organized in rows and columns.
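To make the difference concrete, here is a minimal sketch. The values and column names are made up for illustration only.

```python
import numpy as np
import pandas as pd

# NumPy: every element shares one dtype (here 64-bit floats).
temperatures = np.array([[20.5, 21.0, 19.8],
                         [22.1, 23.4, 21.9]])
print(temperatures.dtype)   # float64
print(temperatures.shape)   # (2, 3)

# Pandas: each column can hold a different type, like a spreadsheet.
people = pd.DataFrame({
    "name": ["Ann", "Bob"],       # text
    "age": [31, 45],              # integers
    "height_m": [1.68, 1.82],     # floats
})
print(people.dtypes)        # one dtype per column
```

The array has a single dtype for all of its elements, while the DataFrame reports a separate dtype for every column.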
There are more differences.
- NumPy is built around the ndarray data type, which is created with a fixed size and a single element type.
- Pandas is built around Series and DataFrames, which are more dynamic after creation.
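The two points above can be sketched as follows. The variable names and values are illustrative assumptions, not part of any particular data set.

```python
import numpy as np
import pandas as pd

# An ndarray cannot grow in place: np.append returns a new array.
a = np.arange(3)
b = np.append(a, 3)
print(a.shape, b.shape)  # (3,) (4,)

# A DataFrame can be extended after creation.
df = pd.DataFrame({"x": [1, 2, 3]})
df["label"] = ["a", "b", "c"]  # add a column of a different type
df.loc[3] = [4, "d"]           # add a row under the new index label 3
print(df.shape)  # (4, 2)
```

Note that `np.append` leaves `a` untouched and allocates a new array, whereas the DataFrame itself is modified.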
Performance comparison of NumPy and Pandas
If you had to guess which one is faster, would you say Pandas? Of course not. NumPy can be orders of magnitude faster than Pandas.
Let us first examine it.
```python
import time
import numpy as np
import pandas as pd

size = 100
iterations = 100000000//size

a = np.arange(size)
start = time.time()
for _ in range(iterations):
    a2 = a * a
end = time.time()
print(end - start)

n = pd.Series(a)
start = time.time()
for _ in range(iterations):
    n2 = n * n
end = time.time()
print(end - start)
```
Which results in the following comparison.
I find it very interesting that Pandas is so slow compared to NumPy for small sizes, while for larger sizes the advantage seems to shift towards Pandas, although eventually NumPy still seems to come out ahead.
The flexibility of Pandas has a cost, and that cost is high for small sizes when doing arithmetic operations as in the example above.
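To see how the gap changes with size, the measurement can be repeated for several sizes. The following is a minimal sketch that uses the standard library `timeit` module instead of manual `time.time()` calls. The helper name `time_square` and the chosen sizes are assumptions for illustration, and the numbers will vary between machines and library versions.

```python
import timeit
import numpy as np
import pandas as pd

def time_square(size, repeats=1000):
    """Return (numpy_seconds, pandas_seconds) for `repeats` element-wise squarings."""
    a = np.arange(size)
    s = pd.Series(a)
    numpy_time = timeit.timeit(lambda: a * a, number=repeats)
    pandas_time = timeit.timeit(lambda: s * s, number=repeats)
    return numpy_time, pandas_time

for size in (10, 100, 1_000, 10_000, 100_000):
    numpy_time, pandas_time = time_square(size)
    print(f"size={size:>7}  numpy={numpy_time:.4f}s  pandas={pandas_time:.4f}s  "
          f"ratio={pandas_time / numpy_time:.1f}x")
```

If the per-call overhead of Pandas is the main cost, the printed ratio should be largest for the smallest sizes and shrink as the arrays grow.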
Investigate further how NumPy and Pandas compare in performance for various functions.
Pandas and NumPy support a lot of functions in a vectorized way, which could be interesting to investigate. Do the restrictions of NumPy arrays give the underlying C/C++ code an advantage in performance?
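As a starting point for such an investigation, here is a hedged sketch that times a few vectorized operations on an ndarray and on a Series holding the same data. The choice of `np.sqrt` and `sum`, the array size, and the repeat count are arbitrary assumptions; swap in whichever functions you want to compare.

```python
import timeit
import numpy as np
import pandas as pd

a = np.arange(1, 1001, dtype=np.float64)
s = pd.Series(a)

# The same vectorized function works on both containers and gives equal values.
assert np.allclose(np.sqrt(a), np.sqrt(s).to_numpy())

candidates = {
    "np.sqrt on ndarray": lambda: np.sqrt(a),
    "np.sqrt on Series": lambda: np.sqrt(s),
    "ndarray.sum": lambda: a.sum(),
    "Series.sum": lambda: s.sum(),
}

timings = {}
for label, func in candidates.items():
    timings[label] = timeit.timeit(func, number=2_000)
    print(f"{label:<20} {timings[label]:.4f}s")
```

Changing the size of `a` shows whether any difference comes from a fixed per-call overhead or from the computation itself.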