Numba is a just-in-time compiler for Python that works well with NumPy. As we saw in the last tutorial, NumPy's built-in vectorization can, depending on the case and the size of the input, be faster than Numba.
Here we will explore that further and also see how Numba compares with lambda functions. Lambda functions have the advantage that they can be passed as an argument down to a library that can optimize performance instead of depending on slow Python loops.
Step 1: Example of Vectorization slower than Numba
In the previous tutorial we only investigated an example where vectorization was faster than Numba. Here we will see that this is not always the case.
import numpy as np
from numba import jit
import time

size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000

@jit(nopython=True)
def add_numba(a, b):
    # Explicit element-wise loop, compiled to machine code by Numba
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c

def add_vectorized(a, b):
    # NumPy's built-in element-wise addition
    return a + b

# We call the function once, to precompile the code
z = add_numba(x, y)

start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))
Varying the size of the NumPy array, we can compare the performance of the two in the graph below, where it is clear that the vectorized approach is slower.
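The measurement across array sizes can be sketched as follows. This is a minimal sketch, not the script behind the graph: the helper name time_call, the list of sizes, and the repeat count are assumptions made for illustration, and only the vectorized function is timed so that the block runs without Numba. Passing add_numba to the same helper, after one warm-up call, would give the other curve.

```python
import time

import numpy as np


def add_vectorized(a, b):
    return a + b


def time_call(fn, *args, repeats=100):
    """Return the average wall-clock seconds per call of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


# Hypothetical sizes; the graph in the text may have used others.
sizes = [10, 100, 500]
timings = {}
for size in sizes:
    x = np.random.rand(size, size)
    y = np.random.rand(size, size)
    timings[size] = time_call(add_vectorized, x, y)

for size, seconds in timings.items():
    print("size=%d: %.3e s per call" % (size, seconds))
```

time.perf_counter is used here instead of time.time because it is meant for measuring short intervals.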
Step 2: A more complex example comparing vectorized and Numba
An if-then-else can be expressed in vectorized form using the NumPy where function.
import numpy as np
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
iterations = 1000

@jit(nopython=True)
def numba(a):
    # Start from zeros and set 1 wherever the value is below 0.5
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c

def vectorized(a):
    # np.where picks 1 where the condition holds and 0 elsewhere
    return np.where(a < 0.5, 1, 0)

# We call the numba function to precompile it before we measure it
z = numba(x)

start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))
This results in the following comparison.
That is close, but the vectorized approach is a bit faster.
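Before comparing speed, it is worth checking that the two approaches compute the same thing. The following is a small sketch on a made-up 2x2 array, with the loop written in plain Python so that it runs without Numba. The values agree, but note that the loop version returns floats while np.where with the constants 1 and 0 returns integers.

```python
import numpy as np

a = np.array([[0.1, 0.7],
              [0.5, 0.3]])

# Loop version: start from zeros and set 1 where the condition holds.
looped = np.zeros(a.shape)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        if a[i, j] < 0.5:
            looped[i, j] = 1

# Vectorized version: pick 1 where the condition holds, 0 elsewhere.
vectorized = np.where(a < 0.5, 1, 0)

print(np.array_equal(looped, vectorized))  # True: the values match
print(looped.dtype, vectorized.dtype)      # float64 and an integer dtype
```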
Step 3: Compare Numba with lambda functions
I am very curious about this. Lambda functions are controversial in Python, and many are not happy about them because their syntax does not feel aligned with the rest of Python. On the other hand, lambda functions have the advantage that you can send them down into a library that can optimize over the for-loops.
import numpy as np
import pandas as pd
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
# A NumPy array has no apply method, so the lambda runs on a DataFrame
df = pd.DataFrame(x)
iterations = 1000

@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c

def lambda_run(a):
    # DataFrame.apply hands each column to the lambda
    return a.apply(lambda col: col + 1)

# Call the numba function to precompile it before time measurement
z = numba(x)

start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))

start = time.time()
for _ in range(iterations):
    z = lambda_run(df)
end = time.time()
print("Elapsed (lambda) = %s" % (end - start))
This results in the following performance comparison.
This is again tight; in this measurement the lambda approach came out a bit faster.
Remember, this is a single simple lambda function, and we cannot conclude that lambda functions in general are faster than Numba.
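To see what the library does with the lambda, here is a small sketch. It assumes the data is held in a pandas DataFrame, since a plain NumPy array has no apply method; the frame df and the helper add_one are made up for illustration. With DataFrame.apply, the function receives one whole column at a time, so the addition inside it is itself a vectorized operation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6, dtype=float).reshape(2, 3))

seen_types = set()


def add_one(col):
    # Record what kind of object apply hands to the function.
    seen_types.add(type(col).__name__)
    return col + 1


result = df.apply(add_one)

print(seen_types)             # the function is given pandas Series objects
print(result.equals(df + 1))  # same values as the direct vectorized form
```

This is why a lambda passed to apply can avoid a slow Python loop over every single element.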
Conclusion
The main learning since the last tutorial is that we have found an example where simple vectorization is slower than Numba. This still leads to the conclusion that performance depends highly on the task. Further, the lambda function seems to give promising performance. Again, this should be compared to the slow approach of a plain Python for-loop that is not just-in-time compiled to machine code by Numba.
Notice that we use pd.set_option calls to get the full view. If you are new to read_html from the pandas library, we recommend you read this tutorial.
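Which options were set is not shown here, so the following is only a sketch of how pd.set_option is typically used to get the full view. The three options and the tiny stand-in frame are assumptions made for illustration.

```python
import pandas as pd

# Show every column and row instead of an abbreviated view, and
# allow wide frames to print on one line per row.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 1000)

# Stand-in frame; the tutorial prints the table fetched with read_html.
df = pd.DataFrame({"Name": ["Walmart"], "Revenue(USD millions)": ["$514,405"]})
print(df)
print(df.dtypes)
```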
The top of the output will be as follows.
   Rank                      Name     Industry  Revenue(USD millions)  Profit(USD millions)  Employees                       Country   Ref
0     1                   Walmart       Retail               $514,405                $6,670    2200000                 United States   [5]
1     2             Sinopec Group  Oil and gas               $414,650                $5,845     619151                         China   [6]
2     3         Royal Dutch Shell  Oil and gas               $396,556               $23,352      81000  Netherlands / United Kingdom   [7]
3     4  China National Petroleum  Oil and gas               $392,976                $2,270    1382401                         China   [8]
4     5                State Grid  Electricity               $387,056                $8,174     917717                         China   [9]
5     6              Saudi Aramco  Oil and gas               $355,905              $110,974      76418                  Saudi Arabia  [10]
6     7                        BP  Oil and gas               $303,738                $9,383      73000                United Kingdom  [11]
7     8                ExxonMobil  Oil and gas               $290,212               $20,840      71000                 United States  [12]
8     9                Volkswagen   Automotive               $278,341               $14,332     664496                       Germany  [13]
And the last lines.
Rank                      int64
Name                     object
Industry                 object
Revenue(USD millions)    object
Profit(USD millions)     object
Employees                 int64
Country                  object
Ref                      object
dtype: object
Here we see which data type each column has. Not surprisingly, the Revenue and Profit columns are of type object (strings in this case).
Hence, if we want to sum up values, we need to transform them to floats. This is a bit tricky, as the output above shows. An example is $6,670, where two things stand in the way of a conversion to float. First, there is a dollar sign ($) at the beginning. Second, there is a comma (,) in the number, which a simple cast to float does not handle.
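The two issues can be seen directly in plain Python. This small sketch uses the example value from the text and shows that removing only one of the two characters is not enough.

```python
raw = "$6,670"

# A direct cast fails because of the dollar sign and the comma.
try:
    float(raw)
except ValueError as err:
    print("float() failed:", err)

# Removing only one of the two characters still fails.
for candidate in (raw[1:], raw.replace(",", "")):
    try:
        float(candidate)
    except ValueError:
        print("still not a number:", candidate)

# With both removed, the cast works.
cleaned = raw[1:].replace(",", "")
print(float(cleaned))  # 6670.0
```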
Now let's deal with the two columns, each with its own method.
Method 1: Using pandas DataFrame/Series vectorized string functions
Vectorization with pandas data structures is the process of executing an operation on the entire data structure at once. This is handy, as the alternative would be to write a loop.
Also, pandas has many string functions available for vectorization, as you can see in the documentation.
First off, we can access the string methods of a column through .str, and then apply the string function. In our case, we will use substring slicing with square brackets to remove the dollar sign.
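A minimal sketch of this step, using a small stand-in frame with three revenue strings from the table above in place of the scraped table:

```python
import pandas as pd

# Stand-in for the scraped frame (an assumption for illustration).
df = pd.DataFrame({"Revenue(USD millions)": ["$514,405", "$414,650", "$396,556"]})

# .str exposes string operations on the whole column; [1:] drops the
# first character of every entry, here the dollar sign.
df["Revenue(USD millions)"] = df["Revenue(USD millions)"].str[1:]
print(df)
```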
   Rank                      Name     Industry  Revenue(USD millions)  Profit(USD millions)  Employees                       Country   Ref
0     1                   Walmart       Retail                514,405                $6,670    2200000                 United States   [5]
1     2             Sinopec Group  Oil and gas                414,650                $5,845     619151                         China   [6]
2     3         Royal Dutch Shell  Oil and gas                396,556               $23,352      81000  Netherlands / United Kingdom   [7]
3     4  China National Petroleum  Oil and gas                392,976                $2,270    1382401                         China   [8]
4     5                State Grid  Electricity                387,056                $8,174     917717                         China   [9]
5     6              Saudi Aramco  Oil and gas                355,905              $110,974      76418                  Saudi Arabia  [10]
6     7                        BP  Oil and gas                303,738                $9,383      73000                United Kingdom  [11]
7     8                ExxonMobil  Oil and gas                290,212               $20,840      71000                 United States  [12]
8     9                Volkswagen   Automotive                278,341               $14,332     664496                       Germany  [13]
Then we need to remove the comma (,). This can be done by using replace.
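A minimal sketch of the replace step, again on a small stand-in frame, this time starting from values that already have the dollar sign removed:

```python
import pandas as pd

# Stand-in for the frame after the dollar sign was removed.
df = pd.DataFrame({"Revenue(USD millions)": ["514,405", "414,650", "396,556"]})

# regex=False treats the comma as a literal character, not a pattern.
df["Revenue(USD millions)"] = df["Revenue(USD millions)"].str.replace(",", "", regex=False)
print(df)
```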
   Rank                      Name     Industry  Revenue(USD millions)  Profit(USD millions)  Employees                       Country   Ref
0     1                   Walmart       Retail                 514405                $6,670    2200000                 United States   [5]
1     2             Sinopec Group  Oil and gas                 414650                $5,845     619151                         China   [6]
2     3         Royal Dutch Shell  Oil and gas                 396556               $23,352      81000  Netherlands / United Kingdom   [7]
3     4  China National Petroleum  Oil and gas                 392976                $2,270    1382401                         China   [8]
4     5                State Grid  Electricity                 387056                $8,174     917717                         China   [9]
5     6              Saudi Aramco  Oil and gas                 355905              $110,974      76418                  Saudi Arabia  [10]
6     7                        BP  Oil and gas                 303738                $9,383      73000                United Kingdom  [11]
7     8                ExxonMobil  Oil and gas                 290212               $20,840      71000                 United States  [12]
8     9                Volkswagen   Automotive                 278341               $14,332     664496                       Germany  [13]
Finally, we need to convert the string to a float.
This changes the type of the column rather than the values; the only visible difference in the printed output is a trailing .0.
Nice and easy: the data is prepared in one line. Notice that you could choose to split it over multiple lines. It is a matter of taste.
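Chained together, the three steps fit in one line. This is a sketch on a stand-in frame; the variable col is only a shorthand introduced here.

```python
import pandas as pd

df = pd.DataFrame({"Revenue(USD millions)": ["$514,405", "$414,650", "$396,556"]})
col = "Revenue(USD millions)"

# Drop the dollar sign, drop the commas, then cast to float.
df[col] = df[col].str[1:].str.replace(",", "", regex=False).astype(float)

print(df)
print(df.dtypes)
```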
Method 2: Using pandas DataFrame lambda function
Another way to prepare data is by using a lambda function. If you are new to lambda functions, we recommend you read this tutorial.
Here you go through the column row by row and apply the lambda function you define to each value.
The next column has the same challenge as the first one, so let's apply it to that.
In this case, we cannot use substring slicing with square brackets as above, because some figures are negative and have a minus sign before the dollar sign. But using the replace call will do fine.
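The difference between slicing and replace on a negative figure can be sketched like this. The negative entry is made up for illustration and is not taken from the table.

```python
import pandas as pd

# "-$1,234" is a hypothetical negative entry.
profit = pd.Series(["$6,670", "-$1,234"])

# Slicing removes the first character, which for a negative figure
# is the minus sign and not the dollar sign.
sliced = profit.str[1:]

# replace removes the dollar sign wherever it is.
replaced = profit.apply(lambda s: s.replace("$", ""))

print(sliced.tolist())    # ['6,670', '$1,234']
print(replaced.tolist())  # ['6,670', '-1,234']
```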
   Rank                      Name     Industry  Revenue(USD millions)  Profit(USD millions)  Employees                       Country   Ref
0     1                   Walmart       Retail               514405.0                 6,670    2200000                 United States   [5]
1     2             Sinopec Group  Oil and gas               414650.0                 5,845     619151                         China   [6]
2     3         Royal Dutch Shell  Oil and gas               396556.0                23,352      81000  Netherlands / United Kingdom   [7]
3     4  China National Petroleum  Oil and gas               392976.0                 2,270    1382401                         China   [8]
4     5                State Grid  Electricity               387056.0                 8,174     917717                         China   [9]
5     6              Saudi Aramco  Oil and gas               355905.0               110,974      76418                  Saudi Arabia  [10]
6     7                        BP  Oil and gas               303738.0                 9,383      73000                United Kingdom  [11]
7     8                ExxonMobil  Oil and gas               290212.0                20,840      71000                 United States  [12]
8     9                Volkswagen   Automotive               278341.0                14,332     664496                       Germany  [13]
Removing the commas in the same way gives the following.
   Rank                      Name     Industry  Revenue(USD millions)  Profit(USD millions)  Employees                       Country   Ref
0     1                   Walmart       Retail               514405.0                  6670    2200000                 United States   [5]
1     2             Sinopec Group  Oil and gas               414650.0                  5845     619151                         China   [6]
2     3         Royal Dutch Shell  Oil and gas               396556.0                 23352      81000  Netherlands / United Kingdom   [7]
3     4  China National Petroleum  Oil and gas               392976.0                  2270    1382401                         China   [8]
4     5                State Grid  Electricity               387056.0                  8174     917717                         China   [9]
5     6              Saudi Aramco  Oil and gas               355905.0                110974      76418                  Saudi Arabia  [10]
6     7                        BP  Oil and gas               303738.0                  9383      73000                United Kingdom  [11]
7     8                ExxonMobil  Oil and gas               290212.0                 20840      71000                 United States  [12]
8     9                Volkswagen   Automotive               278341.0                 14332     664496                       Germany  [13]
Finally, we will do the same for casting it to a float.
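Put into a single lambda, the whole conversion might look like this. It is a sketch on a stand-in frame; the last entry is a made-up negative figure to show that the sign survives.

```python
import pandas as pd

df = pd.DataFrame({"Profit(USD millions)": ["$6,670", "$5,845", "-$1,234"]})
col = "Profit(USD millions)"

# For each value: drop the dollar sign, drop the commas, cast to float.
df[col] = df[col].apply(lambda s: float(s.replace("$", "").replace(",", "")))

print(df)
print(df.dtypes)
```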
To be honest, it is a matter of taste in this case. When things can be achieved by simple string manipulation calls that are available as vectorized calls, there is nothing to gain from lambda functions.
The strength of lambda functions is their flexibility. You can put any function in there, which is a big strength. The vectorized functions are limited to simple operations, which nevertheless cover a lot of use cases.
Putting it all together
Well, now that we have come this far, let's put it all together and get some nice data: sum it up, print it sorted, and make a horizontal bar plot.
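As a rough sketch of that last step, the following sums the revenue and sorts the result. Grouping by country is an assumption made here, since the text does not say what the sum is taken over, and the frame is a small stand-in with four rows from the tables above, already converted to floats.

```python
import pandas as pd

# Stand-in rows; the tutorial works on the full cleaned table.
df = pd.DataFrame({
    "Country": ["United States", "China", "China", "United States"],
    "Revenue(USD millions)": [514405.0, 414650.0, 392976.0, 290212.0],
})

# Assumption: totals are taken per country, then sorted ascending.
totals = df.groupby("Country")["Revenue(USD millions)"].sum().sort_values()
print(totals)

# With matplotlib installed, totals.plot.barh() draws the horizontal bar plot.
```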