Performance comparison of Numba vs Vectorization vs Lambda function with NumPy

What will we cover in this tutorial?

We will continue our investigation of Numba from this tutorial.

Numba is a just-in-time compiler for Python that works amazingly with NumPy. As we saw in the last tutorial, the built in vectorization can depending on the case and size of instance be faster than Numba.

Here we will explore that further as well to see how Numba compares with lambda functions. Lambda functions has the advantage, that they can be parsed as an argument down to a library that can optimize the performance and not depend on slow Python code.

Step 1: Example of Vectorization slower than Numba

In the previous tutorial we only investigated an example of vectorization, which was faster than Numba. Here we will see, that this is not always the case.

import numpy as np
from numba import jit
import time
size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000

@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c

def add_vectorized(a, b):
    return a + b

# We call the function once, to precompile the code
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))
start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

Varying the size of the NumPy array, we can see the performance between the two in the graph below.

Where it is clear that the vectorized approach is slower.

Step 2: Try some more complex example comparing vectorized and Numba

A if-then-else can be expressed as vectorized using the Numpy where function.

import numpy as np
from numba import jit
import time

size = 1000
x = np.random.rand(size, size)
iterations = 1000

@jit(nopython=True)
def numba(a):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] < 0.5:
                c[i, j] = 1
    return c

def vectorized(a):
    return np.where(a < 0.5, 1, 0)

# We call the numba function to precompile it before we measure it
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))
start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

This results in the following comparison.

That is close, but the vectorized approach is a bit faster.

Step 3: Compare Numba with lambda functions

I am very curious about this. Lambda functions are controversial in Python, and many are not happy about them as they have a lot of syntax, which is not aligned with Python. On the other hand, lambda functions have the advantage that you can send them down in the library that can optimize over the for-loops.

import numpy as np
from numba import jit
import time
size = 1000
x = np.random.rand(size, size)
iterations = 1000

@jit(nopython=True)
def numba(a):
    c = np.zeros((size, size))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + 1
    return c

def lambda_run(a):
    return a.apply(lambda x: x + 1)

# Call the numba function to precompile it before time measurement
z = numba(x)
start = time.time()
for _ in range(iterations):
    z = numba(x)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))
start = time.time()
for _ in range(iterations):
    z = vectorized(x)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

Resulting in the following performance comparison.

This is again tight, but the lambda approach is still a bit faster.

Remember, this is a simple lambda function and we cannot conclude that lambda function in general are faster than using Numba.

Conclusion

Learnings since the last tutorial is that we have found an example where simple vectorization is slower than Numba. This still leads to the conclusion that performance highly depends on the task. Further, the lambda function seems to give promising performance. Again, this should be compared to the slow approach of a Python for-loop without Numba just-in-time compiled machine code.

Sort a Python List with String of Integers or a Mixture

What will we cover in this tutorial?

  • How can you sort a list of strings containing integers by the integer value?
  • Or what if it contains both strings containing integers and integers?
  • Finally, also how if only a substring contains integers?

Why sort on a list of integers represented in strings fails

First of, we need to understand why it is not trivial to solve by just calling sort on the list.

Let’s just try with an example.

l = ['4', '8', '12', '23', '4']
l.sort()
print(l)

Which will result in the following list.

['12', '23', '4', '4', '8']

Where you see the list is sorted lexicographical order and not by the numeric value the strings represent.

How to solve this

Solving this is quite straight forward if you know your way around Python. You look in the documentation and see that it takes a key as argument. Okay, you are new to this, so what does it mean.

key specifies a function of one argument that is used to extract a comparison key from each list element

Python docs.

Still not comfortable about it. Let’s try to figure it out together. If you are new to Python, you might not know that you can send functions as arguments like any other value.

The key argument is a function that will be applied on every item in the list. The output of that function will be used to make a simple comparison and order it by that.

That is great news. Why?

I am glad you asked. If we just use the int() function as argument, it should cast the string to an integer and use that for comparison and our problem is solved.

Let’s try.

l = ['4', '8', '12', '23', '4']
l.sort(key=int)
print(l)

Resulting to the following list.

['4', '4', '8', '12', '23']

How simple is that?

What if my list is a mixture of integers and strings of integers?

What is your wild guess?

l = ['4', '8', 12, '23', 4]
l.sort(key=int)
print(l)

Notice that some integers are not strings any more. Let see the output.

['4', 4, '8', 12, '23']

It works. This is why we love Python!

But what if it is more complex?

A complex examples of sorting

Say we have a list of of strings like this one.

l = ['4 dollars', '8 dollars', '12 dollars', '23 dollars', '4 dollars']

The story is something like this. You ask a lot of providers how much it will cost to give a specific service. The answers are given in the list and you want to investigate them in order of lowest price.

We can just do the same, right?

l = ['4 dollars', '8 dollars', '12 dollars', '23 dollars', '4 dollars']
l.sort(key=int)
print(l)

Wrong!

Traceback (most recent call last):
  File "main.py", line 2, in <module>
    l.sort(key=int)
ValueError: invalid literal for int() with base 10: '4 dollars'

The string is not just an integer. It contains more information.

The good luck is that we can send any function. Let’s try to create one.

def comp(o):
  return int(o.split()[0])
l = ['4 dollars', '8 dollars', '12 dollars', '23 dollars', '4 dollars']
l.sort(key=comp)
print(l)

And the output is as desired.

['4 dollars', '4 dollars', '8 dollars', '12 dollars', '23 dollars']

Too fast? Let’s just analyse our function comp. It contains only one return statement. Try to read it from inside out.

o.split() splits the string up in a list of items contain word by word. Hence, the call of ‘4 dollars’.split() will result in [‘4’, ‘dollars’].

Then o.split()[0] will return the first item of that list, i.e. ‘4’.

Finally, we cast it to an integer by int(o.split()[0]).

Remember that the comparison is done by the output of the function, that is what the function returns, which in this case is the integer represented by the first item in the string.

What about lambda?

Lambda? Yes, lambda functions is also a hot subject.

A lambda function is just a smart way to write simple functions you send as arguments to other functions. Like in this case a sorting function.

Let’s try if we can do that.

l = ['4 dollars', '8 dollars', '12 dollars', '23 dollars', '4 dollars']
l.sort(key=lambda o: int(o.split()[0]))
print(l)

Resulting in the same output.

['4 dollars', '4 dollars', '8 dollars', '12 dollars', '23 dollars']

A bit magic with lambda functions? We advice you to read this tutorial on the subject.

Pandas: Data Preparation with Vectorized Strings vs Lambda Functions

What will we cover in this tutorial?

Understand the challenge

Most of the time when you read data into a pandas DataFrame it need to be prepared.

To be more concrete, let’s look at an example. Let’s consider we want to look at the List of largest companies by revenue on Wikipedia.

From wikipedia.org

You can read an inspect the data by the following code by using a DataFrame from the pandas library.

import pandas as pd
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
tables = pd.read_html(url)
table = tables[0]
print(table)
print(table.dtypes)

Notice that we use pd.set_option calls to get the full view. If you are new to read_html from the pandas library, we recommend you read this tutorial.

The top of the output will be as follows.

    Rank                        Name                     Industry Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail              $514,405               $6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas              $414,650               $5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas              $396,556              $23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas              $392,976               $2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity              $387,056               $8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas              $355,905             $110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas              $303,738               $9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas              $290,212              $20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive              $278,341              $14,332     664496                       Germany  [13]

And the last lines.

Rank                      int64
Name                     object
Industry                 object
Revenue(USD millions)    object
Profit(USD millions)     object
Employees                 int64
Country                  object
Ref                      object
dtype: object

Where we see interesting information about what data types each column has. Not surprisingly, the Revenue and Profit columns are of type object (which are strings in this case).

Hence, if we want to sum up values, we need to transform them to floats. This is a bit tricky, as the output shows above. An example is $6,670, where there are two issues to transform them to floats. First, there is a dollars ($) sign in the beginning. Second, there is comma (,) in the number, which a simple cast to float does not handle.

Now let’s deal with them in each their method.

Method 1: Using pandas DataFrame/Series vectorized string functions

Vectorization with pandas data structures is the process of executing operations on entire data structure. This is handy, as the alternative would be to make a loop-function.

Also, the pandas has many string functions available for vectorization as you can see in the documentation.

First of, we can access the string object by using the .str, then we can apply the string function. In our case, we will use the substring with square brackets to remove the dollar sign.

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:]

Which will give the following output.

    Rank                        Name                     Industry Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail               514,405               $6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas               414,650               $5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas               396,556              $23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas               392,976               $2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity               387,056               $8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas               355,905             $110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas               303,738               $9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas               290,212              $20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive               278,341              $14,332     664496                       Germany  [13]

Then we need to remove the comma (,). This can be done by using replace.

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '')

Which will result in the following output.

    Rank                        Name                     Industry Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail                514405               $6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas                414650               $5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas                396556              $23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas                392976               $2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity                387056               $8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas                355905             $110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas                303738               $9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas                290212              $20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive                278341              $14,332     664496                       Germany  [13]

Finally, we need to convert the string to a float.

index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '').astype(float)

Which does not change the printed output, but the type of the column.

Nice and easy, to prepare the data in one line. Notice, that you could chose to make it in multiple lines. It is a matter of taste.

Method 2: Using pandas DataFrame lambda function

Another way to prepare data is by using a lambda function. If you are new to lambda functions, we recommend you read this tutorial.

Here you can do it row by row and apply your defined lambda function.

The next column has the same challenge as the first one. So let’s apply it on that.

In this case, we cannot use the substring with square brackets like in the case above, as some figures are negative and contain that minus sign before the dollar sign. But using the replace call will do fine.

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: row[index_p].replace('$', ''), axis=1)

Which would result in the following output.

    Rank                        Name                     Industry  Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail               514405.0                6,670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas               414650.0                5,845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas               396556.0               23,352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas               392976.0                2,270    1382401                         China   [8]
4      5                  State Grid                  Electricity               387056.0                8,174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas               355905.0              110,974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas               303738.0                9,383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas               290212.0               20,840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive               278341.0               14,332     664496                       Germany  [13]

Then we do the same to remove the comma (,).

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: row[index_p].replace('$', '').replace(',', ''), axis=1)

Which result in the following output.

    Rank                        Name                     Industry  Revenue(USD millions) Profit(USD millions)  Employees                       Country   Ref
0      1                     Walmart                       Retail               514405.0                 6670    2200000                 United States   [5]
1      2               Sinopec Group                  Oil and gas               414650.0                 5845     619151                         China   [6]
2      3           Royal Dutch Shell                  Oil and gas               396556.0                23352      81000  Netherlands / United Kingdom   [7]
3      4    China National Petroleum                  Oil and gas               392976.0                 2270    1382401                         China   [8]
4      5                  State Grid                  Electricity               387056.0                 8174     917717                         China   [9]
5      6                Saudi Aramco                  Oil and gas               355905.0               110974      76418                  Saudi Arabia  [10]
6      7                          BP                  Oil and gas               303738.0                 9383      73000                United Kingdom  [11]
7      8                  ExxonMobil                  Oil and gas               290212.0                20840      71000                 United States  [12]
8      9                  Volkswagen                   Automotive               278341.0                14332     664496                       Germany  [13]

Finally, we will do the same for casting it to a float.

index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: float(row[index_p].replace('$', '').replace(',', '')), axis=1)

Which will produce the same output.

Comparing the two methods

To be honest, it is a matter of taste in this case. When things can be achieved by simple string manipulation calls that are available through the vectorized calls, there is nothing to gain by lambda functions.

The strength of lambda functions is the flexibility. You can actually do anything function in there, which is a big strength. The vectorized functions are limited to simple operations, which covers a lot of use cases.

Putting it all together

Well, now we came so far, let’s put it all together and get some nice data. Sum it up and print it sorted out and make a horizontal bar plot.

import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
tables = pd.read_html(url)
table = tables[0]
index_r = 'Revenue(USD millions)'
table[index_r] = table[index_r].str[1:] .str.replace(',', '').astype(float)
index_p = 'Profit(USD millions)'
table[index_p] = table.apply(lambda row: float(row[index_p].replace('$', '').replace(',', '')), axis=1)
table = table.drop(['Rank'], axis=1)

print(table.groupby('Country').sum().sort_values([index_r, index_p], ascending=False))
table.groupby('Industry').sum().sort_values([index_r, index_p], ascending=False).plot.barh()
plt.show()

The output graph.

Resulting output graph.

And the output from the program.

                              Revenue(USD millions)  Profit(USD millions)  Employees
Country                                                                             
United States                             4169049.0              243970.0    6585076
China                                     2263521.0              182539.0    5316321
Germany                                    602635.0               31693.0    1105639
Japan                                      561157.0               27814.0     670636
Netherlands / United Kingdom               396556.0               23352.0      81000
Saudi Arabia                               355905.0              110974.0      76418
France                                     309684.0               13971.0     208525
United Kingdom                             303738.0                9383.0      73000
Russia                                     250447.0               33062.0     568600
South Korea                                221579.0               39895.0     221579
Switzerland                                219754.0                3408.0      85504
Singapore                                  180744.0                 849.0       4316
Taiwan                                     175617.0                4281.0     667680
Netherlands                                175009.0                1589.0     314790