15 Most Useful pandas Shortcut Methods

What will you learn?

Everybody likes the pandas data structure DataFrame, but most miss out on the powerful methods it provides.

pandas is a huge library, which makes it difficult to master. Most just use the data structure (DataFrame) without utilizing the power of its methods. In this tutorial you will learn the 15 most useful shortcut methods that will help you when working with data in pandas data structures.

#1 groupby

The groupby method involves some combination of splitting the object, applying a function, and combining the result.

Wow. That sounds complex. But it is not. It can be used to group large amounts of data and compute operations on these groups.

The best way to learn is to see some examples.

import pandas as pd
data = {'Items': ['Apple','Orange', 'Pear', 'Orange', 'Apple'], 
        'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)

This results in the following DataFrame.
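
    Items  Price
0   Apple     12
1  Orange      5
2    Pear      3
3  Orange      7
4   Apple     24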

The groupby method can group the items together, and apply a function. Let’s try it here.

df.groupby(['Items']).mean()

This will result in the following output.
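
        Price
Items
Apple    18.0
Orange    6.0
Pear      3.0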

As you see, it has grouped the Apples, Oranges, and Pears together, and for the Price column it has applied the mean() function to the values.

Hence, Apple has the value 18.0, as that is the mean of 12 and 24 ((12 + 24)/2). Similarly for Orange and Pear.

#2 memory_usage()

We get more and more data, and our projects get bigger and bigger. At some point you will need to analyze how much memory your data is using.

What memory_usage() does is return the memory usage of each column in the DataFrame. Sometimes the data type of a column is object, which means it points to another object. To include the memory usage of these objects, you need to pass the deep=True argument.

Let’s try both, to see the difference.

import pandas as pd
import numpy as np

dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
data = dict([(t, np.ones(shape=1000, dtype=int).astype(t)) for t in dtypes])
df = pd.DataFrame(data)
print(df.head())

Then we can get the memory usage as follows.

print(df.memory_usage())

Giving the following.

Index           128
int64          8000
float64        8000
complex128    16000
object         8000
bool           1000
dtype: int64

Also, with deep=True.

print(df.memory_usage(deep=True))

Giving the following, where you can see that the object column uses much more space.

Index           128
int64          8000
float64        8000
complex128    16000
object        36000
bool           1000
dtype: int64

#3 clip()

clip() trims values at the given thresholds.

I find this is easiest to understand by inspecting an example.

import pandas as pd
data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
print(df)
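
This gives the following DataFrame.

   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5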

Then we apply clip(), which will replace values below -2 with -2 and values above 5 with 5. It clips the values.

print(df.clip(-2, 5))
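
Giving the following, where all values are now between -2 and 5.

   col_0  col_1
0      5     -2
1     -2     -2
2      0      5
3     -1      5
4      5     -2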

#4 corr()

The pairwise correlation between the columns of a DataFrame can be calculated with corr(). There are different methods to use: Pearson, Kendall, and Spearman. By default it uses the Pearson method, which will do fine for giving you an idea of whether columns are correlated.

Let’s try an example.

import pandas as pd
df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                  columns=['dogs', 'cats'])

The correlation matrix is given by the following.

print(df.corr())
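
Giving.

          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000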

The value 1.0 means perfect correlation, and it is shown on the diagonal. This makes sense, as the diagonal correlates each column with itself.
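
If you want to try one of the other methods, you can pass it with the method argument. A minimal sketch, using the same df as above:

# Rank-based correlations instead of the default Pearson
print(df.corr(method='spearman'))
print(df.corr(method='kendall'))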

To learn more about correlation and statistics, be sure to check this tutorial out, which also explains the correlation value and how to interpret it.

#5 argmin()

The name argmin is a bit strange. What it does is return the position (the integer index) of the smallest value in a Series (a column of a DataFrame).

import pandas as pd
s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})
print(s)

Gives.

Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64

And to get the position of the smallest value, just apply the method.

print(s.argmin())

Which will give 0. Remember that it is zero-indexed, meaning that the first element has index 0.

#6 argmax()

Just like argmin(), argmax() returns the position of the largest element in a Series.

Continue with the example from above.

print(s.argmax())

This will give 2, as the largest element (Cinnamon Toast Crunch at 120.0) is at position 2 in the Series.

#7 compare()

Want to know the differences between two DataFrames or Series? Then compare() does a great job at that.

import pandas as pd
import numpy as np
df = pd.DataFrame(
     {
         "col1": [1.0, 2.0, 3.0, np.nan, 5.0],
         "col2": [1.0, 2.0, 3.0, 4.0, 5.0]
     },
     columns=["col1", "col2"],
)

We can compare the columns here.

print(df['col1'].compare(df['col2']))
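
Giving.

   self  other
3   NaN    4.0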

As you see, the only row that differs is row 3, where col1 has NaN and col2 has 4.0.

#8 replace()

Did you ever need to replace a value in a DataFrame? Well, it also has a method for that and it is called replace().

import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

Let’s try to replace 5 with -10 and see what happens.

print(df.replace(5, -10))
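
Giving the following, where the value 5 in column B has been replaced with -10.

   A   B  C
0  0 -10  a
1  1   6  b
2  2   7  c
3  3   8  d
4  4   9  e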

#9 isna()

Wanted to find missing values? Then isna can do that for you.

Let’s try it.

import pandas as pd
import numpy as np
df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

Then you get the values as follows.

print(df.isna())
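
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False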

I often use it in combination with sum(), which will tell you how many values in each column are missing. This is useful for getting an idea about the quality of the dataset.

print(df.isna().sum())

Giving.

age     1
born    1
name    0
toy     1
dtype: int64

#10 interpolate()

On the subject of missing values, what can you do about them? Well, there are many options, but a simple one is to interpolate the values.

import pandas as pd
import numpy as np
s = pd.Series([0, 1, np.nan, 3])

This gives the following series.

0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64

Then you can interpolate and get the value between them.

print(s.interpolate())

Gives.

0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

This is just one way to deal with it. Dealing with missing values is a big subject. To learn more read this tutorial on the subject.

#11 drop()

Ever needed to remove a column in a DataFrame? Well, again they made a method for that.

Let’s try the drop() method to remove a column.

import pandas as pd
data = {'Age': [-44,0,5, 15, 10, -3], 
        'Salary': [0,5,-2, -14, 19, 24]}
df = pd.DataFrame(data)

Then let’s remove the Age column.

df2 = df.drop('Age', axis='columns')
print(df2)
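
Giving the following, where only the Salary column is left.

   Salary
0       0
1       5
2      -2
3     -14
4      19
5      24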

Notice that it returns a new DataFrame; the original df is unchanged.

#12 drop_duplicates()

Dealing with data that has duplicate rows? Well, it is a common problem and pandas made a method to easily remove them from your DataFrame.

It is called drop_duplicates and does what it says.

Let’s try it.

import pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

This DataFrame has duplicate rows. Let's see how they can be removed.

df2 = df.drop_duplicates()
print(df2)
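
Giving the following, where the duplicate row (index 1) has been removed.

     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0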

#13 sum()

Ever needed to sum a column? Even with a multi-index?

Let’s try.

import pandas as pd
idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
print(s)

This will output.

blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

Then this will sum the column.

print(s.sum())

And it will output 14, as expected.

#14 cumsum()

Wanted to make a cumulative sum? Then cumsum() does the job for you, even with missing numbers.

import pandas as pd
import numpy as np
s = pd.Series([2, np.nan, 5, -1, 0])
print(s)

This will give.

0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

And then.

print(s.cumsum())

Gives.

0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

Where it makes a cumulative sum down the column. Notice that the NaN stays NaN, but the sum continues past it.

#15 value_counts()

The value_counts() method returns a Series with the counts of the unique rows in a DataFrame.

This requires an example to really understand.

import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])

Here we see we have two rows with 4 and 0 (in that order), while the other rows have unique values.

print(df.value_counts())

Giving.

num_legs  num_wings
4         0            2
2         2            1
6         0            1
dtype: int64

We see there are two rows with 4 and 0, and one of each of the other rows.

Bonus: unique()

Wanted the unique elements in your Series?

Here you go.

import pandas as pd
s = pd.Series([2, 1, 3, 3], name='A')
print(s.unique())

This will give the unique elements.

array([2, 1, 3])

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects and showing a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanation from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).
