# 15 Most Useful pandas Shortcut Methods

Everybody likes the pandas DataFrame, but most people miss out on the powerful methods it provides.

pandas is a huge module, which makes it difficult to master. Most people just use the data structure (DataFrame) without utilizing the power of its methods. In this tutorial you will learn the 15 most useful shortcut methods that will help you when working with data in pandas data structures.

The **groupby** method involves some combination of splitting the object, applying a function, and combining the result.

Wow. That sounds complex. But it is not. It can be used to group large amounts of data and compute operations on these groups.

The best way to learn is to see an example.

```
import pandas as pd
data = {'Items': ['Apple','Orange', 'Pear', 'Orange', 'Apple'],
'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)
```

This results in this DataFrame.

The **groupby** method can group the items together, and apply a function. Let’s try it here.

```
df.groupby(['Items']).mean()
```

This will result in this output.

As you see, it has grouped the **Apples**, **Oranges**, and **Pears** together and, for the Price column, applied the **mean()** function to the values.

Hence, **Apple** has the value 18, as it is the mean of 12 and 24 ((12 + 24)/2). Similarly, **Orange** has the value 6 ((5 + 7)/2) and **Pear** keeps its single value 3.
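To make the split, apply, combine idea concrete, here is a minimal sketch that does the three steps by hand on the same data and then lets groupby do them in one call. The names `manual` and `grouped` are only illustrative.

```python
import pandas as pd

data = {'Items': ['Apple', 'Orange', 'Pear', 'Orange', 'Apple'],
        'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)

# Split the rows per item and apply mean() to each group by hand.
manual = {}
for item in df['Items'].unique():
    prices = df.loc[df['Items'] == item, 'Price']
    manual[item] = prices.mean()

# groupby splits, applies and combines in a single call.
grouped = df.groupby('Items')['Price'].mean()

print(manual)
print(grouped)
```

Both approaches give the same numbers; groupby is just shorter and usually faster on large data.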

We get more and more data and our projects get bigger and bigger. At some point you will need to analyze how much memory your data is using.

**memory_usage** returns the memory usage of each column in the DataFrame, in bytes. Sometimes the data type of a column is object, which means that each entry points to another Python object. To include the memory used by these objects, you need to pass the **deep=True** argument.

Let’s try both, to see the difference.

```
import numpy as np
import pandas as pd
dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
# One column of 1000 ones for each dtype
data = {t: np.ones(shape=1000, dtype=int).astype(t) for t in dtypes}
df = pd.DataFrame(data)
print(df.head())
```

Then we can get the memory usage as follows.

```
print(df.memory_usage())
```

Giving the following.

```
Index 128
int64 8000
float64 8000
complex128 16000
object 8000
bool 1000
dtype: int64
```

Also, with deep=True.

```
df.memory_usage(deep=True)
```

Giving the following, where you see that the object column uses more space.

```
Index 128
int64 8000
float64 8000
complex128 16000
object 36000
bool 1000
dtype: int64
```
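If you only care about the difference between the two calls, you can subtract them. The sketch below uses a smaller two-column DataFrame (an assumption made to keep it short) and also sums the result to get a total in bytes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'int64': np.ones(1000, dtype='int64'),
    'object': np.ones(1000, dtype=int).astype('object'),
})

shallow = df.memory_usage()        # object columns: only the references are counted
deep = df.memory_usage(deep=True)  # also counts the Python objects they point to

print(deep - shallow)  # extra bytes found by deep=True, per column
print(deep.sum())      # total footprint of the DataFrame in bytes
```

The exact byte counts can vary with the pandas version and platform, so treat the numbers as indicative.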

**clip()** trims values at the given lower and upper thresholds.

I find this is easiest to understand by inspecting an example.

```
import pandas as pd
data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)
print(df)
```

Then we apply clip, which ensures that values below -2 are replaced with -2 and values above 5 are replaced with 5.

```
print(df.clip(-2, 5))
```
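You do not have to give both thresholds. As a small sketch on the same data, the `lower` keyword on its own removes negative values and leaves large values alone.

```python
import pandas as pd

data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)

# Both bounds: everything ends up between -2 and 5.
clipped = df.clip(-2, 5)
print(clipped)

# Only a lower bound: negative values become 0.
non_negative = df.clip(lower=0)
print(non_negative)
```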

The correlation between the columns of a DataFrame can be calculated with **corr()**. There are different methods to use: **Pearson**, **Kendall**, and **Spearman**. By default it uses the **Pearson** method, which will do fine for giving you an idea of whether columns are correlated.

Let’s try an example.

```
import pandas as pd
df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
columns=['dogs', 'cats'])
```

The correlation is given by.

```
print(df.corr())
```

The value 1.0 means perfect correlation, which is what the diagonal shows. This makes sense, as the diagonal is each column compared with itself.
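To switch method, pass the `method` argument. A short sketch on the same data, comparing the default with the rank-based Spearman method:

```python
import pandas as pd

df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                  columns=['dogs', 'cats'])

pearson = df.corr()                    # default, same as method='pearson'
spearman = df.corr(method='spearman')  # based on the ranks of the values

print(pearson)
print(spearman)
```

Both matrices are symmetric, and for this data both show a negative correlation between dogs and cats.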

To learn more about correlation and statistics, be sure to check this tutorial out, which also explains the correlation value and how to interpret it.

The name **argmin** is a bit strange. It returns the position (the integer index) of the smallest value in a Series (a column of a DataFrame).

```
import pandas as pd
s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})
print(s)
```

Gives.

```
Corn Flakes 100.0
Almond Delight 110.0
Cinnamon Toast Crunch 120.0
Cocoa Puff 110.0
dtype: float64
```

And to get the position of the smallest value, just apply the method.

```
print(s.argmin())
```

Which will give **0**. Remember that it is zero-indexed, meaning that the first element has index 0.

Just like **argmin**, **argmax()** returns a position, this time of the largest element in a Series.

Continue with the example from above.

```
print(s.argmax())
```

This will give **2**, as that is the position of the largest element in the series.
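Often you want the label and not the position. For that, a Series also has **idxmin()** and **idxmax()**. A small sketch with the same cereal data:

```python
import pandas as pd

s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})

# argmin/argmax give integer positions, idxmin/idxmax give index labels.
print(s.argmin(), s.idxmin())
print(s.argmax(), s.idxmax())
```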

Want to know the differences between DataFrames? Then **compare** does a great job at that.

```
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"col1": [1.0, 2.0, 3.0, np.nan, 5.0],
"col2": [1.0, 2.0, 3.0, 4.0, 5.0]
},
columns=["col1", "col2"],
)
```

We can compare the columns here.

```
df['col1'].compare(df['col2'])
```

As you see, the only row that differs is the one with index 3, where col1 is NaN and col2 is 4.0.
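By default compare only returns the rows that differ. As a sketch, the `keep_shape=True` argument keeps all rows, with NaN where the values are equal.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Only the differing rows, in two columns named 'self' and 'other'.
diff = df['col1'].compare(df['col2'])
print(diff)

# Same comparison, but with one row per original row.
full = df['col1'].compare(df['col2'], keep_shape=True)
print(full)
```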

Did you ever need to replace a value in a DataFrame? Well, it also has a method for that and it is called **replace()**.

```
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
```

Let’s try to replace 5 with -10 and see what happens.

```
print(df.replace(5, -10))
```
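replace also accepts a dictionary, which is handy when you want to swap several values at once. A minimal sketch on the same DataFrame (the replacement values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

# A single value: every 5 becomes -10.
replaced = df.replace(5, -10)
print(replaced)

# Several values at once: keys are old values, values are new ones.
multi = df.replace({0: 100, 9: 900})
print(multi)
```

Notice that replace returns a new DataFrame and leaves the original untouched.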

Want to find missing values? Then **isna** can do that for you.

Let’s try it.

```
import pandas as pd
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))
```

Then you get the values as follows.

```
print(df.isna())
```

I often use it in combination with **sum()**, which then tells you how many values are missing in each column. This is useful for getting an idea of the quality of the dataset.

```
print(df.isna().sum())
```

```
age 1
born 1
name 0
toy 1
dtype: int64
```
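Notice that the empty string in the name column is not counted as missing. Two more combinations I find useful, sketched on the same data: the share of missing values per column, and the rows that have at least one missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                       born=[pd.NaT, pd.Timestamp('1939-05-27'),
                             pd.Timestamp('1940-04-25')],
                       name=['Alfred', 'Batman', ''],
                       toy=[None, 'Batmobile', 'Joker']))

# Fraction of missing values per column (mean of True/False).
share = df.isna().mean()
print(share)

# Rows with at least one missing value.
incomplete = df[df.isna().any(axis=1)]
print(incomplete)
```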

On the subject of missing values, what should you do about them? There are many options, but a simple one is to **interpolate** the values.

```
import pandas as pd
import numpy as np
s = pd.Series([0, 1, np.nan, 3])
```

This gives the following series.

```
0 0.0
1 1.0
2 NaN
3 3.0
dtype: float64
```

Then you can interpolate and get the value between them.

```
print(s.interpolate())
```

```
0 0.0
1 1.0
2 2.0
3 3.0
dtype: float64
```

This is just one way to deal with it. Dealing with missing values is a big subject. To learn more read this tutorial on the subject.
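To see how interpolation differs from another common option, here is a small sketch with two missing values in a row (a made-up series), compared with forward filling using **ffill()**.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, np.nan, np.nan, 4])

# Linear interpolation spreads the gap evenly between 1 and 4.
interpolated = s.interpolate()
print(interpolated)

# Forward fill repeats the last known value instead.
filled = s.ffill()
print(filled)
```

Which one is right depends on your data; interpolation assumes the values change smoothly between the known points.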

Ever needed to remove a column in a DataFrame? Well, again they made a method for that.

Let’s try the **drop()** method to remove a column.

```
import pandas as pd
data = {'Age': [-44,0,5, 15, 10, -3],
'Salary': [0,5,-2, -14, 19, 24]}
df = pd.DataFrame(data)
```

Then let’s remove the **Age** column.

```
df2 = df.drop('Age', axis='columns')
print(df2)
```

Notice that it returns a new DataFrame; the original df is unchanged.
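drop can also be called with the `columns` and `index` keywords, which some find easier to read than `axis`. A short sketch on the same data:

```python
import pandas as pd

data = {'Age': [-44, 0, 5, 15, 10, -3],
        'Salary': [0, 5, -2, -14, 19, 24]}
df = pd.DataFrame(data)

# Same as df.drop('Age', axis='columns').
no_age = df.drop(columns='Age')
print(no_age)

# Remove rows by index label instead of columns.
fewer_rows = df.drop(index=[0, 1])
print(fewer_rows)
```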

Dealing with data that has duplicate rows? Well, it is a common problem and pandas made a method to easily remove them from your DataFrame.

It is called **drop_duplicates** and does what it says.

Let’s try it.

```
import pandas as pd
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
```

This DataFrame has duplicate rows. Let’s see how they can be removed.

```
df2 = df.drop_duplicates()
print(df2)
```
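Sometimes rows count as duplicates even if only some columns match. As a sketch, the `subset` argument limits the comparison to the columns you name, and `keep` decides which of the duplicates survives.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Full-row duplicates: only the second 'Yum Yum' row is removed.
all_cols = df.drop_duplicates()
print(all_cols)

# Only compare brand and style: the second 'Indomie pack' row goes too.
by_brand_style = df.drop_duplicates(subset=['brand', 'style'])
print(by_brand_style)

# Keep the last occurrence instead of the first.
keep_last = df.drop_duplicates(subset=['brand', 'style'], keep='last')
print(keep_last)
```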

Ever needed to sum a column? Even with a multi-index?

Let’s try.

```
import pandas as pd
idx = pd.MultiIndex.from_arrays([
['warm', 'warm', 'cold', 'cold'],
['dog', 'falcon', 'fish', 'spider']],
names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
print(s)
```

This will output.

```
blooded animal
warm dog 4
falcon 2
cold fish 0
spider 8
Name: legs, dtype: int64
```

Then this will sum the column.

```
print(s.sum())
```

And it will output **14**, as expected.
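The multi-index becomes useful when you want one sum per group. A minimal sketch, assuming you want the number of legs per value of the `blooded` level, is to group on that level first.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)

# One sum for the whole Series.
total = s.sum()
print(total)

# One sum per value of the 'blooded' index level.
per_group = s.groupby(level='blooded').sum()
print(per_group)
```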

Want to make a cumulative sum? Then **cumsum()** does the job for you, even with missing numbers.

```
import numpy as np
import pandas as pd
s = pd.Series([2, np.nan, 5, -1, 0])
print(s)
```

This will give.

```
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64
```

And then.

```
print(s.cumsum())
```

Gives.

```
0 2.0
1 NaN
2 7.0
3 6.0
4 6.0
dtype: float64
```

It makes a cumulative sum down the column, skipping the missing value and leaving NaN in its place.
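How the missing value is treated can be changed. The sketch below shows two alternatives on the same series: `skipna=False`, which lets the NaN spread to everything after it, and filling with 0 before summing.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, np.nan, 5, -1, 0])

# Default: NaN is skipped, the running total continues after it.
default = s.cumsum()
print(default)

# skipna=False: once a NaN is hit, the rest of the result is NaN.
strict = s.cumsum(skipna=False)
print(strict)

# Treat missing as 0, so the result has no NaN at all.
zero_filled = s.fillna(0).cumsum()
print(zero_filled)
```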

The **value_counts()** method returns how many times each unique row occurs in a DataFrame.

This requires an example to really understand.

```
import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])
```

Here we see we have two rows with 4 and 0 (in that order), while the other rows have unique values.

```
print(df.value_counts())
```

```
num_legs num_wings
4 0 2
2 2 1
6 0 1
dtype: int64
```

We see there are two rows with 4 and 0, and one of each of the other rows.
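Two variations are worth knowing, sketched here on the same data: `normalize=True` gives shares instead of counts, and value_counts also works on a single column.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])

# Counts per unique row, then the same as fractions of all rows.
counts = df.value_counts()
shares = df.value_counts(normalize=True)
print(shares)

# Counts for one column only.
legs = df['num_legs'].value_counts()
print(legs)
```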

Want the unique elements in your Series?

Here you go.

```
import pandas as pd
s = pd.Series([2, 1, 3, 3], name='A')
print(s.unique())
```

This will give the unique elements.

```
array([2, 1, 3])
```
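If you only need to know how many unique elements there are, **nunique()** gives you the number directly. A small sketch with the same series:

```python
import pandas as pd

s = pd.Series([2, 1, 3, 3], name='A')

# The unique values, in order of first appearance.
values = s.unique()
print(values)

# Just the number of unique values.
n_unique = s.nunique()
print(n_unique)
```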

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

- **15 video lessons**: covers the Data Science Workflow and concepts, demonstrates everything on real data, introduces projects and shows a solution (**YouTube video**).
- **30 Jupyter Notebooks**: with the full code and explanation from the lectures and projects (GitHub).
- **15 projects**: structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).

