15 Most Useful pandas Shortcut Methods

What will you learn?

Everybody likes the pandas data structure DataFrame, but most miss out on the powerful methods it provides.

pandas is a huge module, which makes it difficult to master. Most just use the data structure (DataFrame) without utilizing the power of its methods. In this tutorial you will learn the 15 most useful shortcut methods that will help you when working with data in pandas data structures.

#1 groupby

The groupby method involves some combination of splitting the object, applying a function, and combining the result.

Wow. That sounds complex. But it is not. It can be used to group large amounts of data and compute operations on these groups.

The best way to learn is to see some examples.

import pandas as pd

data = {'Items': ['Apple','Orange', 'Pear', 'Orange', 'Apple'], 
        'Price': [12, 5, 3, 7, 24]}
df = pd.DataFrame(data)

This results in the following DataFrame.
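
    Items  Price
0   Apple     12
1  Orange      5
2    Pear      3
3  Orange      7
4   Apple     24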

The groupby method can group the items together, and apply a function. Let’s try it here.

df.groupby(['Items']).mean()

This will result in the following output.
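
        Price
Items        
Apple    18.0
Orange    6.0
Pear      3.0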

As you see, it has grouped the Apples, Oranges, and Pears together, and for the Price column it has applied the mean() function to the values.

Hence, Apple has the value 18.0, as it is the mean of 12 and 24 ((12 + 24)/2). Similarly for Orange and Pear.

#2 memory_usage()

We get more and more data, and our projects get bigger and bigger. At some point you will need to analyze how much memory your data is using.

What memory_usage() does is return the memory usage of each column in the DataFrame. Sometimes the data type of a column is object, which means it is pointing to another object. To include the memory usage of these objects, you need to use the deep=True argument.

Let’s try both, to see the difference.

import pandas as pd
import numpy as np

dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
data = dict([(t, np.ones(shape=1000, dtype=int).astype(t)) for t in dtypes])
df = pd.DataFrame(data)

print(df.head())

Then we can get the memory usage as follows.

print(df.memory_usage())

Giving the following.

Index           128
int64          8000
float64        8000
complex128    16000
object         8000
bool           1000
dtype: int64

Also, with deep=True.

df.memory_usage(deep=True)

Giving the following, where you see the object column uses more space.

Index           128
int64          8000
float64        8000
complex128    16000
object        36000
bool           1000
dtype: int64

#3 clip()

clip() can trim values at the input threshold.

I find this easiest to understand by inspecting an example.

import pandas as pd

data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
df = pd.DataFrame(data)

print(df)

Then we apply clip, which will ensure that values below -2 are replaced with -2, and values above 5 are replaced with 5. It clips the values.

print(df.clip(-2, 5))
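
This will output.

   col_0  col_1
0      5     -2
1     -2     -2
2      0      5
3     -1      5
4      5     -2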

#4 corr()

The correlation between the values in columns can be calculated with corr(). There are different methods to use: Pearson, Kendall, and Spearman. By default it uses the Pearson method, which will do fine for giving you an idea of whether columns are correlated.

Let’s try an example.

import pandas as pd

df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
                  columns=['dogs', 'cats'])

The correlation is given by.

print(df.corr())
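
This will output (values approximate).

          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000

You can also pass another method, e.g. df.corr(method='spearman') or df.corr(method='kendall').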

The value 1.0 says it is a perfect correlation, which is shown in the diagonal. This makes sense, as the diagonal compares each column with itself.

To learn more about correlation and statistics, be sure to check this tutorial out, which also explains the correlation value and how to interpret it.

#5 argmin()

The name argmin is a bit strange. What it does is return the position (the index) of the smallest value in a Series (a column of a DataFrame).

import pandas as pd

s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})

print(s)

Gives.

Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64

And to get the position of the smallest value, just apply the method.

print(s.argmin())

Which will give 0. Remember that it is zero-indexed, meaning that the first element has index 0.

#6 argmax()

Just like argmin, argmax() returns the position of the largest value in a Series.

Continue with the example from above.

print(s.argmax())

This will give 2, as the largest value is at position 2 in the Series.

#7 compare()

Want to know the differences between DataFrames? Then compare does a great job at that.

import pandas as pd
import numpy as np

df = pd.DataFrame(
     {
         "col1": [1.0, 2.0, 3.0, np.nan, 5.0],
         "col2": [1.0, 2.0, 3.0, 4.0, 5.0]
     },
     columns=["col1", "col2"],
)

We can compare the columns here.

df['col1'].compare(df['col2'])
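
This will output.

   self  other
3   NaN    4.0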

As you see, the only row that differs is row 3, where col1 has NaN and col2 has 4.0.

#8 replace()

Did you ever need to replace a value in a DataFrame? Well, it also has a method for that and it is called replace().

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

Let’s try to replace 5 with -10 and see what happens.

print(df.replace(5, -10))
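
This will output.

   A    B  C
0  0  -10  a
1  1    6  b
2  2    7  c
3  3    8  d
4  4    9  e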

#9 isna()

Wanted to find missing values? Then isna can do that for you.

Let’s try it.

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(age=[5, 6, np.nan],
                  born=[pd.NaT, pd.Timestamp('1939-05-27'),
                        pd.Timestamp('1940-04-25')],
                  name=['Alfred', 'Batman', ''],
                  toy=[None, 'Batmobile', 'Joker']))

Then you get the values as follows.

print(df.isna())
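
This will output.

     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False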

I often use it in combination with sum(), which will tell you how many values in each column are missing. This is useful to get an idea about the quality of the dataset.

print(df.isna().sum())
age     1
born    1
name    0
toy     1
dtype: int64

#10 interpolate()

On the subject of missing values, what should you do? Well, there are many options, but one simple option is to interpolate the values.

import pandas as pd
import numpy as np

s = pd.Series([0, 1, np.nan, 3])

This gives the following series.

0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64

Then you can interpolate and get the value between them.

print(s.interpolate())
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

This is just one way to deal with it. Dealing with missing values is a big subject. To learn more read this tutorial on the subject.

#11 drop()

Ever needed to remove a column in a DataFrame? Well, again they made a method for that.

Let’s try the drop() method to remove a column.

import pandas as pd

data = {'Age': [-44,0,5, 15, 10, -3], 
        'Salary': [0,5,-2, -14, 19, 24]}
df = pd.DataFrame(data)

Then let’s remove the Age column.

df2 = df.drop('Age', axis='columns')
print(df2)
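
This will output.

   Salary
0       0
1       5
2      -2
3     -14
4      19
5      24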

Notice that it returns a new DataFrame; the original df is unchanged.

#12 drop_duplicates()

Dealing with data that has duplicate rows? Well, it is a common problem and pandas made a method to easily remove them from your DataFrame.

It is called drop_duplicates and does what it says.

Let’s try it.

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

This DataFrame has duplicate rows. Let's see how they can be removed.

df2 = df.drop_duplicates()
print(df2)
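
This will output.

     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0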

#13 sum()

Ever needed to sum a column? Even with a MultiIndex?

Let’s try.

import pandas as pd

idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
print(s)

This will output.

blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64

Then this will sum the column.

print(s.sum())

And it will output 14, as expected.
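
Since the Series has a MultiIndex, you can also sum per level. A small sketch (grouping on the blooded level from above):

print(s.groupby(level='blooded').sum())

This gives cold 8 and warm 6.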

#14 cumsum()

Wanted to make a cumulative sum? Then cumsum() does the job for you, even with missing numbers.

import pandas as pd
import numpy as np

s = pd.Series([2, np.nan, 5, -1, 0])
print(s)

This will give.

0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

And then.

print(s.cumsum())

Gives.

0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

It makes a cumulative sum down the column. Notice that the missing value stays missing, while the sum continues past it.

#15 value_counts()

The value_counts() method returns the counts of unique rows in a DataFrame.

This requires an example to really understand.

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])

Here we see we have two rows with 4 and 0 (in that order), while the other rows have unique values.

print(df.value_counts())
num_legs  num_wings
4         0            2
2         2            1
6         0            1
dtype: int64

We see there are two rows with 4 and 0, and one of each of the other rows.

Bonus: unique()

Wanted the unique elements in your Series?

Here you go.

import pandas as pd

s = pd.Series([2, 1, 3, 3], name='A')
print(s.unique())

This will give the unique elements.

array([2, 1, 3])

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

Then check my free Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – covering the Data Science Workflow and concepts, demonstrating everything on real data, introducing projects, and showing solutions (YouTube videos).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow, with solutions explained at the end of the video lessons (GitHub).

11 Great List Methods That Most Don’t Know

What will you learn?

After this list you will know the most useful methods and techniques for Python lists. Funnily enough, many of these are unknown to many people who work with Python. Be sure not to be one of them, as these can save you a lot of time.

#1 in

This one is probably my favorite: checking if an element is in a list. This is easy in Python and just why we can't stop loving it.

this_list = ['foo', 'bar', 'foobar', 'barfoo']

if 'foo' in this_list:
    print('yes')

if 'foofoo' in this_list:
    print('no')

This will only print yes as ‘foofoo‘ is not an element in the list this_list.

On the other hand ‘foo‘ is in the list, hence it prints yes.

#2 sorted()

This is actually a built-in function and it returns a new sorted list from the items in an iterable.

An iterable is a list or anything you can iterate over in Python.

Let’s try it.

this_list = ['foo', 'bar', 'foobar', 'barfoo']

print(sorted(this_list))
print(sorted(this_list, reverse=True))

This will output.

['bar', 'barfoo', 'foo', 'foobar']
['foobar', 'foo', 'barfoo', 'bar']

Hence, it can also sort in reverse order.

It is important to understand that it returns a new list with the items from the input argument.

Want to learn more Built-in functions?

#3 sort()

This is kind of funny. And it might look like the same thing. But there is a big difference.

The method sort() will sort the list in-place and not return a new list.

Let’s try it.

this_list = ['foo', 'bar', 'foobar', 'barfoo']

this_list.sort()
print(this_list)

This will output.

['bar', 'barfoo', 'foo', 'foobar']

What was the difference? Well, sorted() returns a new list, while sort() sorts in-place and changes the order of the elements of the original list.

Why does that matter?

There can be many reasons – but one major one is if the list is huge and you need to save space and time – then in-place sorting is both more memory- and speed-efficient.

#4 for

Again, a favorite of mine. Iterating over a list in Python is such a pleasure. Why make complex syntax for something you need all the time?

Well, in Python it is easy to do.

this_list = ['bar', 'barfoo', 'foo', 'foobar']

for item in this_list:
    print(item)

This will output.

bar
barfoo
foo
foobar

#5 append()

Adding an element to the end of a list is something you need all the time. Again, Python does it in a simple manner.

this_list = ['bar', 'barfoo', 'foo', 'foobar']

this_list.append('foobarfoo')

print(this_list)

This will output.

['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo']

And see the added (appended) element at the end of the list.

#6 concatenate lists

This one is great. Again Python goes the extra mile to make things as simple as possible.

If you need to concatenate two lists, how is that done?

Well, with the addition sign. See this.

list_a = ['foo', 'bar']
list_b = ['foobar', 'barfoo']

list_c = list_a + list_b

print(list_c)

This will output.

['foo', 'bar', 'foobar', 'barfoo']

#7 index()

Sometimes you need the index of an element in a list.

Well, here you go.

this_list = ['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo']

print(this_list.index('foo'))

This will output 2, as ‘foo‘ has the first occurrence in the list at index 2 (remember, that Python lists are zero indexed – meaning that the first element is 0, second element is 1, etc.).

#8 copy()

Sometimes you need a new copy of a list. Let’s try and understand what that means.

this_list = ['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo']

list_copy = this_list.copy()

list_copy.append('me element')

print(list_copy)
print(this_list)

This will output.

['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo', 'me element']
['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo']

As you see, the append only modifies the list it is appended to. This might not surprise you.

But let’s try something different.

a = this_list

a.append('element')

print(a)
print(this_list)

This will output.

['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo', 'element']
['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo', 'element']

Oh no – the element got appended to both lists – or what?

No, actually it is the same list, it is just two variables, which point to the same list.

Now copy starts to make sense.

#9 remove()

Now what if you need to remove an element from the list?

Let’s try how easy it could be.

this_list = ['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo', 'element']

this_list.remove('element')

print(this_list)

This will output.

['bar', 'barfoo', 'foo', 'foobar', 'foobarfoo']

Oh, it was that simple.

#10 pop()

Actually a Python list can be used as a Stack out of the box.

Append (as we saw above) pushes an element to the back of the list – or you could say, the top of the stack.

And then, what does pop do?

this_list = ['barfoo', 'foo', 'foobar', 'foobarfoo']

element = this_list.pop()

print(element)
print(this_list)

This will output.

foobarfoo
['barfoo', 'foo', 'foobar']

As you see, this works just like a stack.

You might complain that it does not have the same performance (big-O) as a real stack. Well, because the list is implemented in such an awesome way, it has asymptotically the same (amortized) big-O complexity. What does that mean? That it is just as good as a “real” stack.

#11 del

This is actually one I don’t use that much – but it is handy if you like this notation.

Let’s just try it.

this_list = ['barfoo', 'foo', 'foobar']

del this_list[1]

print(this_list)

This will delete element at position 1 and give this output.

['barfoo', 'foobar']

Wow. That was great.

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

9 Python Mistakes only Beginners Make

What will we cover?

To get better at Python programming, it is good to explore the common mistakes beginners make. This will help you avoid the same pitfalls and understand why you shouldn't make them.

#1 Use general imports

A common mistake beginners make is importing with the wildcard (*).

from module import *

First of all, this is bad practice and can make your program slow, as you import modules you don't need.

Also, this can shadow variable names (the same name coming from different sources) and make it difficult to know which variable you are referring to.
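
As a small sketch of the shadowing problem (using math and numpy purely as illustration; both define a sqrt function):

from math import *
from numpy import *

# sqrt now refers to numpy's sqrt - the later wildcard import
# silently shadowed math's sqrt
print(sqrt(2))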

Instead you should import the specific objects or whole modules you need.

You should do the following if the module name is short.

import module

This will give you access to all the sub-modules with syntax like: module.submodule.

If the module name is long, then you should do as follows.

import longmodulename as lmn

This gives access to the module with lmn.

If you only need one (or a few, but not too many) elements from a module, then do the following.

from module import xx

The above 3 ways of doing it make the code easy to read and understand.

#2 try/except Without Specific Exception

It is common for beginners to have a broad, bare except clause, as it is easy.

An example could be like this.

try:
    ...  # do stuff
except:
    print('Some stuff went wrong!')

First of all, it does not follow the PEP8 standard. You should always try to follow PEP8, as it makes your code easier to read, maintain, and understand for others and yourself later when you need to modify, extend, or debug your code.

Also, it catches all exceptions, like SystemExit and KeyboardInterrupt (Ctrl-C), which can make it difficult to terminate your program.

What you should do is to catch specific exceptions.


try:
    ...  # do your awesome stuff
except ValueError as e:
    ...  # Handle this exception here
except NameError as e:
    ...  # Handle this exception here

Notice, you can have as many catchers (except) as you want for any exception type you need to handle.

#3 Not closing file

It is typical for beginners not to handle file closing properly.

f = open('file.txt', 'w')
f.write('some stuff')

Most Python interpreters close the file at the end of the program. But others don’t.

Even though the Python interpreter closes it for you at the end, it takes up unnecessary resources from the underlying operating system to keep a file pointer open after you need it. The file object can also keep buffers of data that still need to be written to storage. This can have undesired consequences if the program terminates unexpectedly.

This is especially a problem if your program is a long-running service, and it can also make the operating system run out of file handles. It is bad practice not to close them after use.

The best practice is to use the with-statement.

with open('file.txt', 'w') as f:
    f.write('some stuff')
# Here the file pointer is closed.

This closes the file pointer after the with-statement.

#4 Not following PEP8 standard

We already talked about it, let’s talk about it again. You should always try to follow the PEP8 standard.

The reasons are clear.

  • It makes your code easy to understand for others – and yourself.

But what to do if you don’t know the PEP8 standard?

Well, I don’t know it by heart either, but there are checkers that can do that for you.

Install pep8 and let it check your code and tell you what to correct.

pip install pep8

To check a Python source file, simply run the following and follow the instructions of what to do.

pep8 python_file.py

#5 Dictionary Iteration with Keys

Actually, many beginners don't know that you can iterate over dictionaries in Python. Once they realize it, they often just iterate over the keys.

capitals = {'Denmark': 'Copenhagen', 'Sweden': 'Stockholm', 'Norway': 'Oslo'}

for key in capitals.keys():
    print(key, capitals[key])

But if you want to get the key and value pairs, you should do as follows.

for key, value in capitals.items():
    print(key, value)

#6 Not using List Comprehension

If you don't know list comprehensions, then you often fall prey to writing loops that add lines of code that are not needed.

A simple example is converting a list of strings to a list of the same strings in lowercase.

my_list = ['UPPER', 'CASE', 'STUFF']

lower_case = []
for item in my_list:
    lower_case.append(item.lower())

This makes a simple transformation of a list quite difficult to read.

Using List Comprehension makes it easier.

lower_case = [item.lower() for item in my_list]

If you don't know List Comprehension or want to learn more tricks with it, check out the following post.

#7 Using range(len(…))

I have done this myself as a beginner. I simply didn’t know the awesome built-in functions in Python.

Say, you want to iterate over two lists simultaneously and process pairs of elements from each list. Then you might end up with the following code.

list_0 = [1, 2, 3]
list_1 = [4, 5, 6]

for i in range(len(list_0)):
    print(list_0[i], list_1[i])

This is bad practice and assumes that the lengths of the lists are the same.

The correct way to do this is to use the built-in function zip.

for i1, i2 in zip(list_0, list_1):
    print(i1, i2)

This also handles the case where one list is longer than the other, as zip stops at the shortest list.

If you want to learn more about the built-in functions in Python read the following guide.

#8 Use + to format strings

Most Python beginners get so excited about how easy it is to concatenate strings in Python using the addition operator that they end up using it for everything, like formatting strings.

name = "Rune"
my_str = "Hello " + name
print(my_str)

This is actually inefficient as it generates a new string for each addition. Also, it just looks bad.

What you should use is formatted strings.

my_str = f'Hello {name}'
print(my_str)

If you don’t know formatted strings, then you should read this guide.

#9 Using Non-explicit Variable Names

Here your PEP8 checker will catch you – but this is one of those mistakes that happen all the time. You create a variable with a single character.

x = 'my identifier'

While you code, it is the easiest choice. Especially, if the variable is just used in a small context.

But it makes your code difficult to understand and hard to debug.

Get those bad practices out of your system and give your variables meaningful names.

target_id = 'my identifier'

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

How to make a Formatted Word Cloud in 7 Steps

What will you learn?

At the end of this tutorial you will know how to make a formatted word cloud with Python like this one.

Step 1: Read content

The first thing you need is some content to compute word frequencies on.

In this example we will use the books of Sherlock Holmes – which are available in my GitHub here.

You can clone the repo or just download the full repository as a zip file from the green Code dropdown menu. Then you should see a folder with all the Holmes texts.

We will read them here.

import os

content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

Of course you can have any other set of text files.

The result in content is a list with the full text of each file. Each entry will be raw text with newlines.

Step 2: Corpus in lower case

Here we will use the NLTK tokenizer to get each word. Note that word_tokenize needs the punkt tokenizer models, which you may need to download first.

import nltk
nltk.download('punkt')

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

This creates a list of each word in lower case.

We use list comprehension. If you are new to that check this tutorial.

Step 3: Remove stop words

Stop words are words with little or no meaning. We do not want to include them in our word cloud, as they are common and take up a lot of space. The stopwords corpus may also need to be downloaded first.

from nltk.corpus import stopwords
nltk.download('stopwords')

corpus = [w for w in corpus if w not in stopwords.words('english')]

Again we use list comprehension.

Step 4: Keep alphanumeric words

This can also be done by list comprehension.

corpus = [w for w in corpus if w.isalnum()]

Step 5: Lemmatize words

To lemmatize words is to get them in their root form. We don't want to have the same word in different forms; we only need it in the basic form. This is what lemmatizing does.

from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # needed by nltk.pos_tag below
from nltk.stem.wordnet import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

corpus = [WordNetLemmatizer().lemmatize(w, get_wordnet_pos(w)) for w in corpus]

Again we use list comprehension to achieve the result.

Step 6: Create a Word Cloud

First we create a simple word cloud.

from wordcloud import WordCloud

unique_string = " ".join(corpus)

wordcloud = WordCloud(width = 1000, height = 500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

This will create an image word_cloud.png similar to this one.

Step 7: Create a formatted Word Cloud

To do that we need a mask. We will use the cloud.png from the repository.

import numpy as np
from PIL import Image

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width = 1000, height = 500, background_color="white",
               mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

This will generate a picture like this one.

Full code

You can get the full code from my GitHub repository.

If you clone it you get the full code as well as all the files you need.

import nltk
from nltk.corpus import stopwords
import os
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import numpy as np
from PIL import Image

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

corpus = [w for w in corpus if w not in stopwords.words('english')]

corpus = [w for w in corpus if w.isalnum()]


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


corpus = [WordNetLemmatizer().lemmatize(w, get_wordnet_pos(w)) for w in corpus]


unique_string = " ".join(corpus)

wordcloud = WordCloud(width = 1000, height = 500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

7 Powerful f-string Techniques That Will Blow Your Mind

What will you learn?

What is an f-string? Most know that but almost nobody knows the real power of f-strings. In this guide you will learn about it.

An f-string helps you get the string representations of variables.

name = 'Rune'
age = 32

print(f'Hi {name} you are {age} years old')

This will result in the output: ‘Hi Rune you are 32 years old’.

Most know that this is the structure of an f-string.

f'Some text {variable_a} and {variable_b}'

The structure.

  • It starts with f and has quotes afterwards: f’String content’ or f”String content”.
  • Then it will add the string representation of any variable within curly brackets: f’String content {variable_a}’

But there is more power to unleash.

#1 String representation of a class

This is actually a great way to learn about objects in general. If you implement a __str__(self) method, it will be the string representation of the object. And the best part is that an f-string will use that value as the string representation of it.

class Item:
    def __init__(self, a):
        self.a = a

    def __str__(self):
        return str(self.a)

item = Item(12)

print(f'Item: {item}')

This will print ‘Item: 12‘.

#2 Date and time formatting

This is an awesome feature. You can format a date object as you wish.

from datetime import datetime

today = datetime.now()


print(f'Today is {today}')
# 'Today is 2022-04-13 13:13:47.090745'

print(f'Today is {today:%Y-%m-%d}')
# 'Today is 2022-04-13'

print(f'Time is {today:%H:%M:%S}')
# 'Time is 13:13:47'

#3 Variable names

Another great one is that you can actually include the variable names in the output. This is a great feature when you debug or add variables to the log.

x = 10
y = 20

print(f'{x = }, {y = }')
# 'x = 10, y = 20'

print(f'{x=}, {y=}')
# 'x=10, y=20'

#4 Class representation

Now this is not the same as the first one. An object can also have a class representation, given by __repr__.

class Price:
    def __init__(self, item, price):
        self.item = item
        self.price = price
        
    def __str__(self):
        return f'{self.item} {self.price}'
    
    def __repr__(self):
        return f'Item {self.item}  costs {self.price} dollars'

p = Price('Car', 10)

print(f'{p}')
# 'Car 10'

print(f'{p!r}')
# 'Item Car  costs 10 dollars'

#5 Formatting specification

You can apply a lot of formatting to the output.

Here are a few of them.

s = 'Hello, World!'

# Center output
print(f'{s:^40}')
# '             Hello, World!              '

# Left align
print(f'{s:<40}')
# 'Hello, World!                           '

# Right align
print(f'{s:>40}')
# '                           Hello, World!'

n = 9000000
print(f'{n:,}')
# '9,000,000'

print(f'{n:+015}')
# '+00000009000000'

#6 Nested f-strings

You can actually have f-strings within f-strings. This can have a few use-cases like these.

number = 254.3463

print(f"{f'${number:.2f}':>20s}")
# '             $254.35'

v = 3.1415
width = 10
precision = 3

print(f'output {v:{width}.{precision}}')
# 'output       3.14'

#7 Conditional formatting

There might be cases where this is useful.

v = 42.0

print(f'output {v:{"4.3" if v < 100 else "3.2"}}')
# 'output 42.0'

v = 142.0

f'output {v:{"4.3" if v < 100 else "3.2"}}'
# 'output 1.4e+02'

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

How to Use Generators in Python and 3 Use-cases That Simplify Your Code

What will you learn?

What is a Generator in Python and how to use them to work with large datasets in a Pythonic fashion.

What is a Generator?

A Generator is a function that returns a lazy iterator. Said differently, you can iterate over the iterator, but it is lazy, that is, it only executes the code when iterated over.

A simple example could be as follows.

def my_generator():
    # Do something
    yield 5
    # Do something more
    yield 8
    # Do something else
    yield 12

Then you can iterate over the generator as follows.

for item in my_generator():
    print(item)

This will print 5, 8, and 12.

At first sight, this doesn't look very useful. But let's understand a bit better what happens.

When we make the first iteration in the for-loop, then it will execute the code in the my_generator function until it reaches the first yield.

Then it stops and returns the value after yield.

In the next iteration, it will continue where it left off and execute until it reaches the next yield.

Then it stops and returns the value after yield.

And so forth until no more yield statements are there.
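
You can see the laziness explicitly by driving the generator manually with next() (a small sketch using my_generator from above):

gen = my_generator()

print(next(gen))  # runs until the first yield, prints 5
print(next(gen))  # resumes, prints 8
print(next(gen))  # resumes, prints 12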

Now why is that powerful?

Let’s explore some use-cases.

#1 Pre-processing a work item

Say you have a pipeline of work items where there is a pre-processing step. Often you would combine the pre-processing with the actual processing. But actually, it will make your code more readable and maintainable if you divide it up.

Explore the example.

def pre_process_items():
    for row in open('data.txt'):
        row = row.strip()
        freq = {c: row.count(c) for c in set(row)}
        yield freq

freq = {}
for item in pre_process_items():
    for k, v in item.items():
        freq[k] = freq.get(k, 0) + v

In this case you prepare the work item in pre_process_items().

If you want to learn about the Dict Comprehension read this guide.

This way you divide your code into a piece that prepares data and another one where you process the data. This makes the code easier to understand.

#2 Filtering work items

Often you have a list of possible work items that could be processed, but only a few of them actually need to be processed.

A simple example is processing a Log-file, where we are only interested in a specific log-level.

def get_warnings(log_file):
    for row in open(log_file):
        if 'WARNING' in row:
            yield row

for warning in get_warnings('log_file.txt'):
    print(warning)

This example shows how generators simplify filtering.

If you want to learn more about text processing in Python read this guide.

#3 API calls

A great use-case is if you need to make an API call. This might require setup, filtering of the result, and possibly reformatting.

import pandas_datareader as pdr
from datetime import datetime, timedelta

def get_stocks(tickers):
    d = datetime.now() - timedelta(days=7)
    for ticker in tickers:
        data = pdr.get_data_yahoo(ticker, d)
        close_price = list(data['Close'])
        yield close_price

for prices in get_stocks(['AAPL', 'TWTR']):
    print(prices)

The advantage of this is that it will only make the call to the API when you need the data (lazy loading). Say you have a list of thousands of tickers. If you had to make all the calls before you could start processing, the waiting time could be long.

With Generators you can utilize the power of lazy-loading.

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

15 String Methods That Changed the Way I Work With Text in Python

What will you learn?

It is (almost) impossible not to work with strings and text in Python at some point. In this guide you will learn 15 ways of working with strings that will help you be more efficient.

Make sure to check the last ones, they are awesome and I use them all the time.

#1 replace()

This is one of my favorites. How often do you need to change one substring to another? It can be as simple as changing one character to another.

input_str = 'this is my slug'
slug = input_str.replace(' ', '_')

This will give ‘this_is_my_slug’ in slug.

#2 split()

You have a string and want to split it into words. This can be done with split().

input_str = 'this is my awesome string'
words = input_str.split()

Then words will contain the list [‘this’, ‘is’, ‘my’, ‘awesome’, ‘string’].

Notice you can set the separator as you wish, say to a comma, as sketched below. Then it will separate on commas.
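
A small sketch with a comma as separator:

csv_str = 'foo,bar,foobar'
parts = csv_str.split(',')

Then parts will contain ['foo', 'bar', 'foobar'].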

#3 join()

Another favorite. You split something, then you want to join it again, right?

input_str = 'this is my awesome string'
my_list = input_str.split()

output_str = '_'.join(my_list)

Then you have output_str to be ‘this_is_my_awesome_string’.

#4 in

In what?

Yes, you want to check if something is a substring of another.

input_str = 'this is my awesome string'

if 'awesome' in input_str:
    print('awesome')

if 'dull' in input_str:
    print('dull')

This will only print awesome.

#5 strip()

Often when you work with strings they will contain spaces in front and at the end as well as a new line.

input_str = '   I love this    '

output_str = input_str.strip()

This will give ‘I love this’ in output_str.

#6 isdigit()

Want to check if a string is a digit value?

input_str = '313'
if input_str.isdigit():
    print(f'{input_str} is digit')

input_str = '313a'
if input_str.isdigit():
    print(f'{input_str} is digit')

This will only be True for the top one.

Read this guide to learn some awesome f-strings (used above).

#7 Concatenate strings

This is such a great feature of Python. Yes, you can simply concatenate strings with a plus sign.

str_a = 'This is Cool'
str_b = ' and Amazing'
output_str = str_a + str_b

This output_str will be ‘This is Cool and Amazing’.

If you notice, this could be done with join as well.

output_str = ''.join((str_a, str_b))

This will give the same result.

#8 Formatted strings

Formatted strings are amazing and make it easy to output variables. A formatted string will output the string representation of a variable between curly brackets. The formatted string starts with an f.

var_a = 27
var_b = 'Icecream'
print(f'Bought {var_a} items of {var_b}')

This will output ‘Bought 27 items of Icecream’.

#9 lower()

Want a string in lower case?

input_str = 'Hi My Best Friend'

print(input_str.lower())

This will give ‘hi my best friend’.

#10 startswith()

When processing strings, you might want to check if it starts with a specific part.

url = 'https://www.foobar.com/'

if url.startswith('https://'):
    print('HTTPS')

This will obviously print HTTPS for url.

#11 endswith()

Almost similar.

url = 'https://www.foobar.com/'

if url.endswith('/'):
    print('Slash')

It will print Slash. Not from Guns’n’Roses – no not him.

#12 count()

How many occurrences of a substring?

print('foo bar foobar barfoo'.count('foo'))

This will print 3.

#13 splitlines()

This one is amazing when you read a full text with new lines.

text = '''This is my text
and it is long
over multiple lines'''

lines = text.splitlines()

This will have lines [‘This is my text’, ‘and it is long’, ‘over multiple lines’].

#14 Check if string only contains certain characters

I use this all the time. You need to check if a string only contains specific characters, and these characters are not standard or are very specific to your use case.

Here is how you can solve it.

def valid_characters(string: str):
    legal_characters = 'abcdefghijklmnopqrstuvwxyz0123456789._-'
    if set(string) <= set(legal_characters):
        return True
    else:
        return False

print(valid_characters('FooBar'))
print(valid_characters('foobar'))
print(valid_characters('foo_bar'))

This will print False, True, True.

#15 Replace and remove last item in itemized string

Am I the only one who uses this all the time?

It could be a comma-separated string, and you need to remove the last item of it.

def convert(string: str):
    return ','.join(string.split(',')[:-1])

print(convert('this,is,my,test,remove_me'))

This will print ‘this,is,my,test’.

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

21 Built-in Python Functions That Increase Your Productivity

What will you learn?

Built-in functions help programmers do complex things they do all the time with one function. Here are the ones you need to know – and some of them I am sure you didn’t know.

#1 zip()

This is a favorite. If you have multiple lists with connected elements by position, then you can iterate over them as follows.

words = ['as', 'you', 'wish']
counts = [3, 2, 5]

for word, count in zip(words, counts):
    print(word, count)

Which gives.

as 3
you 2
wish 5

Also see how this can be done using List Comprehension.

#2 type()

I don’t really understand how little known this function is. It can help you a lot to understand your code by giving the types of your variables.

words = ['as', 'you', 'wish']
print(type(words))

a = 3.4
print(type(a))

This will give.

<class 'list'>
<class 'float'>

#3 sum()

Often you need to get the sum of an iterable like a list.

my_list = [43, 35, 2, 78, 23, 45, 56]

print(sum(my_list))

Printing 282 in this case.

#4 set()

If you have a list of elements and you only need to iterate over the unique elements, then set is a great built-in function.

my_list = [3, 2, 6, 3 ,6 ,3 ,5 ,2 ,5, 4, 6, 2, 8, 4, 3]

for item in set(my_list):
    print(item)

This will print the items 2, 3, 4, 5, 6, 8.

But only once each.

#5 list()

If you have an iterable and want it as a list.

my_list = [3, 2, 6, 3 ,6 ,3 ,5 ,2 ,5, 4, 6, 2, 8, 4, 3]

unique_items = list(set(my_list))

Then unique_items will be a list with the elements [2, 3, 4, 5, 6, 8].

#6 sorted()

If you have a list but want a sorted copy of it.

my_list = [3, 2, 6, 3 ,6 ,3 ,5 ,2 ,5, 4, 6, 2, 8, 4, 3]

sorted_list = sorted(my_list)

Which will be [2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 8].

#7 range()

I love this one and use it all the time. It will give you all the numbers in a range. I often use it with for-loops.

for i in range(10):
    print(i)

It will print the numbers 0, 1, …, 9.

You can also generate a list as follows.

my_list = list(range(10))

#8 round()

When you calculate with floats you often get a lot of digits that you actually don’t need. Then round is a great built-in function.

pi = 3.1415

print(round(pi, 2))

This will give you 3.14 only.

#9+10 min() and max()

Gives the minimum and maximum value of a list.

my_list = [3, 2, 6, 3, 6, 3, 5, 2, 5, 4, 6, 2, 8, 4, 3]

print(min(my_list))
print(max(my_list))

Which will print 2 and 8.

#11 map()

If you want to apply a function to all the elements in an iterable like a list.

def my_func(x):
    return 'x'*x

my_list = list(map(my_func, [1, 2, 3, 4]))

Here we also convert to a list with the elements from the map function [‘x’, ‘xx’, ‘xxx’, ‘xxxx’].

#12 isinstance()

Very useful to check the type of a variable.

Here we combine List Comprehension to filter all items of type str (string).

this_list = ['foo', 3, 4.14, 'bar', 'foobar']

my_list = [item for item in this_list if isinstance(item, str)]

Then my_list will contain [‘foo’, ‘bar’, ‘foobar’].

Also check the 7 List Comprehensions to improve your code.

#13 help()

Wow! Did you know that? You can get help.

help(isinstance)

This will give you.

Help on built-in function isinstance in module builtins:

isinstance(obj, class_or_tuple, /)
    Return whether an object is an instance of a class or of a subclass thereof.
    
    A tuple, as in ``isinstance(x, (A, B, ...))``, may be given as the target to
    check against. This is equivalent to ``isinstance(x, A) or isinstance(x, B)
    or ...`` etc.

#14-17 int(), str(), float(), and bool()

These are so useful and easy to use. They basically convert a variable to the given type.

a = '13'
# converts to an integer (here the string a)
b = int(a)

# converts to a float (here the string a)
c = float(a)

# converts to a string (here the float c)
d = str(c)

# converts to a bool - an empty list will be False
e = bool([])

# A list with items will be True
f = bool([2, 3, 4])

#18-19 any() and all()

Given a list of items that can be True or False, you may want to check if any element is True or if all elements are True.

a_list = [False, False, True, False, False, False]

if any(a_list):
    print('at least one item is True')

if all(a_list):
    print('all elements are True in the list')

#20 abs()

If you work with numbers and values, then you often need to take the absolute value. There is a built-in function that does that for you in Python.

Love to Python.

a = -12
b = abs(a) # will be 12

c = 12
d = abs(c)  # will be 12

#21 input()

Sometimes you need to interact with the user of your Python program from the command line. I try to avoid it, but sometimes it actually makes sense. Here Python has made it simple to achieve.

user_input = input('How do you feel?')

print(f'You feel {user_input}')

That is cool.

Check out this guide on f-strings (used above) if you want to learn about them.

Final thoughts

The Python built-in functions are actually amazing. I used to program in C, then Java, and some other languages. But the built-in functions in Python just make a lot of simple tasks that are difficult to make in other languages easy to achieve.

Thank you Python.

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

7 Useful List Comprehensions You Didn’t Think Of

What will you learn?

Once you understand list comprehensions, they actually improve your code readability – still, I often advise commenting what the list comprehension does.

First I will show you what List Comprehension is and how the basic case works, including with an enclosed if-statement. Then I go through the 7 use cases you didn't think of, plus some final thoughts and an alternative.

What is List Comprehension?

A list comprehension in Python, for example my_list = [i * i for i in range(10)], includes three elements.

  1. Expression The member itself, a call to a method, or any other valid expression that returns a value. In the example above, the expression i * i is the square of the member value.
  2. Member The object or value in the list or iterable. In the example above, the member value is i.
  3. Iterable A list, set, sequence, generator, or any other object that can return its elements one at a time. In the example above, the iterable is range(10).

I like to show some examples to explain it better. A List Comprehension is of the following form.

my_list = [do_this(element) for element in this_list]

Instead of this.

my_list = []
for element in this_list:
    my_list.append(do_this(element))

A List Comprehension with if-statement.

my_list = [do_this(element) for element in this_list if this_is_true(element)]

Instead of this.

my_list = []
for element in this_list:
    if this_is_true(element):
        my_list.append(element)

Now let’s go through the 7 use cases you didn’t think of.

#1 List Comprehension for Filtering

Say you want to filter all temperatures between 30 and 34 degrees (both excluded here).

temperatures = [12, 32, 34, 36, 34, 12, 32]

filtered_temps = [t for t in temperatures if 34 > t > 30]

This will give the items [32, 32] in filtered_temps.

Another example would be to find all the strings that are digits.

alphanumeric = ["47", "abcd", "21st", "n0w4y", "test", "55123"]

filtered_alphanumeric = [int(string) for string in alphanumeric if string.isdigit()]

This will give the items [47, 55123]. Notice that we also convert them to integers.

#2 Combining Lists

If you have two lists and you want to combine all combinations from each list.

colors = ["red", "blue", "black"]
models = ["12", "12 mini", "12 Pro"]

combined = [(model, color) for model in models for color in colors]

This will give the following list in combined.

[('12', 'red'),
 ('12', 'blue'),
 ('12', 'black'),
 ('12 mini', 'red'),
 ('12 mini', 'blue'),
 ('12 mini', 'black'),
 ('12 Pro', 'red'),
 ('12 Pro', 'blue'),
 ('12 Pro', 'black')]

A list of tuples of all combinations. If you want to become better at working with strings in Python check this guide.

#3 Finding common elements

Imagine you have two lists and you want to find the elements which are in both lists.

students_a = ["Anna", "Elsa", "Tanja", "Freja", "Frigg"]
students_b = ["Ranja", "Natascha", "Anna", "Tanja"]

common = [student for student in students_a if student in students_b]

This will give you the following items in common.

['Anna', 'Tanja']

#4 Combining Elements with the Same Position

Imagine you have multiple lists with elements that are connected by position.

names = ["John", "Mary", "Lea"]
surnames = ["Smith", "Wonder", "Singer"]
ages = ["22", "19", "25"]

combined = [F"{name} {surname} - {age}" for name, surname, age in zip(names, surnames, ages)]

Then combined will be as follows.

['John Smith - 22', 'Mary Wonder - 19', 'Lea Singer - 25']

Also check out how zip can be used and other built-in functions in Python.

#5 Convert Values

Say you have a list of elements that all need to be converted. Using a function for the conversion can be convenient and also make the transformation of the list easy with List Comprehension.

def convert_to_dol(eur):
    return round(eur * 1.19, 2)


prices = [22.30, 12.00, 0.99, 1.10]
dollar_prices = [convert_to_dol(price) for price in prices]

This will give the following values in dollar_prices.

[26.54, 14.28, 1.18, 1.31]

#6 Frequency Count

I love this one. It is done with a Dict Comprehension, but it is useful in many cases.

string = 'this is my string of letters that we will count'

freq = {c: string.count(c) for c in set(string)}

This will give the following dictionary in freq.

{'w': 2, 'n': 2, 'u': 1, 'e': 3, 't': 7, 'r': 2, 'h': 2, 'o': 2, 'm': 1, 'f': 1, 'i': 4, 's': 4, 'y': 1, 'l': 3, ' ': 9, 'a': 1, 'c': 1, 'g': 1}

#7 Generators

Generators are a great tool to master and can be combined with List Comprehension.

def return_next():
    for i in range(10):
        yield i
        
my_list = [i for i in return_next()]

This is a simple example, but its power should not be underestimated.

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Final thoughts on List Comprehension

List Comprehension is one of the most popular paradigms in Python. That said, you should always keep readability in mind. If you create long and non-intuitive List Comprehensions, maybe you should construct them in another way. Your goal is to create easy-to-understand code, not complex code.

If you want to see a great use case of List Comprehension – then check out how to make a Word Cloud.

Want to learn more?

If this is something you like and you want to get started with Python, then check my 8-hour FREE video course with full explanations, projects on each level, and guided solutions.

The course is structured with the following resources to improve your learning experience.

  • 17 video lessons teaching you everything you need to know to get started with Python.
  • 34 Jupyter Notebooks with lesson code and projects.
  • A FREE 70+ page eBook with all the learnings from the lessons.

See the full FREE course page here.

The Ultimate Data Science Workflow Template

What will you learn?

Data Science Workflow

When it comes to creating a good Data Science Project you will need to ensure you cover a great deal of aspects. This template will show you what to cover and where to find more information on a specific topic.

The common pitfall for most junior Data Scientists is to focus on the very technical part of the Data Science Workflow. To add real value to the clients, you need to focus on more of the steps, which are often neglected.

This guide will walk you through all the steps, elaborate, and link to in-depth content if you need more explanation.

Step 1: Acquire

  • Explore problem
  • Identify data
  • Import data

Step 1.a: Define Problem

If you are making a hobby project, there might not be a definition of what you are trying to solve. But it is always good practice to start with one. Otherwise, you will most likely just do what you usually do and feel comfortable with. Try to sit down and figure it out.

It should be clear that this step comes before you have the data. That said, it often happens that a company has data and doesn't know what to use it for.

Still, it all starts by defining a problem.

Here are some guidelines.

  • When defining a problem, don’t be too ambitious
    • Examples:
      • A green energy windmill producer needs to optimize distribution and needs better predictions of production based on weather forecasts
      • An online news media is interested in a story with how CO2 per capita around the world has evolved over the years
    • Both projects are difficult
      • For the windmill we would need data on production, maintenance periods, detailed weather data, just to get started.
      • The data for CO2 per capita is available on World Bank, but creating a visual story is difficult with our current capabilities
  • Hence, make a better research problem
    • You can start by considering a dataset and get inspiration
    • Examples of datasets
    • Example of Problem
      • What is the highest rated movie genre?

Data Science: Understanding the Problem

  • Get the right question:
    • What is the problem we try to solve?
    • This forms the Data Science problem
    • Examples
      • Sales figure and call center logs: evaluate a new product
      • Sensor data from multiple sensors: detect equipment failure
      • Customer data + marketing data: better targeted marketing
  • Assess situation
    • Risks, Benefits, Contingencies, Regulations, Resources, Requirement
  • Define goal
    • What is the objective?
    • What is the success criteria?
  • Conclusion
    • Defining the problem is key to successful Data Science projects

Step 1.b: Import libraries

When you work on a project, you need to have the data somewhere. A great place to start is by using pandas.

If you work in a Jupyter notebook, you can run this in a cell to get started and follow this guide.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Step 1.c: Identify the Data

Great Places to Find Data

Step 1.d: Import Data

Read CSV files (Learn more here)
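
A minimal sketch (assuming a CSV file files/aapl.csv with a Date column, mirroring the Excel example below):

data = pd.read_csv('files/aapl.csv', index_col='Date')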

Excel files  (Learn more here)

  • Most widely used spreadsheet format
  • Learn more about Excel processing in this lecture
  • read_excel() Read an Excel file into a pandas DataFrame.
    data = pd.read_excel('files/aapl.xlsx', index_col='Date')

Parquet files  (Learn more here)

  • Parquet is a free open source format
  • Compressed format
  • read_parquet() Load a parquet object from the file path, returning a DataFrame.
    data = pd.read_parquet('files/aapl.parquet')

Web Scraping (Learn more here)

  • Extracting data from websites
  • Legal issues: wikipedia.org
  • read_html() Read HTML tables into a list of DataFrame objects.
    url = 'https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics'
    data = pd.read_html(url)

Databases (Learn more here)

  • read_sql() Read SQL query or database table into a DataFrame.
  • The sqlite3 module is an interface for SQLite databases.
    import sqlite3
    import pandas as pd
    conn = sqlite3.connect('files/dallas-ois.sqlite')
    data = pd.read_sql('SELECT * FROM officers', conn)

Step 1.e: Combine data

Also see guide here.

  • Often you need to combine data from different sources

pandas DataFrames

  • pandas DataFrames can combine data (pandas cheat sheet)
  • concat() Concatenate pandas objects along a particular axis.
    pd.concat([df1, df2], axis=0)
  • join() Join columns of another DataFrame.
    df.join(other.set_index('key'), on='key')
  • merge() Merge DataFrame or named Series objects with a database-style join.
    df1.merge(df2, how='inner', on='a')

Step 2: Prepare

  • Explore data
  • Visualize ideas
  • Cleaning data

Step 2.a: Explore data

  • head() Return the first n rows.
  • .shape Return a tuple representing the dimensionality of the DataFrame.
  • .dtypes Return the dtypes in the DataFrame.
  • info() Print a concise summary of a DataFrame.
  • describe() Generate descriptive statistics.
  • isna().any() Return whether any element is missing, per column (see the sketch below).
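A minimal sketch applying the methods above, assuming a DataFrame data loaded as in Step 1.d:

# First rows, dimensionality and column types
print(data.head())
print(data.shape)
print(data.dtypes)

# Concise summary and descriptive statistics
data.info()
print(data.describe())

# True for each column that contains missing values
print(data.isna().any())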

Step 2.b: Groupby, Counts and Statistics

Read the guide on statistics here.

  • Count groups to see the significance across results

data.groupby('Gender').count()

  • Return the mean of the values over the requested axis.

data.groupby('Gender').mean()
  • Standard Deviation
    • Standard deviation is a measure of how dispersed (spread) the data is in relation to the mean.
    • Low standard deviation means data is close to the mean.
    • High standard deviation means data is spread out.
  • data.groupby('Gender').std()
  • Box plots
    • Box plots are a great way to visualize descriptive statistics
    • Notice that Q1: 25%, Q2: 50%, Q3: 75%
  • Make a box plot of the DataFrame columns with plot.box()

data.boxplot()

Step 2.c: Visualize data

Read the guide on visualization for data science here.

Simple Plot

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot()
  • Adding title and labels
    • title='Title' adds the title
    • xlabel='X label' adds or changes the X-label
    • ylabel='Y label' adds or changes the Y-label

data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)')

  • Adding ranges
    • xlim=(min, max) or xlim=min Sets the x-axis range
    • ylim=(min, max) or ylim=min Sets the y-axis range

data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)', ylim=0)

  • Comparing data

data[['USA', 'WLD']].plot(ylim=0)

Scatter Plot

  • Good to see any connection

data = pd.read_csv('files/sample_corr.csv')
data.plot.scatter(x='x', y='y')

Histogram

  • Identifying quality

data = pd.read_csv('files/sample_height.csv')
data.plot.hist()

  • Identifying outliers

data = pd.read_csv('files/sample_age.csv')
data.plot.hist()

  • Setting bins and figsize

data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot.hist(figsize=(20,6), bins=10)

Bar Plot

  • Normal plot

data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot.bar()

  • Range and columns, figsize and label

data[['USA', 'DNK']].loc[2000:].plot.bar(figsize=(20,6), ylabel='CO2 emission per capita')

Pie Chart

  • Presenting

df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()

  • Value counts in Pie Charts
    • colors=<list of colors>
    • labels=<list of labels>
    • title='<title>'
    • ylabel='<label>'
    • autopct='%1.1f%%' sets percentages on the chart

(data['USA'] < 17.5).value_counts().plot.pie(colors=['r', 'g'], labels=['>= 17.5', '< 17.5'], title='CO2', autopct='%1.1f%%')

Step 2.d: Clean data

Read the data cleaning guide here.

  • dropna() Remove missing values.
  • fillna() Fill NA/NaN values using the specified method.
    • Example: Fill missing values with the mean.

data = data.fillna(data.mean())
  • drop_duplicates() Return DataFrame with duplicate rows removed.
  • Working with time series
    • reindex() Conform Series/DataFrame to new index with optional filling logic.
    • interpolate() Fill NaN values using an interpolation method.
  • Resources
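Putting the cleaning methods above together, a minimal sketch, assuming a DataFrame data with numeric columns only:

# Remove rows with missing values (returns a new DataFrame)
cleaned = data.dropna()

# ...or fill missing values with the column means instead
filled = data.fillna(data.mean())

# Remove duplicate rows
deduplicated = data.drop_duplicates()

# Time series: fill NaN values by interpolating between known values
interpolated = data.interpolate()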

Step 3: Analyze

  • Feature selection
  • Model selection
  • Analyze data

Step 3.a: Split into Train and Test

For an introduction to Machine Learning read this guide.

  • Assign the independent features (the predictors) to X
  • Assign the classes (labels/dependent feature) to y
  • Divide into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3.b: Feature Scaling

Learn about Feature Scaling in this guide.

  • Feature Scaling transforms values into a similar range so machine learning algorithms behave optimally.
  • Without Feature Scaling, Machine Learning algorithms can struggle when features span different magnitudes.
  • Feature Scaling can also make it easier to compare results.

Feature Scaling Techniques

  • Normalization is a special case of MinMaxScaler
    • Normalization: Converts values to the range 0-1: (values - values.min())/(values.max() - values.min())
    • MinMaxScaler: Between any values
  • Standardization (StandardScaler from sklearn)
    • Mean: 0, StdDev: 1: (values - values.mean())/values.std()
    • Less sensitive to outliers (see the sketch after this list)
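A minimal sketch of the two formulas on a made-up pandas Series (note how the outlier 100.0 squeezes the normalized values):

import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Normalization: scale the values into the range 0-1
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: mean 0, standard deviation 1
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized)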

Normalization

  • MinMaxScaler Transform features by scaling each feature to a given range.
  • MinMaxScaler().fit(X_train) is used to create a scaler.
    • Notice: We only fit the scaler on the training data

from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

Standardization

  • StandardScaler Standardize features by removing the mean and scaling to unit variance.

from sklearn.preprocessing import StandardScaler

scale = StandardScaler().fit(X_train)
X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)

Step 3.c: Feature Selection

Learn about Feature Selection in this guide.

  • Feature selection is about selecting attributes that have the greatest impact towards the problem you are solving.

Why Feature Selection?

  • Higher accuracy
  • Simpler models
  • Reducing overfitting risk

Feature Selection Techniques

Filter methods
  • Independent of Model
  • Based on statistical scores
  • Easy to understand
  • Good for early feature removal
  • Low computational requirements
Examples
Wrapper methods
  • Compare different subsets of features and run the model on them
  • Basically a search problem
Examples

See more on wikipedia

Embedded methods
  • Find features that contribute most to the accuracy of the model while it is created
  • Regularization is the most common method – it penalizes higher complexity
Examples

Remove constant and quasi constant features

  • VarianceThreshold Feature selector that removes all low-variance features.

from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold()
sel.fit_transform(data)

Remove correlated features

  • The goal is to find and remove correlated features
  • Calculate the correlation matrix (assign it to corr_matrix)
  • A feature is correlated to any previous feature if the following is true
    • Notice that we use a correlation threshold of 0.8

corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
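Putting it together, a minimal sketch, assuming data only holds numeric columns; taking the absolute correlation is an assumption here, so strong negative correlation is caught as well:

# Correlation matrix with absolute values
corr_matrix = data.corr().abs()

# A feature is flagged if it correlates > 0.8 with any earlier feature
corr_features = [feature for feature in corr_matrix.columns
                 if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

# Drop the flagged features
data = data.drop(corr_features, axis=1)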

Step 3.d: Model Selection

Learn about Model Selection in this guide.

  • The process of selecting the model among a collection of candidate machine learning models

Problem type

  • What kind of problem are you looking into?
    • Classification: Predict labels on data with predefined classes
      • Supervised Machine Learning
    • Clustering: Identify similarities between objects and group them in clusters
      • Unsupervised Machine Learning
    • Regression: Predict continuous values
      • Supervised Machine Learning
  • Resource: Sklearn cheat sheet

Model Selection Techniques

  • Probabilistic Measures: Scoring by performance and complexity of model.
  • Resampling Methods: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.
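As an illustration of a Resampling Method, a minimal sketch of 5-fold cross-validation, assuming X and y from Step 3.a (the LinearRegression model is an arbitrary choice):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Train and score the model on 5 different train/test splits
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5)

# The mean score across the repeated runs
print(scores.mean())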

A few models

  • LinearRegression Ordinary least squares Linear Regression (Lesson 08).

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lin = LinearRegression()
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
r2_score(y_test, y_pred)

  • SVC C-Support Vector Classification (Lesson 10).

from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score

svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)

  • KNeighborsClassifier Classifier implementing the k-nearest neighbors vote (Lesson 10).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

neigh = KNeighborsClassifier()
neigh.fit(X_train.fillna(-1), y_train)
y_pred = neigh.predict(X_test.fillna(-1))
accuracy_score(y_test, y_pred)

Step 3.e: Analyze Result

This is the main check-point of your analysis.

  • Review the Problem and Data Science problem you started with.
    • The analysis should add value to the Data Science Problem
    • Sometimes our focus drifts – we need to ensure alignment with original Problem.
    • Go back to the Exploration of the Problem – does the result add value to the Data Science Problem and the initial Problem (which formed the Data Science Problem)?
    • Example: As Data Scientist we often find the research itself valuable, but a business is often interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
  • Did we learn anything?
    • Does the Data-Driven Insights add value?
    • Example: Does it add value to have evidence for: Wealthy people buy more expensive cars.
      • It might be valuable for you to confirm this hypothesis, but does it add any value for the car manufacturer?
  • Can we make any valuable insights from our analysis?
    • Do we need more/better/different data?
    • Can we give any Actionable Data Driven Insights?
    • It is always easy to want better and more accurate high quality data.
  • Do we have the right features?
    • Do we need to eliminate features?
    • Is the data cleaning appropriate?
    • Is data quality as expected?
  • Do we need to try different models?
    • Data Analysis is an iterative process
    • Simpler models are often more powerful
  • Can the result be inconclusive?
    • Can we still give recommendations?

Quote

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

  • Sherlock Holmes

Iterative Research Process

  • Observation/Question: Starting point (could be iterative)
  • Hypothesis/Claim/Assumption: Something we believe could be true
  • Test/Data collection: We need to gather relevant data
  • Analyze/Evidence: Based on data collection did we get evidence?
    • Can our model predict? (a model is only useful once it can predict)
  • Conclude
    • Warning! We can conclude a correlation, but this does not mean A causes B
    • Example: Based on the collected data we can see a correlation between A and B

Step 4: Report

  • Present findings
  • Visualize results
  • Credibility counts

Step 4.a: Present Findings

  • You need to sell or tell a story with the findings.
  • Who is your audience?
    • Focus on technical level and interest of your audience
    • Speak their language
    • Story should make sense to audience
    • Examples
      • Team manager: Might be technical, but often busy and only interested in high-level status and key findings.
      • Data engineer/science team: Technical exploration and similar interest as you
      • Business stakeholders: This might be end-customers or collaboration in other business units.
  • When presenting
    • Goal: Communicate actionable insights to key stakeholders
    • Outline (inspiration):
      • TL;DR (Too long; didn't read) – a clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
      • Start with your understanding of the business problem
      • How does it transform into a Data Science Problem
      • How you will measure impact – which business metrics are indicators of results
      • What data is available and used
      • Presenting the research hypothesis
      • A visual presentation of the insights (model/analysis/key findings)
        • This is where you present the evidence for the insights
      • How to use insight and create actions
      • Follow-up and continuous learning to increase value

Step 4.b: Visualize Results

  • Telling a story with the data
  • This is where you convince that the findings/insights are correct
  • The right visualization is important
    • Example: A correlation matrix might give a Data Engineer insights into how findings were discovered, but confuse business partners.

Resources for visualization

  • Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Plotly An open-source library for building analytic apps in Python
  • Folium makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.

Step 4.c: Credibility Counts

  • This is the checkpoint to verify that your research is valid
    • Are you hiding findings you did not like (not supporting your hypothesis)?
    • Remember it is the long-term relationship that counts
  • Don’t leave out results
    • We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective

Step 5: Actions

  • Use insights
  • Measure impact
  • Main goal

Step 5.a: Use Insights

  • How do we follow up on the presented Insights?
  • No one-size-fits-all: It depends on the Insights and Problem
  • Examples:
    1. Problem: What customers are most likely to cancel subscription?
      • Say, we have insufficient knowledge of customers, and need to get more, hence we have given recommendations to gather more insights
      • But you should still try to add value
    2. Problem: Here is our data – find valuable insights!
      • This is a challenge as there is no given focus
      • An iterative process involving the customer can leave you with no surprises

Step 5.b: Measure Impact

  • If the customer cannot measure the impact of your work, they do not know what they are paying for.
    • If you cannot measure it, you cannot know whether your hypotheses are correct.
    • A model only becomes valuable when it can be used to predict with some certainty
  • The report should identify metrics/indicators to evaluate
  • This can evolve – we learn along the way – or we could be wrong.
  • How long before we expect to see impact on identified business metrics?
  • What if we do not see expected impact?
  • Understanding of metrics
    • The metrics we measure are indicators that our hypothesis is correct
    • Other aspects can have impact on the result – but you need to identify that

Main Goal

  • Your success as a Data Scientist lies in creating valuable, actionable insights

A great way to think

  • Any business/organisation can be thought of as a complex system
    • Nobody understands it perfectly and it evolves organically
  • Data describes some aspect of it
  • It can be thought of as a black-box
  • Any insight you can bring is like a window that sheds light on what happens inside

Want to learn more?

Want to learn more about Data Science to become a successful Data Scientist?

This is one lesson of a 15 part Expert Data Science Blueprint course with the following resources.

  • 15 video lessons – cover the Data Science Workflow and concepts, demonstrate everything on real data, introduce projects and show a solution (YouTube video).
  • 30 Jupyter Notebooks – with the full code and explanations from the lectures and projects (GitHub).
  • 15 projects – structured with the Data Science Workflow and a solution explained at the end of the video lessons (GitHub).