How to make a Formatted Word Cloud in 7 Steps

What will you learn?

At the end of this tutorial you will know how to make a formatted word cloud with Python like this one.

Step 1: Read content

The first thing you need is some content to compute word frequencies on.

In this example we will use the books of Sherlock Holmes – which are available in my GitHub here.

You can clone the repo or just download the full repository as a zip file from the green Code dropdown menu. Once it is unpacked, you should see a folder named holmes with all the texts.

We will read them here.

import os

content = []
# Read every file in the holmes/ folder into one string per file.
for filename in os.listdir('holmes/'):
    with open(os.path.join('holmes', filename)) as f:
        content.append(f.read())

Of course you can have any other set of text files.

The result, content, is a list holding the full text of each file. Each entry is raw text, including the newlines.
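
If your texts live somewhere else, the same loop can be wrapped in a small helper. The following is a minimal sketch, not part of the original code: the function name read_texts, the .txt filter, and the UTF-8 encoding with errors="replace" are all assumptions that you may need to adapt to your own files.

```python
from pathlib import Path


def read_texts(folder):
    """Return the text of every .txt file in folder, in sorted file order.

    read_texts is a hypothetical helper; the .txt filter and the UTF-8
    encoding are assumptions, not something the holmes folder requires.
    """
    texts = []
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps reading even if a file contains stray bytes.
        texts.append(path.read_text(encoding="utf-8", errors="replace"))
    return texts
```

With this helper, content = read_texts("holmes") would give the same kind of list as the loop above, provided the files end in .txt.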

Step 2: Corpus in lower case

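Here we will use the tokenizer from NLTK to split the text into words. If NLTK raises a LookupError about a missing resource, download the resource it names with nltk.download().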

import nltk
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

This creates a single list with every token in lower case. Punctuation is still included at this point; we remove it in Step 4.

We use a list comprehension. If you are new to that, check this tutorial.
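
To see what the list comprehension does, here is a small sketch that builds the same kind of lower-case list twice, once with an explicit loop and once with a comprehension. It uses str.split() as a stand-in for nltk.word_tokenize so that it runs without NLTK data, and the sample texts are made up.

```python
# Two made-up "files"; str.split() stands in for nltk.word_tokenize here.
sample_content = ["The Hound of the Baskervilles", "A Study in Scarlet"]

# Version 1: explicit nested loop.
corpus_loop = []
for item in sample_content:
    for word in item.split():
        corpus_loop.append(word.lower())

# Version 2: the same result with one list comprehension per file.
corpus_comp = []
for item in sample_content:
    corpus_comp.extend([word.lower() for word in item.split()])

print(corpus_comp)
```

Both versions produce the same list; the comprehension is just the more compact way to write it.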

Step 3: Remove stop words

Stop words are common words that carry little or no meaning on their own, such as "the", "and" and "is". We do not want to include them in our word cloud, as they occur very often and would take up a lot of space.

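from nltk.corpus import stopwords
# Build the stop-word set once; calling stopwords.words() for every word is slow.
stop_words = set(stopwords.words('english'))
corpus = [w for w in corpus if w not in stop_words]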

Again we use a list comprehension.
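
The filtering itself is easy to try on a small scale. This sketch uses a tiny hand-written stop-word set, which is an assumption for illustration only; in the tutorial the list comes from stopwords.words('english').

```python
# A tiny hand-written stop-word set, only for illustration.
# The tutorial's real list comes from stopwords.words('english').
demo_stop_words = {"the", "of", "a", "in", "is"}

demo_corpus = ["the", "hound", "of", "the", "baskervilles", "is", "a", "novel"]

# Membership tests against a set are fast, which matters for a large corpus.
filtered = [w for w in demo_corpus if w not in demo_stop_words]
print(filtered)
```

Only the words that are not in the set survive, and their order is unchanged.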

Step 4: Keep alphanumeric words

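The tokenizer returns punctuation marks as separate tokens. We keep only the tokens that consist entirely of letters and digits, which can also be done with a list comprehension.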

corpus = [w for w in corpus if w.isalnum()]
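
It helps to check what str.isalnum() actually lets through. In the sketch below the token list is made up; note that hyphenated words and contraction fragments such as "n't" are dropped along with the punctuation.

```python
tokens = ["holmes", ",", "221b", "''", "well-known", "n't", "watson", "."]

# str.isalnum() is True only if every character is a letter or a digit
# and the string is not empty.
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print(kept)
print(dropped)
```

If you want to keep hyphenated words, you would need a looser test than isalnum().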

Step 5: Lemmatize words

To lemmatize a word is to reduce it to its dictionary form, the lemma. We do not want the same word to show up in the cloud in several forms, for example both "dog" and "dogs", so we keep only the base form.

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    # Tag the single word and take the first letter of its tag.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to noun for any other tag.
    return tag_dict.get(tag, wordnet.NOUN)

# Create the lemmatizer once instead of once per word.
lemmatizer = WordNetLemmatizer()
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]

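Again we use a list comprehension to achieve the result. Note that each word is tagged on its own, without its sentence, so the part-of-speech guess is only approximate.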
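
The mapping inside get_wordnet_pos can be looked at on its own. The sketch below takes part-of-speech tags directly instead of calling nltk.pos_tag, and uses the one-letter strings that wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV stand for, so it runs without NLTK. The function name to_wordnet_pos and the sample tags are made up for illustration.

```python
# One-letter WordNet POS codes; these are the values behind
# wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV in NLTK.
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

TAG_DICT = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}


def to_wordnet_pos(treebank_tag):
    """Map a tag such as 'VBD' to a WordNet POS code.

    to_wordnet_pos is a hypothetical helper; unknown tags fall back to
    noun, as in get_wordnet_pos.
    """
    return TAG_DICT.get(treebank_tag[0].upper(), NOUN)


for tag in ["JJ", "NNS", "VBD", "RB", "IN"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Only the first letter of the tag matters, which is why plural nouns (NNS) and past-tense verbs (VBD) end up in the same groups as their base tags.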

Step 6: Create a Word Cloud

First we create a simple word cloud.

from wordcloud import WordCloud

# WordCloud.generate() expects one string, so join the words with spaces.
unique_string = " ".join(corpus)
wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

This will create an image word_cloud.png similar to this one.
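
The size of each word in the image reflects how often it occurs in the text. To inspect such counts yourself, a minimal sketch with collections.Counter from the standard library is shown below. The sample corpus is made up, and the counting here is independent of the wordcloud package.

```python
from collections import Counter

# A made-up, already cleaned corpus standing in for the real one.
demo_corpus = ["holmes", "watson", "holmes", "case", "holmes", "watson"]

frequencies = Counter(demo_corpus)

# most_common(n) returns the n most frequent words with their counts.
top_two = frequencies.most_common(2)
print(top_two)
```

Running the same two lines on the real corpus is a quick way to check whether any unwanted words dominate before you render the image.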

Step 7: Create a formatted Word Cloud

To shape the word cloud we need a mask, an image whose outline the words will fill. We will use the cloud.png from the repository.

import numpy as np
from PIL import Image

unique_string_v2 = " ".join(corpus)
# Load the mask image as a NumPy array.
cloud_mask = np.array(Image.open("cloud.png"))
# When a mask is given, the mask's shape is used and width/height are ignored.
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

This will generate a picture like this one.
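
If you do not have cloud.png at hand, you can build a simple mask yourself. In the wordcloud package, pure white entries of the mask are treated as masked out and all other entries are free to draw on. The sketch below builds an elliptical mask with NumPy; the function name ellipse_mask and the sizes are assumptions for illustration.

```python
import numpy as np


def ellipse_mask(height, width):
    """Return a uint8 array: 0 inside an ellipse (drawable), 255 outside.

    ellipse_mask is a hypothetical helper, not part of wordcloud.
    """
    # Coordinate grids: y has shape (height, 1), x has shape (1, width).
    y, x = np.ogrid[:height, :width]
    cy, cx = (height - 1) / 2, (width - 1) / 2
    # Normalized squared distance from the centre; <= 1 means inside.
    inside = ((y - cy) / (height / 2)) ** 2 + ((x - cx) / (width / 2)) ** 2 <= 1
    mask = np.full((height, width), 255, dtype=np.uint8)
    mask[inside] = 0
    return mask


demo_mask = ellipse_mask(500, 1000)
print(demo_mask.shape, demo_mask[250, 500], demo_mask[0, 0])
```

You could then pass mask=demo_mask to WordCloud in place of cloud_mask to get an oval word cloud.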

Full code

You can get the full code from my GitHub repository.

If you clone it you get the full code as well as all the files you need.

import nltk
from nltk.corpus import stopwords
import os
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import numpy as np
from PIL import Image
nltk.download('wordnet')
nltk.download('omw-1.4')
content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])
stop_words = set(stopwords.words('english'))  # build the set once; per-word calls are slow
corpus = [w for w in corpus if w not in stop_words]
corpus = [w for w in corpus if w.isalnum()]

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()  # create once instead of once per word
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]

unique_string = " ".join(corpus)
wordcloud = WordCloud(width = 1000, height = 500).generate(unique_string)
wordcloud.to_file("word_cloud.png")
unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")
