
How to make a Formatted Word Cloud in 7 Steps

What will you learn?

By the end of this tutorial, you will know how to make a formatted word cloud with Python like the one shown here.

Step 1: Read content

The first thing you need is some content to compute word frequencies on.

In this example we will use the books of Sherlock Holmes, which are available in my GitHub repository.

You can clone the repo or download the full repository as a zip file from the green Code dropdown menu. You should then see a folder, holmes/, with all the Holmes texts.

We will read them here.

import os

content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

Of course you can have any other set of text files.

The result in content is a list of the full content of text of each file. Each file will be raw text with new lines.
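If you want to try the reading loop without downloading the Holmes texts first, here is a self-contained sketch that creates two small sample files in a temporary folder and reads them the same way. The file names and contents are made up for illustration.

```python
import os
import tempfile

# Create two sample text files in a temporary folder, then read
# them back with the same loop used for the holmes/ folder above.
with tempfile.TemporaryDirectory() as folder:
    for name, text in [("a.txt", "A Study in Scarlet"),
                       ("b.txt", "The Sign of Four")]:
        with open(os.path.join(folder, name), "w") as f:
            f.write(text)

    content = []
    for filename in sorted(os.listdir(folder)):
        with open(os.path.join(folder, filename)) as f:
            content.append(f.read())

print(content)  # one string per file
```

Note that os.listdir returns names in arbitrary order; sorting, as above, makes the result reproducible.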

Step 2: Corpus in lower case

Here we will use NLTK's word_tokenize to split the text into individual words. The tokenizer relies on the punkt data, which you can fetch with nltk.download.

import nltk
nltk.download('punkt')

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

This creates a list of each word in lower case.

We use list comprehension. If you are new to that, check this tutorial.
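To see what the tokenize-and-lowercase step produces, here is a minimal sketch on two made-up sentences, using str.split() as a simple stand-in for nltk.word_tokenize (the real tokenizer additionally separates punctuation into its own tokens):

```python
# Stand-in corpus: two short strings instead of full books.
content = ["Sherlock Holmes took the case", "It was an OLD manuscript"]

corpus = []
for item in content:
    # str.split() is a simplified stand-in for nltk.word_tokenize here
    corpus.extend([word.lower() for word in item.split()])

print(corpus)  # every word, in order, all lower case
```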

Step 3: Remove stop words

Stop words are common words that carry little or no meaning on their own. We do not want to include them in our word cloud, as they are frequent and would crowd out the interesting words.

from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
corpus = [w for w in corpus if w not in stop_words]

Again we use list comprehension.
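To illustrate the filter, here is a sketch with a tiny, hand-picked stop word set; the real list returned by stopwords.words('english') has around 180 entries and requires nltk.download('stopwords').

```python
# A hand-picked subset of English stop words, standing in for
# the full list returned by stopwords.words('english').
stop_words = {"it", "was", "an", "the", "of", "his"}

corpus = ["it", "was", "an", "old", "manuscript", "of", "his"]
corpus = [w for w in corpus if w not in stop_words]

print(corpus)  # only the content-bearing words survive
```

A small practical tip: keeping the stop words in a set, as above, makes the membership test much faster than scanning a list once per word in the corpus.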

Step 4: Keep alphanumeric words

This can also be done by list comprehension.

corpus = [w for w in corpus if w.isalnum()]
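str.isalnum() is True only when every character in the token is a letter or a digit, so this step drops punctuation tokens, and also contraction fragments like "n't" that the tokenizer produces. A small made-up example:

```python
# Made-up token list mixing words, numbers and punctuation.
tokens = ["holmes", ",", "221b", "n't", "!", "watson"]

tokens = [w for w in tokens if w.isalnum()]

print(tokens)  # punctuation and "n't" are gone
```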

Step 5: Lemmatize words

To lemmatize a word is to reduce it to its root (dictionary) form. We do not want the same word counted in several inflected forms, for example 'runs' and 'running' alongside 'run', so we map each word back to its base form. That is what lemmatizing does.

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]

Again we use list comprehension to achieve the result.
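The get_wordnet_pos helper maps a Penn Treebank tag (what nltk.pos_tag returns, e.g. 'VBD' or 'NNS') to the single-letter tag lemmatize() expects. Here is a stand-alone sketch of that mapping, using the raw letter codes 'a', 'n', 'v', 'r' in place of the wordnet constants and hard-coded tags in place of nltk.pos_tag:

```python
# Raw letter codes corresponding to wordnet.ADJ, NOUN, VERB, ADV.
tag_dict = {"J": "a", "N": "n", "V": "v", "R": "r"}

def to_wordnet_pos(treebank_tag):
    # Keep only the first letter of the Treebank tag; anything
    # unrecognised falls back to noun, as in get_wordnet_pos above.
    return tag_dict.get(treebank_tag[0].upper(), "n")

print(to_wordnet_pos("VBD"))  # verb
print(to_wordnet_pos("NNS"))  # noun
print(to_wordnet_pos("CD"))   # unknown tag, falls back to noun
```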

Step 6: Create a Word Cloud

First we create a simple word cloud.

from wordcloud import WordCloud

unique_string = " ".join(corpus)

wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

This will create an image, word_cloud.png, with the most frequent words drawn largest.
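WordCloud sizes each word by how often it occurs in the text. Before rendering, you can inspect that ranking yourself with collections.Counter; a minimal sketch on a made-up corpus:

```python
from collections import Counter

# Made-up corpus standing in for the real lemmatized word list.
corpus = ["holmes", "watson", "holmes", "case", "holmes", "watson"]

top = Counter(corpus).most_common(2)
print(top)  # [('holmes', 3), ('watson', 2)]
```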

Step 7: Create a formatted Word Cloud

To do that we need a mask. We will use the cloud.png from the repository.

import numpy as np
from PIL import Image

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")

This will generate a picture where the words fill the shape of the cloud mask.

Full code

You can get the full code from my GitHub repository.

If you clone it you get the full code as well as all the files you need.

import nltk
from nltk.corpus import stopwords
import os
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import numpy as np
from PIL import Image

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())

corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])

stop_words = set(stopwords.words('english'))
corpus = [w for w in corpus if w not in stop_words]

corpus = [w for w in corpus if w.isalnum()]


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


lemmatizer = WordNetLemmatizer()
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]


unique_string = " ".join(corpus)

wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")

unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")
Rune
