At the end of this tutorial you will know how to make a formatted word cloud with Python like this one.
The first thing you need is some content to compute word frequencies on.
In this example we will use the books of Sherlock Holmes – which are available in my GitHub here.
You can clone the repo or download the full repository as a zip file from the green Code dropdown menu. Then you should see a folder with all the Holmes texts.
We will read them here.
import os
content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())
Of course you can have any other set of text files.
The result, content, is a list with the full text of each file, as raw text including newlines.
Here we use NLTK's word_tokenize to split each text into words. If you have not used it before, download the tokenizer data first with nltk.download('punkt').
import nltk
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])
This creates a list of each word in lower case.
We use a list comprehension here. If you are new to list comprehensions, check this tutorial.
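As a quick refresher, a list comprehension builds a new list from an iterable in a single expression; this is a minimal example, unrelated to the corpus:

```python
# Lower-case every word and keep only those longer than three characters.
words = ["The", "Hound", "of", "the", "Baskervilles"]
result = [w.lower() for w in words if len(w) > 3]
print(result)  # ['hound', 'baskervilles']
```

The filter clause after the for is optional; without it, every element is transformed and kept.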
Stop words are common words that carry little meaning on their own, such as 'the', 'of', and 'and'. We do not want them in our word cloud, as they are frequent and would crowd out the interesting words. If you have not used the NLTK stop word list before, download it first with nltk.download('stopwords').
from nltk.corpus import stopwords
corpus = [w for w in corpus if w not in stopwords.words('english')]
Again we use a list comprehension. The tokenizer also produces punctuation tokens, which we do not want in the word cloud. We can remove everything that is not purely alphanumeric, again with a list comprehension.
corpus = [w for w in corpus if w.isalnum()]
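str.isalnum() returns True only when every character in the string is a letter or a digit, so the punctuation tokens produced by the tokenizer are filtered out; a small illustration:

```python
# isalnum() is True only for strings made entirely of letters and digits.
tokens = ["elementary", ",", "watson", "!", "221b"]
cleaned = [t for t in tokens if t.isalnum()]
print(cleaned)  # ['elementary', 'watson', '221b']
```

Note that this also drops tokens containing apostrophes, such as the "n't" pieces word_tokenize produces from contractions.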
Lemmatizing reduces words to their base (dictionary) form, so that different inflections of the same word, such as 'run', 'runs', and 'running', are counted together rather than appearing as separate words in the cloud.
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
def get_wordnet_pos(word):
"""Map POS tag to first character lemmatize() accepts"""
tag = nltk.pos_tag([word])[0][1][0].upper()
tag_dict = {"J": wordnet.ADJ,
"N": wordnet.NOUN,
"V": wordnet.VERB,
"R": wordnet.ADV}
return tag_dict.get(tag, wordnet.NOUN)
lemmatizer = WordNetLemmatizer()
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]
Again we use list comprehension to achieve the result.
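The helper above maps the first letter of the Penn Treebank tag (e.g. 'VBG', 'NNS') to the WordNet constant that lemmatize() expects. Since wordnet.ADJ, wordnet.NOUN, wordnet.VERB, and wordnet.ADV are just the strings 'a', 'n', 'v', and 'r', the mapping logic can be sketched without NLTK installed:

```python
# The string values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB, wordnet.ADV.
tag_dict = {"J": "a", "N": "n", "V": "v", "R": "r"}

def map_tag(treebank_tag):
    # "VBG" -> "v" (verb), "NNS" -> "n" (noun); unknown tags default to noun.
    return tag_dict.get(treebank_tag[0].upper(), "n")

print(map_tag("VBG"))  # v
print(map_tag("NNS"))  # n
print(map_tag("FW"))   # n (default)
```

Defaulting to noun is a pragmatic choice: it is the most common word class, and lemmatize() falls back to it anyway when no POS is given.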
First we create a simple word cloud.
from wordcloud import WordCloud
unique_string = " ".join(corpus)
wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")
This will create an image word_cloud.png similar to this one.
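WordCloud builds its own frequency table when generate() is called. If you want to see which words will dominate the image before rendering it, you can count them yourself with collections.Counter; a small sketch with a stand-in list (in the tutorial, corpus is the lemmatized word list built above):

```python
from collections import Counter

# A tiny stand-in corpus; in the tutorial this is the lemmatized word list.
corpus = ["holmes", "watson", "holmes", "case", "holmes", "watson"]
frequencies = Counter(corpus)
print(frequencies.most_common(2))  # [('holmes', 3), ('watson', 2)]
```

If you prefer to pass such counts directly, WordCloud also has a generate_from_frequencies() method that accepts a dict of word-to-count mappings instead of a string.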
To shape the word cloud, for example like a cloud, we need a mask. We will use the cloud.png from the repository.
import numpy as np
from PIL import Image
unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")
This will generate a picture like this one.
You can get the full code from my GitHub repository.
If you clone it you get the full code as well as all the files you need.
import nltk
from nltk.corpus import stopwords
import os
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
import numpy as np
from PIL import Image
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
content = []
for filename in os.listdir('holmes/'):
    with open(f'holmes/{filename}') as f:
        content.append(f.read())
corpus = []
for item in content:
    corpus.extend([word.lower() for word in nltk.word_tokenize(item)])
corpus = [w for w in corpus if w not in stopwords.words('english')]
corpus = [w for w in corpus if w.isalnum()]
def get_wordnet_pos(word):
"""Map POS tag to first character lemmatize() accepts"""
tag = nltk.pos_tag([word])[0][1][0].upper()
tag_dict = {"J": wordnet.ADJ,
"N": wordnet.NOUN,
"V": wordnet.VERB,
"R": wordnet.ADV}
return tag_dict.get(tag, wordnet.NOUN)
lemmatizer = WordNetLemmatizer()
corpus = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in corpus]
unique_string = " ".join(corpus)
wordcloud = WordCloud(width=1000, height=500).generate(unique_string)
wordcloud.to_file("word_cloud.png")
unique_string_v2 = " ".join(corpus)
cloud_mask = np.array(Image.open("cloud.png"))
wordcloud = WordCloud(width=1000, height=500, background_color="white",
                      mask=cloud_mask, max_words=5000, contour_width=2, contour_color='black')
wordcloud.generate(unique_string_v2)
wordcloud.to_file("word_cloud_masked.png")