How to build a clean word cloud using pytagcloud without a crowded image - Python

Question

In a previous question, I asked the community how to count the frequency of each pair of consecutive words in a sentence, and I got a great answer! Now I'm trying to build a word cloud from the results using the pytagcloud package.

The issue is that the picture produced is crowded and the words are squashed together. Is there a function to separate the words and make them readable, or is there an alternative way to do this in Python?
Thanks!

My code is below. This is the link to the text I used for testing. I tried using a smaller number of word combinations, but that didn't make the text in the picture any less crowded.
I also tried adjusting "layout" and "size", as well as "fontname='Lobster'" and "fontzoom=1", but none of them gives the optimal result, which is a clean word cloud picture where the words are not crowded.

import operator
import urllib2

from roundup.backends.indexer_common import STOPWORDS
import requests, collections, bs4
Data = "TEXT FROM The link above- TEXT file".split()  # split into words so zip() pairs words, not characters
two_words = [' '.join(ws) for ws in zip(Data, Data[1:])]
wordscount = {w:f for w, f in collections.Counter(two_words).most_common() if f > 12}
sorted_wordscount = sorted(wordscount.iteritems(), key=operator.itemgetter(1))

print sorted_wordscount

from pytagcloud import create_tag_image, create_html_data, make_tags, LAYOUT_HORIZONTAL, LAYOUTS, LAYOUT_MIX, LAYOUT_VERTICAL, LAYOUT_MOST_HORIZONTAL, LAYOUT_MOST_VERTICAL
from pytagcloud.colors import COLOR_SCHEMES
from pytagcloud.lang.counter import get_tag_counts

create_tag_image(make_tags(sorted_wordscount), 'filename.png', size=(1300,1150), background=(0, 0, 0, 255), layout=LAYOUT_MIX, fontname='Molengo', rectangular=True)

This is an example of the output I get: HERE
The optimal result would be something similar to one of the images HERE

vinaut · Accepted Answer · 2013-10-05T17:48:57.130

You are sorting the tags in ascending order, whereas pytagcloud probably expects them in descending order. Change the sorting line to:

sorted_wordscount = sorted(wordscount.iteritems(), key=operator.itemgetter(1),reverse=True)
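
To make the difference concrete, here is a minimal sketch with made-up counts (the phrases and numbers are hypothetical, not taken from the linked text). It uses Python 3 spelling, where `items()` replaces the Python 2 `iteritems()` used in this thread.

```python
import operator

# Hypothetical bigram counts, standing in for the real wordscount dict.
wordscount = {"united states": 40, "shall be": 25, "of the": 90}

# Ascending sort (the question's version): the rarest pair comes first.
ascending = sorted(wordscount.items(), key=operator.itemgetter(1))

# Descending sort (the fix): the most frequent pair comes first.
descending = sorted(wordscount.items(), key=operator.itemgetter(1), reverse=True)

print(ascending[0])
print(descending[0])
```

With the descending list, slicing such as `descending[:50]` keeps the most frequent pairs rather than the rarest ones.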

Once that is fixed, the key parameter is maxsize in make_tags:

create_tag_image(make_tags(sorted_wordscount[:],maxsize=200), 'filename.png', size=(1300,1150), background=(0, 0, 0, 255), layout=LAYOUT_MIX, fontname='Molengo', rectangular=True)

If I understand correctly, this sets the maximum font size (that of the most frequent tag), and all the other sizes are calculated relative to it. The other parameter that influences how the tags are distributed is size, the dimensions of the image.

You will have to play with these parameters.
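
To get a feel for how maxsize interacts with the counts, the sketch below maps counts to font sizes by plain linear interpolation. This is an illustration only: the function name `scale_font_sizes`, its defaults, and the linear formula are assumptions, and pytagcloud's actual scaling may differ.

```python
def scale_font_sizes(counts, minsize=3, maxsize=200):
    """Map raw counts to font sizes by linear interpolation (illustrative only)."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # All tags are equally frequent: give them all the largest size.
        return [maxsize for _ in counts]
    span = float(hi - lo)
    return [int(round(minsize + (maxsize - minsize) * (c - lo) / span))
            for c in counts]

# Hypothetical counts, most frequent first.
counts = [90, 40, 25, 13]
small = scale_font_sizes(counts, maxsize=36)
large = scale_font_sizes(counts, maxsize=200)

print(small)
print(large)
```

Under this assumption, raising maxsize widens the gap between frequent and rare tags, which is why it has to be balanced against the image size and the number of tags.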

Take into account that the library function get_tag_counts does more than just return the frequencies: it also filters out common words, lowercases the text, and in general should give you a better distribution of tags than the simple sorting you are doing.
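
For readers without the library at hand, the kind of preprocessing described here can be approximated with the standard library. This is a rough sketch, not get_tag_counts itself: the function name `simple_tag_counts`, the tokenizing regex, and the tiny stopword set are all assumptions made for illustration.

```python
import collections
import re

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "be", "shall"}

def simple_tag_counts(text, limit=50):
    """Lowercase, tokenize, drop stopwords, return (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    return collections.Counter(kept).most_common(limit)

# Made-up sample text.
sample = "The Congress shall have Power. The Congress of the United States shall be vested."
print(simple_tag_counts(sample, limit=3))
```

Without the stopword filter, words such as "the" would take the largest font and crowd out the more informative tags.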

With these changes you should get something like this (obtained with get_tag_counts over the file you linked in your post, in a 1000x1000 window, maxsize=260 and capping to the first 50 tags):

[Image: the resulting word cloud]

Edit - As requested, the code for creating the image above:

import operator
import os
import urllib2

from roundup.backends.indexer_common import STOPWORDS
import requests, collections, bs4
with open("./const11.txt") as file:
  Data1 = file.read().lower()
  Data = Data1.split()
two_words = [' '.join(ws) for ws in zip(Data, Data[1:])]
wordscount = {w:f for w, f in collections.Counter(two_words).most_common() if f > 5}
sorted_wordscount = sorted(wordscount.iteritems(), key=operator.itemgetter(1),reverse=True)

from pytagcloud import create_tag_image, create_html_data, make_tags, LAYOUT_HORIZONTAL, LAYOUTS, LAYOUT_MIX, LAYOUT_VERTICAL, LAYOUT_MOST_HORIZONTAL, LAYOUT_MOST_VERTICAL
from pytagcloud.colors import COLOR_SCHEMES
from pytagcloud.lang.counter import get_tag_counts

tags = make_tags(get_tag_counts(Data1)[:50],maxsize=260)
create_tag_image(tags, 'filename.png', size=(1000,1000), background=(0, 0, 0, 255), layout=LAYOUT_MIX, fontname='Lobster', rectangular=True)

This was run with Python 2.7.5 on Ubuntu 13.04, with pygame installed via apt-get and the rest of the packages via pip. "const11.txt" is the text file linked in the question.

Hi vinaut!!! Thank you very much for your great answer!!! I tried to replicate the results but I failed, and your cloud looks 1000 times better than mine! Can you please post your code so I can see what I did wrong? Again, thank you very much!!!! - mongotop, Oct 05 '13 at 17:29
No worries, edited the answer with the code used for generating the image. - vinaut, Oct 05 '13 at 17:43
Thank you very much vinaut! PS - You have some magic in your laptop! :) http://imgur.com/CmoOB7y This is the best I could get using maxsize=50 for 25 words, size=(1300,1100). I don't know why it doesn't arrange the words in a rectangle like yours, even with rectangular=True. - mongotop, Oct 06 '13 at 08:28
I'm sorry it's not working out for you: that is the exact code I used. You might have a package whose version differs from mine and is screwing things up. As a last resort, you can try using a virtual machine (or an EC2 instance) with a clean Ubuntu installation, and install all the packages as I did. - vinaut, Oct 06 '13 at 10:04
You are right! The versions might be the issue. Can you please post the versions of the packages you are using for this script? I would really appreciate it! Thank you very much for your support! - mongotop, Oct 06 '13 at 20:27
As I said, I installed everything from scratch (python, pygame, all the packages) on a fresh Ubuntu virtual machine, with apt-get and pip. Do the same and you should get the same result :) - vinaut, Oct 11 '13 at 15:32

Alanyst · Answer 2 · 2013-10-03T14:58:20.627

EDIT: While the TAG_PADDING parameter referenced below in my answer might be of interest for some cases, vinaut's answer is clearly the better one to start with.

Looking at https://github.com/atizo/PyTagCloud/blob/master/pytagcloud/__init__.py, it looks like TAG_PADDING might be the parameter that controls the spacing between words.

Because it's set to a literal value in the source code and is referenced in several places, you will either have to change the value in the source code to one that suits you better (and repackage/reinstall), or else copy the source into your own project and alter it there.
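
One further possibility, not confirmed anywhere in this thread: if TAG_PADDING is an ordinary module-level name that the library's functions look up each time they run, it might be overridden at runtime after import instead of editing the source. The sketch below only demonstrates that Python mechanism on a stand-in module; `fake_lib` and `padded_width` are hypothetical names, not part of pytagcloud, and whether the same trick works on pytagcloud is an assumption to verify.

```python
import types

# Build a stand-in module; fake_lib and padded_width are hypothetical names.
fake_lib = types.ModuleType("fake_lib")
exec(
    "TAG_PADDING = 5\n"
    "def padded_width(width):\n"
    "    # Reads the module-level constant each time it is called.\n"
    "    return width + 2 * TAG_PADDING\n",
    fake_lib.__dict__,
)

before = fake_lib.padded_width(100)

# Override the constant from outside, without touching the module's source.
fake_lib.TAG_PADDING = 12
after = fake_lib.padded_width(100)

print(before, after)
```

This would not help if the value were captured earlier, for example as a default argument or copied into another module with a `from ... import`, in which case editing the source as described above remains the way to go.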
