114

What is a good algorithm to determine the "difficulty" of a word for a hangman game, so that the game can select words to match a specified difficulty level?

Difficulty would seem related to the number of guesses required, the relative frequency of usage of letters (e.g. words with many uncommon letters may be harder to guess), and potentially the length of the word.

There are also some subjective factors to (attempt to) compensate for, such as the likelihood that a word is in the player's vocabulary and can be recognised, allowing a move from a guessing strategy based on letter frequencies alone to guessing based on a list of known matching words.

My attempt for now is below in ruby. Any suggestions on how to improve the categorisation?

def classify_word(w)
  n = w.chars.to_a.uniq.length # Num. unique chars in w
  if n < 5 and w.length > 4
    return WordDifficulty::Easy
  end
  if n > w.length / 2
    return WordDifficulty::Hard
  else
    return WordDifficulty::Medium
  end
end

I am writing a hangman game that I would like my children to play; I am rather too old to be attempting "homework", which may be why the question is receiving so many downvotes... Words are drawn randomly from large word databases, which include many obscure words, and are filtered by the difficulty level determined for each word.

grrussel
  • Why the downvotes? This is a decent question. I would make a difficulty function like `f(w) = (# unique letters) * (7 - # vowels) * (sum of the positions of unique letters in a list, ordered by frequency)`. From there, you can just split the range of the function into three segments and call those your difficulties. – Blender Apr 25 '13 at 19:44
  • I would suggest you do a web search for this -- likely there are algorithms or dictionaries that purport to compute/report the complexity of the word. I know there are for longer text. – Hot Licks Apr 25 '13 at 19:48
  • Related: http://www.youtube.com/watch?v=bBLm9P-ph6U (QI XL - The Hardest Word to Guess in Hangman) – Claus Jørgensen Apr 25 '13 at 22:15
  • Whatever you do, be sure to include EXTINCTIONSPECTROPHOTOPOLERISCOPEOCCULOGRAVOGYROKYNETOMETER. – Hot Licks Apr 25 '13 at 22:31
  • For users who may not be familiar with Ruby, maybe you want to explain what the first line of your method does? `n = w.chars.to_a.uniq.length` Does it count the number of unique letters? – T Nguyen Apr 26 '13 at 01:51
  • Is this an online game? Perhaps you could see how many people are able to solve a given problem and use that as a barometer. – Carlos Apr 26 '13 at 08:32
  • Have you checked that simply ranking words by how common they are for their length is insufficient? – Jack V. Apr 26 '13 at 13:12
  • I wouldn't rank words with many different chars as more difficult. Consider the extreme case of "abcdefghijklmnopqrstuvwxyz" (if that were a valid word). You can't lose a hangman game with that word as there are no invalid guesses, yet you are ranking it as Hard. – Guillermo Vasconcelos Apr 26 '13 at 13:52
  • Short words, particularly with uncommon letters or no repeated letters, can be quite hard as well. "QUAY" would have a higher difficulty than "QUARTER". – GalacticCowboy Apr 30 '13 at 19:45

12 Answers

92

1. Introduction

Here's a way to approach this problem systematically: if you have an algorithm that plays hangman well, then you can take the difficulty of each word to be the number of wrong guesses that your program would take if guessing that word.

2. Aside on hangman strategy

There's an idea that's implicit in some of the other answers and comments, that the optimal strategy for the solver would be to base its decisions on the frequency of letters in English, or on the frequency of words in some corpus. This is a seductive idea, but it's not quite right. The solver does best if it accurately models the distribution of words chosen by the setter, and a human setter may well be choosing words based on their rarity or avoidance of frequently used letters. For example, although E is the most frequently used letter in English, if the setter always chooses from the words JUGFUL, RHYTHM, SYZYGY, and ZYTHUM, then a perfect solver does not start by guessing E!
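A quick sanity check in Python bears this out for that four-word dictionary:

>>> words = 'jugful rhythm syzygy zythum'.split()
>>> sum('e' in w for w in words)   # E appears in none of these words...
0
>>> sum('y' in w for w in words)   # ...while Y misses only JUGFUL
3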

The best approach to modelling the setter depends on the context, but I guess that some kind of Bayesian inductive inference would work well in a context where the solver plays many games against the same setter, or against a group of similar setters.

3. A hangman algorithm

Here I'll outline a solver that is pretty good (but far from perfect). It models the setter as choosing words uniformly from a fixed dictionary. It's a greedy algorithm: at each stage it guesses the letter that minimizes the number of misses, that is, words that do not contain the guess. For example, if no guesses have been made so far, and the possible words are DEED, DEAD and DARE, then:

  • if you guess D or E, there are no misses;
  • if you guess A, there's one miss (DEED);
  • if you guess R, there are two misses (DEED and DEAD);
  • if you guess any other letter, there are three misses.

So either D or E is a good guess in this situation.

(Thanks to Colonel Panic in comments for pointing out that correct guesses are free in hangman—I totally forgot this in my first attempt!)

4. Implementation

Here's an implementation of this algorithm in Python:

from collections import defaultdict
from string import ascii_lowercase

def partition(guess, words):
    """Apply the single letter 'guess' to the sequence 'words' and return
    a dictionary mapping the pattern of occurrences of 'guess' in a
    word to the list of words with that pattern.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> sorted(list(partition('e', words).items()))
    [(0, ['star']), (2, ['mews']), (5, ['even', 'eyes']), (6, ['deed', 'peep'])]

    """
    result = defaultdict(list)
    for word in words:
        key = sum(1 << i for i, letter in enumerate(word) if letter == guess)
        result[key].append(word)
    return result

def guess_cost(guess, words):
    """Return the cost of a guess, namely the number of words that don't
    contain the guess.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> guess_cost('e', words)
    1
    >>> guess_cost('s', words)
    3

    """
    return sum(guess not in word for word in words)

def word_guesses(words, wrong = 0, letters = ''):
    """Given the collection 'words' that match all letters guessed so far,
    generate tuples (wrong, nguesses, word, guesses) where
    'word' is the word that was guessed;
    'guesses' is the sequence of letters guessed;
    'wrong' is the number of these guesses that were wrong;
    'nguesses' is len(guesses).

    >>> words = 'deed even eyes heel mere peep star'.split()
    >>> from pprint import pprint
    >>> pprint(sorted(word_guesses(words)))
    [(0, 1, 'mere', 'e'),
     (0, 2, 'deed', 'ed'),
     (0, 2, 'even', 'en'),
     (1, 1, 'star', 'e'),
     (1, 2, 'eyes', 'en'),
     (1, 3, 'heel', 'edh'),
     (2, 3, 'peep', 'edh')]

    """
    if len(words) == 1:
        yield wrong, len(letters), words[0], letters
        return
    best_guess = min((g for g in ascii_lowercase if g not in letters),
                     key = lambda g:guess_cost(g, words))
    best_partition = partition(best_guess, words)
    letters += best_guess
    for pattern, words in best_partition.items():
        for guess in word_guesses(words, wrong + (pattern == 0), letters):
            yield guess

5. Example results

Using this strategy it's possible to evaluate the difficulty of guessing each word in a collection. Here I consider the six-letter words in my system dictionary:

>>> words = [w.strip() for w in open('/usr/share/dict/words') if w.lower() == w]
>>> six_letter_words = set(w for w in words if len(w) == 6)
>>> len(six_letter_words)
15066
>>> results = sorted(word_guesses(six_letter_words))

The easiest words to guess in this dictionary (together with the sequence of guesses needed for the solver to guess them) are as follows:

>>> from pprint import pprint
>>> pprint(results[:10])
[(0, 1, 'eelery', 'e'),
 (0, 2, 'coneen', 'en'),
 (0, 2, 'earlet', 'er'),
 (0, 2, 'earner', 'er'),
 (0, 2, 'edgrew', 'er'),
 (0, 2, 'eerily', 'el'),
 (0, 2, 'egence', 'eg'),
 (0, 2, 'eleven', 'el'),
 (0, 2, 'enaena', 'en'),
 (0, 2, 'ennead', 'en')]

and the hardest words are these:

>>> pprint(results[-10:])
[(12, 16, 'buzzer', 'eraoiutlnsmdbcfg'),
 (12, 16, 'cuffer', 'eraoiutlnsmdbpgc'),
 (12, 16, 'jugger', 'eraoiutlnsmdbpgh'),
 (12, 16, 'pugger', 'eraoiutlnsmdbpcf'),
 (12, 16, 'suddle', 'eaioulbrdcfghmnp'),
 (12, 16, 'yucker', 'eraoiutlnsmdbpgc'),
 (12, 16, 'zipper', 'eraoinltsdgcbpjk'),
 (12, 17, 'tuzzle', 'eaioulbrdcgszmnpt'),
 (13, 16, 'wuzzer', 'eraoiutlnsmdbpgc'),
 (13, 17, 'wuzzle', 'eaioulbrdcgszmnpt')]

The reason that these are hard is that after you've guessed -UZZLE, you still have seven possibilities left:

>>> ' '.join(sorted(w for w in six_letter_words if w.endswith('uzzle')))
'buzzle guzzle muzzle nuzzle puzzle tuzzle wuzzle'

6. Choice of wordlist

Of course when preparing wordlists for your children you wouldn't start with your computer's system dictionary, you'd start with a list of words that you think they are likely to know. For example, you might have a look at Wiktionary's lists of the most frequently used words in various English corpora.

For example, among the 1,700 six-letter words in the 10,000 most common words in Project Gutenberg as of 2006, the most difficult ten are these:

[(6, 10, 'losing', 'eaoignvwch'),
 (6, 10, 'monkey', 'erdstaoync'),
 (6, 10, 'pulled', 'erdaioupfh'),
 (6, 10, 'slaves', 'erdsacthkl'),
 (6, 10, 'supper', 'eriaoubsfm'),
 (6, 11, 'hunter', 'eriaoubshng'),
 (6, 11, 'nought', 'eaoiustghbf'),
 (6, 11, 'wounds', 'eaoiusdnhpr'),
 (6, 11, 'wright', 'eaoithglrbf'),
 (7, 10, 'soames', 'erdsacthkl')]

(Soames Forsyte is a character in the Forsyte Saga by John Galsworthy; the wordlist has been converted to lower-case so it wasn't possible for me to quickly remove proper names.)

Gareth Rees
  • Good call on the frequently used word lists. http://invokeit.wordpress.com/frequency-word-lists/ has English and Swedish, so nice to have both. – grrussel Apr 25 '13 at 22:30
  • I would expect `bingle` to be rated harder than `single` or `tingle` - `bingle` is a less common word **and** `b` is a less common letter – BlueRaja - Danny Pflughoeft Apr 25 '13 at 22:40
  • Cool algorithm (and thanks for explaining in English before writing code!). But I think you should try to minimise the number of *incorrect* guesses. Thus, if the dictionary were [bat, bet, hat, hot, yum], I would guess 'T' (rather than B, A or H). If I'm right, it doesn't cost me anything. If I'm wrong, then only 'yum' remains. – Colonel Panic Apr 25 '13 at 22:42
  • Subjectively, I didn't even know some of those words were, actually, words in English so I would be unlikely to guess a letter I did not think would result in a word. – grrussel Apr 25 '13 at 22:45
  • @ColonelPanic: Good point! I must have been playing too much Mastermind (where every guess costs the same). – Gareth Rees Apr 25 '13 at 22:47
  • It might be good to take into account [letter frequency](http://en.wikipedia.org/wiki/Letter_frequency) for the language being used, or, if the dictionary is large enough, simply the dictionary, as letters aren't generally guessed randomly but typically have some relationship to letter frequency, even if the guesser doesn't know it. This extends what Colonel Panic is saying. – Nuclearman Apr 26 '13 at 00:56
  • This is a really cool algorithm, but I think it doesn't reflect the strategy human players are likely to use - instead of knowing every single word, humans are going to recognize (probabilistically) most common words, and otherwise will try to recognize suffixes and prefixes (like ion, ing) and failing that just guess common letters (starting with vowels, then doing t/r/s/n/etc). Not sure how to code this but it's something to think about :) – Patashu Apr 26 '13 at 04:04
  • To put a somewhat formal spin on this -- words whose entropy decreases slower as you "guess" letters are considered "harder". – TC1 Apr 26 '13 at 09:02
  • Great analysis. As @Patashu points out, the next step to make this even better would be rather than just take a dictionary of common words, to take a full dictionary of words but with annotations about commonality, and simply heuristically weigh in the commonness of the word with the letter-distribution-difficulty. But that's just for optional improvement -- this is already an excellent solution as it stands. – Ben Lee Apr 30 '13 at 21:45
21

A really simple way would be to compute a score based on the lack of vowels in the word, the number of unique letters, and the commonness of each letter:

letters = 'etaoinshrdlcumwfgypbvkjxqz'
vowels = set('aeiou')

def difficulty(word):
    unique = set(word)
    positions = sum(letters.index(c) for c in word)

    return len(word) * len(unique) * (7 - len(unique & vowels)) * positions

words = ['the', 'potato', 'school', 'egypt', 'floccinaucinihilipilification']

for word in words:
    print(difficulty(word), word)

And the output:

432 the
3360 potato
7200 school
7800 egypt
194271 floccinaucinihilipilification

You could then score the words with:

        score < 2000   # Easy
 2000 < score < 10000  # Medium
10000 < score          # Hard
Blender
  • Hi Blender, can you please tell me what the magic number 7 is there for? Why not 6 or 50? What happens if I put in another arbitrary number? – Pavan Sep 15 '14 at 15:49
  • @Pavan: Nothing really. The scores of all the words will be shifted up by the same amount. – Blender Sep 15 '14 at 21:38
  • yeah I noticed the shift when I was playing around with an online Python executor. I noticed something: when I type in something like fantastical compared to abhorrent, abhorrent would have a lower value than fantastical, despite fantastical being spelled more conventionally, and thus it should appear at a lower difficulty level in a word game. This made me realise that difficulty is subjective, but it had me thinking that there should be some sort of study done to outline which words are most difficult to spell over others, right? Could you point me to such a study pls? – Pavan Sep 16 '14 at 11:06
  • Or at least what a study like that would be called, since I'm having difficulty finding a collection of words with the percentage of people having incorrectly spelled the word on first attempt - which is what I'm now after. – Pavan Sep 16 '14 at 11:07
9

You can use the Monte Carlo method to estimate the difficulty of a word (a sketch follows the list below):

  • Simulate a game by guessing a random letter each time, weighted by the letter's frequency in your target language, and count how many guesses it takes your randomized player to arrive at a solution. Note that since each guess eliminates a letter, this process is finite, and it returns a number from 1 to 26, inclusive.
  • Repeat this process 2*N times, where N is the number of unique letters in your word,
  • Calculate the score by averaging the results of 2*N runs,
  • Determine the complexity level: scores less than ten indicate an easy word, and scores above sixteen indicate a hard word; everything else is medium.
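Here's a minimal Python sketch of that estimator (the frequency weights are rough illustrative values, `random.choices` needs Python 3.6+, and words are assumed to use only the letters a-z):

import random

# Rough relative frequencies of English letters (illustrative values only).
FREQ = {'e': 13, 't': 9, 'a': 8, 'o': 8, 'i': 7, 'n': 7, 's': 6, 'h': 6,
        'r': 6, 'd': 4, 'l': 4, 'c': 3, 'u': 3, 'm': 2, 'w': 2, 'f': 2,
        'g': 2, 'y': 2, 'p': 2, 'b': 1, 'v': 1, 'k': 1, 'j': 1, 'x': 1,
        'q': 1, 'z': 1}

def simulate_game(word):
    """Guess random frequency-weighted letters (never repeating a guess)
    until every letter of 'word' is found; return the number of guesses."""
    remaining = dict(FREQ)
    unfound = set(word.lower())
    guesses = 0
    while unfound:
        letters = list(remaining)
        guess = random.choices(letters, [remaining[c] for c in letters])[0]
        del remaining[guess]          # each guess eliminates a letter
        unfound.discard(guess)
        guesses += 1
    return guesses                    # always between 1 and 26

def monte_carlo_difficulty(word):
    runs = 2 * len(set(word))         # 2*N runs, N = number of unique letters
    score = sum(simulate_game(word) for _ in range(runs)) / runs
    return 'Easy' if score < 10 else 'Hard' if score > 16 else 'Medium'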
Sergey Kalinichenko
  • I think you should count only *incorrect* guesses. There's no penalty for correct guesses. – Colonel Panic Apr 25 '13 at 22:08
  • Why that number of repeats? I think this strategy (like most randomised strategies) has greater variance for *shorter* words. – Colonel Panic Apr 25 '13 at 22:11
  • @ColonelPanic I think that counting the total number of guesses is better, because it naturally incorporates the number of distinct letters into the answer. You may be right about the variance on shorter words being higher. Perhaps the number of repeats should be fixed then. However, I think 2N would be an OK start. – Sergey Kalinichenko Apr 26 '13 at 01:46
4

Previous similar discussion around the same topic: Determine the difficulty of an English word

I like the answer at the end of the link ^. For a kids' hangman game, just apply an approach like Scrabble does.

Assign a point value to each letter, then just add up the letters.
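For instance, a minimal sketch in Python with the standard Scrabble tile values (the easy/medium/hard cut-offs would be picked by experiment):

# Standard Scrabble tile values.
POINTS = {c: v for v, cs in [(1, 'aeilnorstu'), (2, 'dg'), (3, 'bcmp'),
                             (4, 'fhvwy'), (5, 'k'), (8, 'jx'), (10, 'qz')]
          for c in cs}

def scrabble_score(word):
    return sum(POINTS[c] for c in word.lower())

# scrabble_score('cat') == 5; scrabble_score('quay') == 16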

Alan Waage
  • This, together with avoiding rare or obscure words on easy levels, seems like the way forwards for now. A complication I had not mentioned is that the words are selected from huge dictionaries, the bulk of which must by definition be rarely used words :-) – grrussel Apr 25 '13 at 20:52
  • Point values could work, likely making use of [letter frequency](http://en.wikipedia.org/wiki/Letter_frequency). Although, some commonly used words may actually have oddly high point values. – Nuclearman Apr 26 '13 at 01:00
3

Just do it! Play hangman against the word. Count how many forfeits (i.e. incorrect guesses) it takes to beat it.

You'll need a strategy to play. Here's a human(ish) strategy. From the dictionary, strike out all the words that don't fit the reveals so far. Guess the letter most frequent among the words remaining.
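A sketch of that strategy in Python, assuming the revealed pattern uses '_' for unknown positions and that 'missed' and 'guessed' are the sets of wrong guesses and of all guesses so far:

from collections import Counter

def consistent(word, pattern, missed):
    """True if 'word' fits the reveals: revealed letters match, and no
    hidden position holds a letter already revealed elsewhere or missed."""
    if len(word) != len(pattern):
        return False
    revealed = set(pattern) - {'_'}
    return all(w == p if p != '_' else (w not in revealed and w not in missed)
               for w, p in zip(word, pattern))

def next_guess(words, pattern, missed, guessed):
    candidates = [w for w in words if consistent(w, pattern, missed)]
    # count each word once per letter, so we pick the letter occurring
    # in the most remaining words
    counts = Counter(c for w in candidates for c in set(w) if c not in guessed)
    return counts.most_common(1)[0][0]

# next_guess(['bat', 'bet', 'hat', 'hot', 'yum'], '___', set(), set()) -> 't'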

If your strategy is randomised, you can define your measure as the expected number of forfeits, and estimate that empirically.


Another deterministic strategy, from a hangman bot I wrote a few years ago: guess the letter that minimises the number of words remaining in the case where the guess is incorrect (i.e. optimise the worst case). Today I dislike this strategy for being too mechanical; I prefer the one above.

Colonel Panic
  • Haha, I was just going to suggest the same thing. But a serious version: write a simple bot that guesses using some simple strategy, then just run that a whole bunch of times over words from a dictionary. – Tikhon Jelvis Apr 25 '13 at 19:59
3

A while back I wrote a hangman solver using the obvious algorithm: given an initial dictionary of all possible words, at each turn we choose the letter that occurs in the most words remaining in the dictionary, then remove non-matching words (depending on the response) from the dictionary.

The algorithm isn't quite as straightforward as this, since there are often several letters which each occur in the same number of words in the dictionary. In this case, the choice of letter can make a significant difference to how many guesses are required for a word. Among these tied letters, we pick the one where the resulting information about the placement of that letter (if it is indeed in the word) gives the maximum information about the system (the letter with the maximum information entropy). E.g. if the two remaining possible words are 'encyclopedia' and 'encyclopedic', the letter 'c' is guaranteed to be in the word, just like e, n, y, l, o, p, d and i, but we should ask about 'c' first since its placement has nonzero information entropy.
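The linked source is C++; as a rough Python illustration, the tie-breaking quantity could be computed as the entropy of a letter's placement pattern over the remaining candidate words:

from collections import Counter
from math import log2

def placement_entropy(letter, words):
    """Entropy of the positions at which 'letter' occurs across the
    remaining words; used to break ties between equally frequent letters."""
    patterns = Counter(tuple(i for i, c in enumerate(w) if c == letter)
                       for w in words)
    n = len(words)
    return -sum(k / n * log2(k / n) for k in patterns.values())

# With ['encyclopedia', 'encyclopedic'] remaining, 'e' always occupies the
# same positions (entropy 0), but 'c' has two possible placements (1 bit),
# so 'c' is the more informative guess.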

Source (C++, GPL) is here

The result of all this is a list of words, with the number of guesses required for each one: difficulty.txt (630KB). The hardest word for this algorithm to find is "will" (with 14 failed guesses); the i and double l are guessed pretty quickly, but then the options include bill, dill, fill, gill, hill, kill, mill, pill, rill, till, will, and from then on the only option is to guess each letter in turn. Somewhat counterintuitively, longer words are guessed much more quickly (there just aren't that many of them to choose from).

Of course, in a human game of hangman, psychology (and breadth of vocabulary) play a much greater role than this algorithm accounts for...

Chris Johnson
2

First, of course, you'd generate a list of unique letters. Then sort by frequency (in English or whatever language -- there are lists for this), with less frequent letters having a higher difficulty.

Then you need to decide whether you combine the scores by adding, multiplying, or using some other scheme.
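For example, a minimal sketch of both combining schemes in Python, borrowing the frequency-ordered letter string from Blender's answer above (lowercase a-z assumed):

LETTERS = 'etaoinshrdlcumwfgypbvkjxqz'  # most to least frequent

def additive_score(word):
    # rarer letters sit later in LETTERS, so they contribute more difficulty
    return sum(LETTERS.index(c) + 1 for c in set(word))

def multiplicative_score(word):
    score = 1
    for c in set(word):
        score *= LETTERS.index(c) + 1
    return score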

Hot Licks
  • (Actually, you may not need to *sort* by frequency, but just accumulate the frequency scores. Though it may be that sorting provides some more info -- worth a try to see if it seems to do something for you.) – Hot Licks Apr 25 '13 at 20:04
  • And you may want to somehow account for letter combos -- ie, if there is a Q there is almost certainly a U, and a U makes a Q much more likely. So it might make sense, eg, to regard QU as a single letter, from a frequency POV. – Hot Licks Apr 25 '13 at 20:06
1

You're getting downvoted because you're asking us to build a very complex algorithm for you.

Why don't you just create three arrays (easy, medium, and hard) and populate each with a hundred or so words? It would take about 20 minutes.

I promise your kids will get bored of hangman long before they burn through a few hundred games... :D

BBagi
  • It doesn't have to be *that* complex. Take a look at Blender's comment, for example. Your answer doesn't really address the core question and isn't particularly useful. – Tikhon Jelvis Apr 25 '13 at 19:47
  • “Why don't you just create three arrays (easy, medium, and hard) and populate each with a hundred or so words?”: also called the “solve the problem by assuming the problem is already solved” method. – Pascal Cuoq Apr 25 '13 at 19:52
  • Criticism taken, thank you... I guess from an academic point of view, you're absolutely right, my answer doesn't solve anything. But from a practical point of view, that is, the easiest way to build a hangman game for your kids, my answer does solve it, cheaply and quickly. – BBagi Apr 25 '13 at 20:43
  • Fair point, the simplest approach is simply to rate and type 3 lists and be done. This way is more fun for me, though :-) – grrussel Apr 27 '13 at 12:51
  • @PascalCuoq Or you could say this is the approach to "solve the problem by assuming humans are better at choosing suitable lists than algorithms". Given that the questioner wants a game for children, it seems better that "hat, cat, sun" are in the easy list, and "xylophone, nought, school" are in the difficult list, even if those might be found with fewer guesses on average. – Darren Cook May 01 '13 at 23:25
  • @PascalCuoq There's nothing wrong with bypassing a complex problem through a simple solution if you can get away with it. There's nothing wrong with building complex algorithms for fun either, but the simple solution at least deserves a mention. – David May 17 '13 at 14:42
1

Well, potentially there could be a lot of things involved:

  1. As everyone said, the frequency of the individual letters;
  2. The length of a word definitely should count, but not in a linear way - a long word can make random guesses hit the letters, while a short one can be hard to get;
  3. Also, the words themselves should be considered - "bipartite" might be a word for people on SO, but maybe not for the non-technical population.

Actually, you could try to co-evolve several strategies, half of them for deciding the worth of a word, and half of them for trying to win the game. The latter group will try to maximize the score while the first one tries to minimize the score. After a while there could be a pattern, and then the half for deciding the worth of a word may give you some benchmarks.

zw324
  • The frequency of use of a word is a good point. My first attempt based on scoring unique letters by frequency claimed "eutectic" was an "easy" word. Google ngrams http://storage.googleapis.com/books/ngrams/books/datasetsv2.html seems likely to help identify words in common use, today. – grrussel Apr 25 '13 at 20:48
1

Start with a list of words and launch a Google search for each one. Let the number of hits serve as a (coarse) proxy of the term's difficulty.

In a refined version you'd group words by a synonym relation based on a thesaurus and determine the most difficult word of a category by counting the results of Google searches.

Taking the notion of n-grams one step further, the difficulty of a word could be rated by the frequency of its syllables in prose. That depends on the quality of the syllable statistics, of course. You'd probably have to differentiate between lexemes and function words (determiners, conjunctions etc.) and normalize by the number of syllables in the word (feels like overkill as I write...).

collapsar
0

I like the idea of building an algorithm that learns and changes depending on the users. At the beginning, you can implement any of the algorithms suggested to come up with the list, then as more people play the game, you assign a weight to each of the words depending on the number of guesses (which is also continually tracked and calculated). This prevents the issue of words that look complex but are popular and well known to players being given a difficult rating.
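A sketch of one way to keep such a weight, assuming each finished game reports how many wrong guesses it took:

from collections import defaultdict

stats = defaultdict(lambda: [0, 0])   # word -> [games played, total wrong guesses]

def record_game(word, wrong_guesses):
    stats[word][0] += 1
    stats[word][1] += wrong_guesses

def observed_difficulty(word, prior=6.0):
    """Average wrong guesses per game, falling back to a prior estimate
    for words that haven't been played yet."""
    games, total = stats[word]
    return total / games if games else prior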

Michael Lai
0

Compute the value of each letter of a word in Scrabble points: E=1, D=2, V=4, X=8 and so on. Add them up and divide by the number of letters to get an average letter value, and use that to score the word. Compute the average for each word in a large dictionary, and determine the break points between quartiles. Call words in the lowest quartile "easy", words in the two middle quartiles "medium", and words in the highest quartile "hard".
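A sketch of that scheme in Python (statistics.quantiles needs Python 3.8+; the tile values are the standard Scrabble ones):

from statistics import quantiles

POINTS = {c: v for v, cs in [(1, 'aeilnorstu'), (2, 'dg'), (3, 'bcmp'),
                             (4, 'fhvwy'), (5, 'k'), (8, 'jx'), (10, 'qz')]
          for c in cs}

def average_letter_value(word):
    return sum(POINTS[c] for c in word) / len(word)

def classify(words):
    scores = {w: average_letter_value(w) for w in words}
    q1, _, q3 = quantiles(scores.values(), n=4)   # quartile break points
    return {w: 'easy' if s <= q1 else 'hard' if s >= q3 else 'medium'
            for w, s in scores.items()}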

user448810