I have several applications that create a human-readable checksum or digital signature that is unique with high probability: they apply a cryptographic hash such as MD5, then feed the resulting bits to an arithmetic coder that selects words from a list. I've simply been using /usr/share/dict/words, but recently a client (rightly) complained about receiving a document whose checksum included offensive or trigger words. (More details are in my answer to Generate User Friendly Codes.)

For this application, long lists are important because they make repeats unlikely; the list I'm using has many tens of thousands of words.
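A minimal sketch of the scheme described above, offered as an illustration rather than the code actually in use. It swaps the arithmetic coder for plain base-N digit extraction (repeated `divmod` by the list length), which is simpler but shows the same idea; `hash_to_words` and its parameters are hypothetical names.

```python
import hashlib


def hash_to_words(data: bytes, words: list[str]) -> list[str]:
    """Map the MD5 digest of `data` to a sequence of words from `words`.

    Assumption: simple base-N digit extraction stands in for the
    arithmetic coder mentioned in the question.
    """
    if len(words) < 2:
        raise ValueError("need at least two words")
    # Treat the 128-bit digest as one big integer.
    n = int.from_bytes(hashlib.md5(data).digest(), "big")
    base = len(words)
    remaining = 2 ** 128  # number of digest values still to be distinguished
    out = []
    while remaining > 1:
        n, digit = divmod(n, base)  # peel off one base-N digit
        out.append(words[digit])
        remaining = -(-remaining // base)  # ceiling division
    return out
```

The same input always yields the same words, and a longer list means fewer words per checksum.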

Does anyone know either how to remove offensive and trigger words from such a list, or where to find a list of innocuous words?

Norman Ramsey
  • Basically, you are asking us to find you a list of innocuous words, or a list of offensive words. That is off-topic. But the obvious solutions are: 1) find a list of words and manually remove the offensive ones, 2) automatically remove all words in an offensive word list from a larger word list. But remember Scunthorpe! – Stephen C Mar 10 '18 at 00:54
  • Or another approach: use a *smaller* list of words that is easier to vet. For this purpose, it makes little difference if you use (say) a 2,000 word list or a 20,000 word list. – Stephen C Mar 10 '18 at 00:58
  • My client seemed to think there were automated tools that could curate such a list, which I hoped might be in scope. Ruling out Scunthorpe is OK provided I can get a list of around 10,000 words (which takes the probability of a repeat sufficiently low for my purposes). – Norman Ramsey Mar 10 '18 at 01:12
  • Nope. Asking for a tool recommendation is off-topic. But there is this wonderful tool called Google search. And, to be honest, if your customer is so worried about this, then they should be doing the vetting and providing the list to you. In this case, since everyone has a different idea on what words are offensive, **they** should be making the judgment. – Stephen C Mar 10 '18 at 01:13
  • See https://stackoverflow.com/help/on-topic ... item 4 in the list. – Stephen C Mar 10 '18 at 01:22
  • If I just had a list of license plate letters, that would be a bounty answer. I can make a sounds-like query. I know 9=G, 5=S, 2=Z, and i/l/1 = [I|L]; am I missing something? A snippet of code to start from is all I need IRL for the bounty. I occasionally have to print codes, so they must be clean. – danny117 Apr 24 '18 at 14:06
  • For MD5, a million word dictionary would transform each checksum into 7 words. For a 10,000 word dictionary, it'd only be 10. What purpose will this human-readable checksum serve? Are people comparing them? Could you use emojis or pictures instead? – Blender Apr 28 '18 at 05:06
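The word counts in the last comment can be checked with a few lines of arithmetic. This is a sketch, with `words_needed` a hypothetical helper: it finds the smallest number of words whose combinations cover every possible digest value, assuming each word is drawn independently from the full list.

```python
def words_needed(bits: int, list_size: int) -> int:
    """Smallest k such that list_size**k >= 2**bits (integer arithmetic only)."""
    if list_size < 2:
        raise ValueError("list_size must be at least 2")
    target = 2 ** bits
    k, capacity = 0, 1
    while capacity < target:
        capacity *= list_size  # one more word multiplies the code space
        k += 1
    return k


print(words_needed(128, 1_000_000))  # 7
print(words_needed(128, 10_000))     # 10
```

A 2,000-word list would need 12 words for the full 128 bits, so list size changes the length only slowly.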

1 Answer

One possibility is the ENABLE word list, used by Words with Friends and some other games. They try to avoid controversial words (pick your favorites and you won't find them there!-) It is in the public domain, so you can find it here and there. It's roughly 172,000 words. Here is one place I found it: http://www.greenworm.net/sites/default/files/gw-assets/enable1-wwf-v4.0-wordlist.txt

Also, Scrabble has divergent lists: the company that owns the game has the "filtered" list, while the clubs use the unfiltered lists for competition. I don't want to post a link to offensive material, but if you Google "seattle scrabble club expurgated words", you might find a list of the words removed from the naughty list to produce the nice list. If all the words you got complaints about appear on that list, you could just use it as a filter.
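The filtering idea can be sketched as below, assuming both lists are plain text with one word per line. The function names and the blocklist file name are hypothetical. It removes whole-word matches only, never substrings, so a word like "Scunthorpe" survives even if a fragment of it is on the blocklist.

```python
from typing import Iterable


def load_lines(path: str) -> list[str]:
    """Read one word per line from a text file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def filter_words(words: Iterable[str], blocked: Iterable[str]) -> list[str]:
    """Return `words` minus any case-insensitive whole-word match in `blocked`."""
    blocked_set = {b.strip().lower() for b in blocked if b.strip()}
    seen = set()
    kept = []
    for raw in words:
        word = raw.strip()
        key = word.lower()
        # Skip blanks, blocked words, and case-insensitive duplicates.
        if not word or key in blocked_set or key in seen:
            continue
        seen.add(key)
        kept.append(word)
    return kept


# Example usage; "expurgated.txt" is an assumed local copy of a blocklist:
# clean = filter_words(load_lines("/usr/share/dict/words"),
#                      load_lines("expurgated.txt"))
```

It is worth checking the length of the result afterwards, since the filtered list still needs to be long enough to keep repeats unlikely.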

wordragon