6

I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:

jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf

What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?

What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting TRIE solutions; but most of them are dealing with the situation where you know the set of words in advance.

unor
  • 92,415
  • 26
  • 211
  • 360
  • 1
    Well, once you pick a dictionary, you **do** "know the set of words in advance": they're the set of words in the dictionary you pick. – Tim Peters Oct 12 '13 at 19:17
  • 3
    Start with the first character. Is it a word? Add the next character. Is it a word? Is there a word that starts with those characters? NO - remove the first character from the big string and start over. YES - Is it a word? Is there a word that starts with those characters? NO - remove the first character from the big string and start over. YES - Is it a word? ........... – wwii Oct 12 '13 at 19:50
  • 2
    Or maybe a better idea- is the first dictionary word in the string? Is the second dictionary word in the string? Is the third dictionary word in the string? ... Is the nth dictionary word in the string? – wwii Oct 12 '13 at 19:54
  • 1
    You can use a binary/bisection search to search through the wordlist to see if a group of letters is a word or a prefix. – wwii Oct 12 '13 at 20:01
  • I'm not sure that it's worth worrying about efficiency; a quick brute-force search took only 0.6 seconds to find ~1700 words from a 98000-word dictionary in a 28000-character string. – DSM Oct 12 '13 at 20:14
  • 1
    (1) Jam your dictionary into a `trie` (2) Iterate over your data and see if you can match a trie entry against the current start position. – hughdbrown Oct 12 '13 at 20:59

3 Answers3

10

I used this solution to find all words forwards and backwards from a corpus of 28,000 random characters in a dictionary of 100,000 words in .5 seconds. It runs in O(n) time. It takes a file called "words.txt" which is a dictionary that has words separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words but I'm sure you can find plenty of text file dictionaries online if not that one.

from random import choice
import string

dictionary = set(open('words.txt','r').read().lower().split())
max_len = max(map(len, dictionary)) #longest word in the set of words

text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
text += '-'+text[::-1] #append the reverse of the text to itself

words_found = set() #set of words found, starts empty
for i in xrange(len(text)): #for each possible starting position in the corpus
    chunk = text[i:i+max_len+1] #chunk that is the size of the longest word
    for j in xrange(1,len(chunk)+1): #loop to check each possible subchunk
        word = chunk[:j] #subchunk
        if word in dictionary: #constant time hash lookup if it's in dictionary
            words_found.add(word) #add to set of words

print words_found
GrantS
  • 781
  • 7
  • 16
  • 3
    One very minor point: `text += text[::-1]` could introduce a problem, because it's possible that `"red"` only exists at the end of `text`, and after adding the reverse you have `"redder"` in the centre, which isn't a word in the original string. – DSM Oct 12 '13 at 20:20
  • 1
    I was aware of this but didn't think it a big enough deal to mention. I should have known SO would call me out ;) I edited it to fix the problem. – GrantS Oct 12 '13 at 20:23
  • Clearly you have for i in range, for j in range .... how could this run on O(n) time? – Gang Su Feb 09 '16 at 00:17
  • `j` is bounded by `max_len`, which is a constant (`max_len` is the length of the longest word in the dictionary). – GrantS Feb 10 '16 at 01:30
  • I think I have this working partially. I am using the test string 'abzcatrwxadfsa' and getting the results of >>> print (words_found) {'sava', 'saya', 'saka', 'alab', 'sara', 'ahab', 'saha', 'safa', 'sana', 'sala', 'sapa', 'sasa', 'arab', 'saga', 'sada', 'saba', 'sawa', 'anab', 'sama', 'akab', 'zcat'} The results contain letters not even in the string. What could be going wrong? – Dan Jan 24 '21 at 10:09
  • @Dan When I run the code with your string substituted instead of the 28,000 random characters, I get the following result: `set(['ab', 'ad', 'ca', 'as', 'at', 'ax', 'ta', 'ba', 'da', 'a', 'c', 'b', 'd', 'f', 'cat', 's', 'r', 't', 'w', 'x', 'sa', 'z'])`. Are you sure you are replacing the code that generates 28K random characters? That is, I change line 7 to `text = "abzcatrwxadfsa"`. – GrantS Jan 29 '21 at 02:16
1

Here is a bisection/binary search that should be usefull.

def isaprefix(frag, wordlist, first, last):
    """
    Recursive binary search of wordlist for words that start with frag.

    assumes wordlist is a sorted list
    typically called with first = 0 and last = len(wordlist)

    first,last -->> integer
    returns bool
    """

    # base case - down to two elements
    if (last - first) < 2:
        # return False unless frag is a prefix
        # of either of the two remaining words
        return wordlist[first].startswith(frag) or wordlist[last].startswith(frag)

    #mid = (first + last)/2
    midword = wordlist[(first + last) / 2]

    # go ahead and return if you find one
    # a second base case?
    if midword.startswith(frag):
        return True

    #print word, ' - ', wordlist[mid], ' - ', wordlist[mid][:len(word)], ' - ', isprefix
    # start the tests
    # python does just fine comparing strings
    if frag < midword:
        # set the limits to the lower half
        # of the previous range searched and recurse
        return isaprefix(frag, wordlist, first, mid-1)

    # frag is > midword: set the limits to the upper half
    # of the previous range searched and recurse
    return isaprefix(frag, wordlist, mid+1, last)
wwii
  • 23,232
  • 7
  • 37
  • 77
0

You can think of creating a sequence out of the entire dictionary and then aligning them to get the words in the sequence using smith water man or any heuristic local alignment algorithm