
I have a text file containing more than 100k words, one per line. I want to implement a function that returns the list of words containing a given substring. For example, if the substring is "coat", it would return words like "coating", "raincoat", "raincoats", etc.

Is there an efficient algorithm for this, given that the list of words won't change? I was thinking of implementing a generalized suffix tree using Ukkonen's algorithm, but I can't find a proper implementation of that anywhere. Is there a better way to solve this apart from using suffix trees?

EDIT: Each word is at most 100 characters long and contains only lowercase letters.

  • You could read the file into a buffer, create a suffix array, sort it, and then use binary search to look for matching substrings. Suffix arrays are vastly easier to construct than suffix trees, in my humble opinion. Since your file has newline separators, you can truncate the suffix lengths at each newline. – 500 - Internal Server Error Dec 29 '21 at 10:55
  • Thanks for the reply! So you mean I should create something like a generalized suffix array? There are multiple words (more than 100k), and each word is a separate string. – Abhimanyue Singh Tanwar Dec 29 '21 at 11:27
  • It's unclear what you mean by `word` - are you talking about the words in the search corpus or the words to match? – 500 - Internal Server Error Dec 29 '21 at 11:34
  • By word - I mean the words in the search corpus. (My text file contains these 100k+ words). – Abhimanyue Singh Tanwar Dec 29 '21 at 11:36
  • Then yes, basically. Keep in mind that for a suffix array you don't actually need to handle each string separately (allocate memory, and so on). All you need is the buffer of text plus an array of (buffer index, length) tuples, with one tuple per character of your input. – 500 - Internal Server Error Dec 29 '21 at 11:45

1 Answer

If you can afford the extra storage, you can generate all suffixes of all words and store them in a single sorted list (searched by binary search) or in a hash table. This multiplies the space requirement by roughly the average word length.
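A minimal Python sketch of the sorted-list variant, in case it helps. The function names `build_suffix_list` and `find_words` are made up for illustration, and the words are assumed to be already loaded into a list. Every word containing the substring has a suffix that starts with it, and those suffixes sit next to each other in sorted order, so one binary search finds the start of the run.

```python
from bisect import bisect_left


def build_suffix_list(words):
    """Return a sorted list of (suffix, word_index) pairs, one per suffix of every word."""
    entries = []
    for idx, word in enumerate(words):
        for start in range(len(word)):
            entries.append((word[start:], idx))
    entries.sort()
    return entries


def find_words(entries, words, sub):
    """Return the words containing `sub`, in their original order."""
    if not sub:
        return list(words)
    # (sub,) sorts before every (sub, idx), so this is the first suffix >= sub.
    i = bisect_left(entries, (sub,))
    matched = set()
    # All suffixes that start with `sub` form one contiguous run from position i.
    while i < len(entries) and entries[i][0].startswith(sub):
        matched.add(entries[i][1])
        i += 1
    return [words[idx] for idx in sorted(matched)]


words = ["coating", "raincoat", "raincoats", "boat", "cat"]
entries = build_suffix_list(words)
print(find_words(entries, words, "coat"))
```

The suffix copies are what cost the extra space here; the lookup itself is a binary search plus a scan over the matching run.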

If the average word length exceeds 4, it can be preferable to store 32-bit pointers that index the suffixes in the original strings (assumed null-terminated) instead of copying the suffixes themselves. It is also possible to use 3-byte integers in a packed format, as long as the total text is smaller than 16 MiB.
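A rough Python sketch of the offset-based variant, under a few assumptions: the words are joined into one buffer with a `"\0"` terminator after each word, the offsets go into an `array("I")` (typically 4 bytes per item, though that is platform-dependent), and the names `build_suffix_index` and `find_in_index` are hypothetical. Sorting with a string key still creates temporary copies, so this shows the layout rather than a memory-tight construction.

```python
from array import array

SEP = "\0"  # terminator; assumed never to occur inside a word


def suffix_at(buffer, pos):
    """Suffix starting at `pos`, truncated at the end of its word."""
    return buffer[pos:buffer.index(SEP, pos)]


def build_suffix_index(words):
    """Return (buffer, offsets): offsets of all suffixes, sorted by suffix text."""
    buffer = SEP.join(words) + SEP
    positions = [p for p, ch in enumerate(buffer) if ch != SEP]
    positions.sort(key=lambda p: suffix_at(buffer, p))
    return buffer, array("I", positions)


def find_in_index(buffer, offsets, sub):
    """Return the words containing the non-empty string `sub`, in original order."""
    lo, hi = 0, len(offsets)
    while lo < hi:  # binary search for the first suffix >= sub
        mid = (lo + hi) // 2
        if suffix_at(buffer, offsets[mid]) < sub:
            lo = mid + 1
        else:
            hi = mid
    starts = set()
    # Matches cannot span two words because `sub` contains no terminator.
    while lo < len(offsets) and buffer.startswith(sub, offsets[lo]):
        # Walk back to the start of the word that owns this suffix.
        starts.add(buffer.rfind(SEP, 0, offsets[lo]) + 1)
        lo += 1
    return [suffix_at(buffer, s) for s in sorted(starts)]


buffer, offsets = build_suffix_index(["coating", "raincoat", "raincoats", "boat"])
print(find_in_index(buffer, offsets, "coat"))
```

Compared with storing the suffixes themselves, the index costs one integer per character of input regardless of word length.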