I want to create an algorithm that finds the substrings of a string that consist only of characters from a given alphabet. For example, in the string "abcdefhhhasddasdabbba" I want to find the substrings that are generated by the alphabet {'a','b'}.

So my output will be like: ab, a, abbba

If I use a finite-state machine, I have to create states that exactly cover my outputs, so I would have to enumerate all possible combinations up to the length of my input string, which I do not think is efficient at all.

If I use a suffix tree, how could I find substrings inside the tree that may be neither a prefix nor a suffix? Should I keep, for each node, an array that holds data about the subtree, and then check whether the array contains characters which are not included in the alphabet?

Edit:

Complexity is crucial.

bill
  • The finite-state solution to this problem only requires two states: one for `{a,b}` and one for the rest of the alphabet. – Fred Foo May 24 '14 at 12:38
  • I think not, because if it has only 2 states then as output I'll get a, b, a, b ... – bill May 24 '14 at 12:43
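
To illustrate the two-state idea from the comment above, here is a minimal sketch in Python. The function name `runs_over_alphabet` and the state names `INSIDE`/`OUTSIDE` are made up for this example and do not come from the thread. The point is that the machine only emits on the transition from "inside {a,b}" to "outside", so it yields whole runs rather than single letters.

```python
def runs_over_alphabet(text, alphabet):
    """Return the maximal runs of `text` made only of characters in `alphabet`.

    Hypothetical helper for illustration: a two-state scanner that remembers
    where the current run started and emits it when the run ends.
    """
    INSIDE, OUTSIDE = 0, 1
    state = OUTSIDE
    start = 0
    runs = []
    for i, ch in enumerate(text):
        if ch in alphabet:
            if state == OUTSIDE:
                # Transition OUTSIDE -> INSIDE: a new run begins here.
                start = i
                state = INSIDE
        elif state == INSIDE:
            # Transition INSIDE -> OUTSIDE: the run text[start:i] is complete.
            runs.append(text[start:i])
            state = OUTSIDE
    if state == INSIDE:
        # The input ended in the middle of a run, so flush it.
        runs.append(text[start:])
    return runs


print(runs_over_alphabet("abcdefhhhasddasdabbba", {"a", "b"}))
# ['ab', 'a', 'a', 'abbba']
```

Note that "a" shows up twice because it occurs as two separate runs in the input; if only distinct substrings are wanted, the result could be deduplicated afterwards.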

1 Answer

This can be done with a trivial loop; no special data structures are needed.

for each letter in word
  if letter in alphabet, then append it to the current "X"
  otherwise emit the current "X" (if not empty) and set "X" to the empty string
after the loop, emit the current "X" if it is not empty

Python example:

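word = 'aababadadadab'
alphabet = {'a', 'b'}  # set membership is O(1) on average

X = ''
# '$' is a sentinel outside the alphabet; it flushes the final run
for letter in word + '$':
  if letter in alphabet:
    X += letter
  else:
    if X:  # skip empty runs between consecutive non-alphabet letters
      print(X)
    X = ''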

Output:

aababa
a
a
ab

I use '$' as a sentinel character from outside the alphabet, to keep the code simpler.

lejlot
  • Sorry for the misinformation, but I want to do that with the lowest complexity. If I check each letter of the input against each character in the alphabet, then the complexity will be O(n*m), n: len(input), m: len(alphabet) – bill May 24 '14 at 12:23
  • You are wrong. "Checking a letter" is O(1) using a hash set, so the algorithm is O(n), the fastest possible. – lejlot May 24 '14 at 12:24
  • With a fixed alphabet, you can also use a bit vector instead of a hash table. – Fred Foo May 24 '14 at 12:30
  • @larsmans Yet you still need a mapping from input characters to bit vector positions, which would require hashing (for O(1) access). So what is the difference? – lejlot May 24 '14 at 12:32
  • @lejlot You don't need to hash anything, just use the character code as an index. (That won't be practical for full Unicode, but for smaller alphabets it's faster and simpler.) – Fred Foo May 24 '14 at 12:35
  • This is exactly the same as using a hash table with identity hashing and a size equal to the maximum code :) – lejlot May 24 '14 at 12:37
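
For what it's worth, the "use the character code as an index" idea from the comments could look roughly like the sketch below. The names `build_table` and `runs_with_table` are hypothetical, and the table size of 256 is an assumption that the alphabet only contains characters with codes below 256; input characters with larger codes are simply treated as outside the alphabet.

```python
def build_table(alphabet, size=256):
    """Build a boolean lookup table indexed by character code.

    Assumption for this sketch: every alphabet character has ord(ch) < size.
    """
    table = [False] * size
    for ch in alphabet:
        code = ord(ch)
        if code >= size:
            raise ValueError(f"character {ch!r} does not fit in a table of size {size}")
        table[code] = True
    return table


def runs_with_table(text, table):
    """Return maximal runs of characters whose code is marked in `table`."""
    runs = []
    current = []
    for ch in text:
        code = ord(ch)
        # Direct indexing replaces the hash-set lookup; codes beyond the
        # table are treated as not in the alphabet.
        if code < len(table) and table[code]:
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        # Flush the last run if the input ends inside one.
        runs.append("".join(current))
    return runs


table = build_table({"a", "b"})
print(runs_with_table("aababadadadab", table))
# ['aababa', 'a', 'a', 'ab']
```

Either way the scan is a single pass over the input, so the overall cost stays O(n); the table only changes how the per-character membership test is done.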