4

I'm trying to find a substring from a string text that is an anagram to a string pattern.

My Question: Could the Rabin-Karp algorithm be adjusted to this purpose? Or are there better algorithms?

I have tried out a brute-force algorithm, which did not work in my case because the text and the pattern can each be up to one million characters.

Update: I've heard there is a worst-case O(n^2) algorithm that uses O(1) space. Does anyone know what this algorithm is?

Update 2: For reference, here is pseudocode for the Rabin-Karp algorithm:

function RabinKarp(string s[1..n], string sub[1..m])
    hsub := hash(sub[1..m]);  hs := hash(s[1..m])
    for i from 1 to n-m+1
       if hs = hsub
          if s[i..i+m-1] = sub
              return i
       hs := hash(s[i+1..i+m])
    return not found

This uses a rolling hash function so that each new window hash can be calculated in O(1). The overall search is O(nm) in the worst case, but with a good hash function it is O(m + n) in the best case. Is there a rolling hash function that would produce few collisions when searching for anagrams of the string?
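For readers who want to run the pseudocode above, here is a minimal Python sketch of exact-match Rabin-Karp with a polynomial rolling hash. The base (256) and modulus (2^61 - 1) are assumptions chosen for illustration, not part of the question, and indices are 0-based rather than 1-based.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = (1 << 61) - 1) -> int:
    """Return the 0-based index of the first exact match, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Weight of the character that leaves the window on each step.
    high = pow(base, m - 1, mod)

    h_pat = h_win = 0
    for i in range(m):
        h_pat = (h_pat * base + ord(pattern[i])) % mod
        h_win = (h_win * base + ord(text[i])) % mod

    for i in range(n - m + 1):
        # Verify on a hash match, because different strings can collide.
        if h_win == h_pat and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # O(1) update: drop text[i], append text[i + m].
            h_win = (h_win - ord(text[i]) * high) % mod
            h_win = (h_win * base + ord(text[i + m])) % mod
    return -1
```

This hash depends on character order, which is why it cannot be reused unchanged for anagram search.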

Shai
Rami Jarrar

4 Answers

9

Compute a hash of the pattern that doesn't depend on the order of the letters in the pattern (for example, use the sum of the character codes for each letter). Then apply the same hash function in "rolling" fashion to the text, as in Rabin-Karp. If the hashes match, you need to perform a full test of the pattern against the current window in the text, because the hash may collide with other values too.
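A minimal Python sketch of the sum-based rolling hash described above. The function name and the use of `collections.Counter` for the full test are assumptions made for illustration.

```python
from collections import Counter


def find_anagram_sum_hash(text: str, pattern: str) -> int:
    """Return the 0-based start of the first anagram of pattern, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1

    target = sum(map(ord, pattern))       # order-independent hash
    target_counts = Counter(pattern)      # used only to confirm a hit
    window = sum(map(ord, text[:m]))

    for i in range(n - m + 1):
        # Equal sums are only a hint: "ad" and "bc" have the same sum.
        if window == target and Counter(text[i:i + m]) == target_counts:
            return i
        if i + m < n:
            # Roll the window: add the entering code, drop the leaving one.
            window += ord(text[i + m]) - ord(text[i])
    return -1
```

The sum is cheap to roll but collides often, so the confirmation step runs frequently on unlucky inputs.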


By associating each symbol in your alphabet to a prime number, then computing the product of those prime numbers as your hash code, you will have fewer collisions.

There is, however, a bit of mathematical trickery that will assist you if you want to compute a running product like this: each time you step the window, multiply the running hash-code by the multiplicative inverse of the code for the symbol that is leaving the window, then multiply by the code for the symbol that is entering the window.

As an example, suppose you are computing the hash of letters 'a'–'z' as an unsigned, 64-bit value. Use a table like this:

symbol | code | code^-1 (mod 2^64)
-------+------+---------------------
   a   |    3 | 12297829382473034411
   b   |    5 | 14757395258967641293
   c   |    7 |  7905747460161236407
   d   |   11 |  3353953467947191203
   e   |   13 |  5675921253449092805
  ...
   z   |  103 | 15760325033848937303

The multiplicative inverse of n is the number that yields 1 when multiplied by n, modulo some number. The modulus here is 2^64, since you are using 64-bit numbers. So, 5 * 14757395258967641293 should be 1 (mod 2^64), for example. This works because unsigned 64-bit multiplication is arithmetic in the ring of integers modulo 2^64, in which every odd number has an inverse.
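As a quick check of the table, here is a small Python sketch. It assumes Python 3.8 or later, where the built-in `pow` accepts a negative exponent together with a modulus; the helper name is hypothetical.

```python
MOD = 1 << 64  # unsigned 64-bit arithmetic wraps modulo 2^64


def inverse_mod_2_64(code: int) -> int:
    """Multiplicative inverse of an odd code modulo 2^64."""
    # pow raises ValueError if code is even, since no inverse exists then.
    return pow(code, -1, MOD)


# Multiplying a code by its inverse gives 1 modulo 2^64.
for code in (3, 5, 7, 11, 13, 103):
    assert code * inverse_mod_2_64(code) % MOD == 1
```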

Computing a list of the first primes is easy, and your platform should have a library to efficiently compute the multiplicative inverse of these numbers.

Start coding with the number 3 because 2 shares a factor with the size of an integer (a power of 2 on whatever processor you are working on), and so cannot be inverted.
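Putting the pieces of this answer together, a rough Python sketch might look like the following. It assumes the text and pattern contain only lowercase 'a' to 'z' and Python 3.8+ for modular `pow`; the names `CODES`, `INVERSES`, and `find_anagram_prime_hash` are hypothetical.

```python
from collections import Counter

MOD = 1 << 64


def first_odd_primes(count: int) -> list:
    """First `count` odd primes, by trial division (fine for 26)."""
    primes = []
    candidate = 3
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 2
    return primes


# Assumption: alphabet is lowercase a-z, coded 3, 5, 7, ..., 103.
CODES = dict(zip("abcdefghijklmnopqrstuvwxyz", first_odd_primes(26)))
INVERSES = {ch: pow(code, -1, MOD) for ch, code in CODES.items()}


def find_anagram_prime_hash(text: str, pattern: str) -> int:
    """Return the 0-based start of the first anagram of pattern, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1

    target = window = 1
    for i in range(m):
        target = target * CODES[pattern[i]] % MOD
        window = window * CODES[text[i]] % MOD
    target_counts = Counter(pattern)

    for i in range(n - m + 1):
        # The product wraps modulo 2^64, so a match must be confirmed.
        if window == target and Counter(text[i:i + m]) == target_counts:
            return i
        if i + m < n:
            # "Divide out" the leaving symbol, multiply in the entering one.
            window = window * INVERSES[text[i]] % MOD
            window = window * CODES[text[i + m]] % MOD
    return -1
```

In a language with fixed-width unsigned integers the `% MOD` steps are implicit in the overflow behavior.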

erickson
  • Are you sure the average case is O(n)? It seems like a lot of different substrings could be false positives. – templatetypedef Feb 03 '13 at 01:42
  • I've implemented this hash (sum the character ASCII code) and using rabin-karp and it worked great,, thank you :) – Rami Jarrar Feb 07 '13 at 19:04
  • I've tested it, it collides for each string 10% of the whole substrings tested, is there a better rolling hash could be used ? – Rami Jarrar Feb 07 '13 at 21:32
  • Are you simply summing the codes in an `int`? You could try computing the product of the codes in a `long`. Collisions would be less likely if your hash codes are distributed in a larger space. – erickson Feb 07 '13 at 22:23
  • yeah i'm summing codes in an `int`, i tried that but it's not working ! – Rami Jarrar Feb 07 '13 at 22:50
  • maybe my fault in using rolling hashing, how to slide the window, I multiply the code of the new character and divide by the first character,, i think this is the problem,, right ? – Rami Jarrar Feb 11 '13 at 23:31
  • @RamiJarrar Yes, I'm afraid simply dividing by the first character won't work in general. If your string is long, the product becomes too big to fit in a integer (you'd have to use an arbitrary-precision library, which would be too slow), so the result is truncated; you are really doing multiplication modulo 2^64 or whatever. To do that correctly, you have to use a multiplicative inverse, rather than division, as I describe in my answer. Downside is a little study, upside is that it's fast and collision-resistant. – erickson Feb 11 '13 at 23:49
  • 1
    @RamiJarrar See if you can "un-accept" this answer. I think the answers from [Dave](http://stackoverflow.com/a/14836347/3474) and [templatetypedef](http://stackoverflow.com/a/14668006/3474) (which are essentially the same solution) are easier to understand and should work. With my solution, you need something like the histogram to verify that what is in the window is actually an anagram, and not just a false collision. Once you have that, using it to implement their solutions is a small step. – erickson Feb 12 '13 at 17:35
  • @erickson: your solution worked great for me, I'm just trying to use better hash function :) – Rami Jarrar Feb 12 '13 at 17:44
  • Are small primes the ideal set of mappings? Perhaps other numbers could produce fewer collisions. The obvious property of small primes is that they're the slowest to overflow, but overflow isn't necessarily something to be avoided. If you switched the role of code and inverse, for example, you'd be certain to get exactly the same outcome despite overflowing on almost every operation and not using primes. – sh1 Mar 05 '17 at 02:39
8

One option would be to maintain a sliding window holding a histogram of the letters contained within the window. If that histogram ever ends up equal to the character histogram for the string whose anagram should be found, then you know that what you are looking at is a match and can output it. If not, you know that what you have cannot possibly be a match.

More concretely, create an associative array A mapping from characters to their frequencies. If you want to search for an anagram of string P, then read the first |P| characters from the text string T into A and build the histogram appropriately. You can slide the window one step forward and update A in O(1) associative array operations by decrementing the frequency associated with the first character in the window, then incrementing the frequency associated with the new character that has slid into the window.

If the histograms of the current window and the pattern window are very different, then you should be able to compare them rather quickly. Specifically, let's say that your alphabet is Σ. In the worst case, comparing two histograms would take time O(|Σ|), since you'd have to check each character/frequency pair in the histogram A with the reference histogram. In the best case, though, you'd immediately find a character that causes a mismatch between A and the reference histogram, so you would not need to look at many characters overall.

In theory the worst-case runtime for this approach is O(|T||Σ| + |P|), since you have to do O(|P|) work to build the initial histograms, then have to do worst-case O(|Σ|) work per character in T. However, I'd expect that this is probably a lot faster in practice.
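A minimal Python sketch of the sliding-window histogram, using `collections.Counter` as the associative array A. The function name is an assumption for illustration.

```python
from collections import Counter


def find_anagram_histogram(text: str, pattern: str) -> int:
    """Return the 0-based start of the first anagram of pattern, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1

    reference = Counter(pattern)   # histogram of P
    window = Counter(text[:m])     # histogram A of the current window

    for i in range(n - m + 1):
        if window == reference:    # up to O(|alphabet|) per comparison
            return i
        if i + m < n:
            leaving = text[i]
            window[leaving] -= 1
            if window[leaving] == 0:
                # Drop zero entries so the equality test stays exact.
                del window[leaving]
            window[text[i + m]] += 1
    return -1
```

Unlike the hash-based approaches, equal histograms are proof of an anagram, so no separate verification pass is needed.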

Hope this helps!

templatetypedef
  • aha,, I'll test this algorithm, but, but I think (someone told me) there is a much faster one, and with only constant memory ! – Rami Jarrar Feb 03 '13 at 00:22
  • An optimization you could do is, each time you compare the histograms, keep track of how many characters you're 'missing'. Then you can move forward that many characters (still updating the histogram) before checking again. – Bwmat Feb 06 '13 at 22:30
  • 2
    This is actually a much better algorithm core than the one selected as the solution. The trick that would make it really fast would be to keep a set of histogram elements that currently do not match the pattern histogram. You can adjust this set incrementally in constant time for each move of the window. When the set is empty, you have a match! I hate it when the superior solution does not win. – Gene Feb 09 '13 at 00:28
  • 2
    For those that read only the top-voted answer; [Dave's answer](http://stackoverflow.com/a/14836347/2417578) improves on this by eliminating the comparison by initialising the histogram to the negative image of the search string and detecting when the sliding window zeroes that. – sh1 Mar 05 '17 at 03:07
3
  1. Create an array letter_counts of 26 ints (set to zero) and a variable missing_count to hold the count of missing letters.

  2. For each letter in the substring, decrement the associated int of letter_counts by 1, and increment missing_count by 1 (so missing_count will end up equal to the size of the substring).

  3. Say the substring is of size k. For each of the first k letters of the string, increment the associated int of letter_counts by 1. If after incrementing, the value is <= 0, decrement missing_count by 1.

  4. Now, we 'roll forward' along the string like this:
     a. Remove the letter closest to the start of the window: decrement the associated member of letter_counts. If after decrementing we have an int < 0, then increment missing_count by 1.
     b. Add the first letter of the string beyond the window: increment the associated member of letter_counts. If after incrementing we have an int <= 0, then decrement missing_count by 1.

If at any point missing_count == 0, we have an anagram of the search string in our window.

The invariant we maintain is that missing_count holds the number of letters in our substring that aren't in our window. When this is zero, the letters in our window are an exact match for the letters in our substring.

This is Theta(n) -- linear time, since each letter of the string enters the window once and leaves it at most once, with O(1) work each time.
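The steps above can be sketched in Python roughly as follows. As an assumption, this uses a dict instead of a fixed array of 26 ints so that any characters work; the function and helper names are hypothetical.

```python
def find_anagram_missing_count(text: str, pattern: str) -> int:
    """Return the 0-based start of the first anagram of pattern, or -1."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return -1

    # Steps 1-2: start from the "negative image" of the pattern.
    letter_counts = {}
    for ch in pattern:
        letter_counts[ch] = letter_counts.get(ch, 0) - 1
    missing_count = k

    def add(ch: str) -> None:
        nonlocal missing_count
        letter_counts[ch] = letter_counts.get(ch, 0) + 1
        if letter_counts[ch] <= 0:     # this letter was still needed
            missing_count -= 1

    def remove(ch: str) -> None:
        nonlocal missing_count
        letter_counts[ch] = letter_counts.get(ch, 0) - 1
        if letter_counts[ch] < 0:      # we gave up a needed letter
            missing_count += 1

    # Step 3: load the first k letters.
    for ch in text[:k]:
        add(ch)
    if missing_count == 0:
        return 0

    # Step 4: roll forward one letter at a time.
    for i in range(k, n):
        remove(text[i - k])
        add(text[i])
        if missing_count == 0:
            return i - k + 1
    return -1
```

Each step does a constant amount of work, and no histogram comparison is ever needed.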

--- edit ---

letter_counts only needs to store the distinct letters of the substring, and only needs to hold integers as big as the size of the substring (signed). Thus the memory usage is linear in the size of the substring, but constant in the size of the string.

Dave
0

Might be silly of me to suggest this, but one alternative could be to break down the two strings into arrays and then recursively search them character-by-character.

To avoid duplicate character matches, if a character is found in the text array, its respective array index is removed, effectively shrinking time-to-completed-array-scan with each match while at the same time ensuring that a text containing 2x 'B' won't match a pattern with 3x 'B'.

For added performance, you could scan both strings before doing the character-by-character count, and make a list of which alphabetic letters exist in each string, then compare those lists to see if there are any discrepancies (for example, trying to find the letter "z" in "apple"). If there are, mark the string as "Anagram not possible".
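The pre-check described above could look something like this in Python. It is only a necessary condition, not a full anagram test, and the function name is an assumption.

```python
def anagram_possible(text: str, pattern: str) -> bool:
    """Cheap rejection test: False means no anagram can exist in text."""
    # Every distinct letter of the pattern must occur somewhere in the text,
    # and the pattern must not be longer than the text.
    return len(pattern) <= len(text) and set(pattern) <= set(text)
```

A True result still requires a real search, since letter counts and positions are not checked here.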

Desty Nova