19

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.

h(abcdef) = h(bcdefa) = h(cdefab) etc

Uses for this include scalable algorithms which can check n strings against each other to see whether some are rotations of others.

I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?

It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...

– Phil H

8 Answers

9

I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".

So:

"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
– Jon Skeet
7

Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.

I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.

static int Hash(string s)
{
    int H = 0;

    if (s.Length > 0)
    {
        //any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;

        //an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        int[] c = new int[0x100]; //one entry per character code 0x00 to 0xFF

        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }

        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();

        //for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);

        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}

The output is now:

7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71

First version (which isn't complete): use XOR, which is commutative (order doesn't matter), and another little trick involving coprimes to combine ordered hashes of pairs of adjacent letters in the string. Here is an example in C#:

static int Hash(char[] s)
{
    //any arbitrary coprime numbers
    const int a = 7, b = 13;

    int H = 0;

    if (s.Length > 0)
    {
        //for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());

        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}

The output is:

4587590
4587590
4587590
7077996
– Michael Petito
  • Also, as for checking n strings against each other, you might consider feeding K versions of this hash algorithm (perhaps using different coprimes) into a bloom filter of sufficient size for n. – Michael Petito Apr 06 '10 at 13:52
  • It's fairly easy to come up with collisions here. For example, "a0a0" and "1010" (or indeed anything similar) will come up with a hash of 0, and "blocks" with a common boundary confuse it: "0abc0def0ghi" and "0def0abc0ghi" have the same hash. Nice idea though. – Jon Skeet Apr 06 '10 at 14:26
  • @Jon Skeet Yes, you are absolutely right. I wonder if there is a simple modification that could be made to handle such input... – Michael Petito Apr 06 '10 at 16:39
  • Is there any equivalent for this when dealing with bit-strings? Generating coprimes of a single bit doesn't really work out. – Jeremy Salwen Sep 29 '11 at 03:51
  • @Jeremy: Yes, I believe you could handle a bit string by considering a sliding window of n bits and index into the coprime array using that value. Just as the example here considers each pair of characters at a time (n=2), you could consider a pair of substring of length n from [i-1, i-1+n) and [i, i+n). – Michael Petito Sep 29 '11 at 05:14
  • Hi Michael, I tried your code in C++, but didn't get quite the same result. My code returns the same hash value for the last two lines. Here is my code: [gist](https://gist.github.com/VirgilMing/f0b40a41e8482d8cc8e1bfeafc7f51a4). Is there anything amiss in my code? Or will the specifics of `GetHashCode()` influence the result? – Virgil Ming Nov 15 '17 at 16:26
2

You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.

– Chris Lercher
1

I am sure that you could find a function that can generate the same hash regardless of character position in the input; however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean: how do you minimize this risk?)

You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
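
As one concrete form of such a check, a minimal C# sketch (the IsRotationOf name is just for this illustration) using the standard doubled-string containment test, which confirms exactly whether two strings that hash alike really are rotations of each other:

static bool IsRotationOf(string a, string b)
{
    //illustrative sketch: b is a rotation of a exactly when the lengths match and b occurs inside a + a
    return a.Length == b.Length && (a + a).Contains(b);
}

The check only needs to run on hash matches, so its cost is usually negligible.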

– PatrikAkerstrand
1

Here's an implementation using LINQ:

public string ToCanonicalOrder(string input)
{
    // the smallest character marks the candidate starting positions
    char first = input.OrderBy(x => x).First();

    // every rotation of input is a length-n substring of input + input
    string doubledForRotation = input + input;

    // walk the positions of 'first' within the first copy, extract the
    // rotation starting at each one, and keep the lexicographically smallest
    string canonicalOrder
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();

    return canonicalOrder;
}

assuming generic generator extension method:

public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}

sample usage:

var sequences = new[]
    {
        "abcdef", "bcdefa", "cdefab", 
        "defabc", "efabcd", "fabcde",
        "abaac", "cabcab"
    };
foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}

output:

abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc

then call .GetHashCode() on the result if necessary.

sample usage if ToCanonicalOrder() is converted to an extension method:

sequence.ToCanonicalOrder().GetHashCode();
– Handcraftsman
1

One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.

More formally, consider

for(int i=0; i<string.length; i++) {
  result^=string.rotatedBy(i).hashCode();
}

Where you could replace the ^= with any other commutative operation.

More concretely, consider the input

"abcd"

to get the hash we take

hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").

As we can see, starting from any of these rotations only changes the order in which the XOR is evaluated, which doesn't change its value.
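
A minimal C# sketch of this meta-hash (the RotationInvariantHash name is just for the sketch, and the built-in string.GetHashCode() stands in for whatever base hash you prefer); since every rotation is hashed separately, the cost is O(n^2) overall:

static int RotationInvariantHash(string s)
{
    int result = 0;

    //XOR the hash of every rotation; XOR is commutative and associative,
    //so the rotation you start from does not affect the combined value
    for (int i = 0; i < s.Length; i++)
    {
        string rotation = s.Substring(i) + s.Substring(0, i);
        result ^= rotation.GetHashCode();
    }

    //note: on recent .NET runtimes string.GetHashCode() is randomized per process,
    //so these values only agree within a single run
    return result;
}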

– Jeremy Salwen
  • Elegant, but I'm suspicious that this may have a high number of collisions with strings that have permutations of the same elements. – SigmaX Nov 22 '13 at 19:25
  • Well every call to the base hash function will pass an argument which is unique to the string and its rotations, so assuming you have a cryptographic hash function, the output should be random. – Jeremy Salwen Dec 06 '13 at 21:20
  • Ah yes, I had misread it. Thought you were ORing the hashcodes of each character, rather than each "rotatedBy". – SigmaX Dec 10 '13 at 00:33
0

I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.

If you can represent the string as a matrix of associations, where each letter is marked with the letter that follows it (wrapping around at the end), then abcdef would look like

  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x

Any rotation of the string would produce the same set of associations, so it would be trivial to compare those matrices.
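
An illustrative C# sketch of this matrix idea (the helper names are just for the sketch, and System.Collections.Generic is assumed to be in scope): the matrix is just the set of cyclic successor pairs, which can be built in O(n) and compared directly:

static HashSet<(char, char)> SuccessorPairs(string s)
{
    //each entry records that s[i] is followed (cyclically) by s[(i + 1) % n]
    var pairs = new HashSet<(char, char)>();
    for (int i = 0; i < s.Length; i++)
    {
        pairs.Add((s[i], s[(i + 1) % s.Length]));
    }
    return pairs;
}

static bool SameAssociations(string a, string b)
{
    //rotations of the same string always produce equal sets; as noted in the
    //comments below, duplicates (e.g. "ab" vs "abab") can collide, so a positive
    //result still needs an exact check
    return SuccessorPairs(a).SetEquals(SuccessorPairs(b));
}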


Another, quicker trick would be to rotate the string so that the "first" (i.e. lowest) letter comes first. Starting from the same point, strings that are rotations of each other become identical.

Here is some Ruby code:

def normalize_string(string)
  myarray = string.split(//)            # split into an array
  index   = myarray.index(myarray.min)  # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift)         # move stuff from the front to the back
  end
  return myarray.join
end

p normalize_string('abcdef').eql? normalize_string('defabc') # should return true
– Fotios
  • @Fotios: Would the first solution really work if the elements aren't unique? "ab" and "abab" would produce the same matrix, if I understand it correctly? It may still be good enough for a hash function! – Chris Lercher Apr 06 '10 at 14:48
  • Yea, it probably would not work with multiples like that, but there might be ways to work around that. – Fotios Apr 06 '10 at 15:05
0

Maybe use a rolling hash for each offset (Rabin–Karp like) and return the minimum hash value? There could be collisions though.
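
A hedged C# sketch of that idea (the MinRotationRollingHash name, base 257 and modulus 1000000007 are arbitrary choices for the sketch): a polynomial rolling hash is slid across the doubled string, so each rotation's hash comes from the previous one in O(1) and the minimum over all rotations is returned in O(n) total:

static long MinRotationRollingHash(string s)
{
    if (s.Length == 0) return 0;

    const long B = 257;          //arbitrary base
    const long M = 1000000007;   //arbitrary large prime modulus
    int n = s.Length;

    //hash of the original string, plus B^(n-1) mod M for removing the leading character
    long hash = 0, power = 1;
    for (int i = 0; i < n; i++)
    {
        hash = (hash * B + s[i]) % M;
        if (i < n - 1) power = power * B % M;
    }

    //slide the window across s + s: drop s[i] from the front, append it at the back
    long min = hash;
    for (int i = 0; i < n - 1; i++)
    {
        hash = ((hash - s[i] * power % M + M) % M * B + s[i]) % M;
        if (hash < min) min = hash;
    }

    return min;
}

Rotations of the same string then share the same minimum; as the answer says, different strings can still collide, so a match still needs an exact comparison.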

– Maria Sakharova