
I am looking for a hash function whose outputs are well distributed across a given range of values. The inputs will be IP addresses.

We store IP addresses as row keys in HBase. My understanding is that HBase uses the row key to distribute rows across region servers, so if the keys derived from the IP addresses are well distributed, read/write performance should improve.

nitish712
Tim Raynor

2 Answers


You have to handle both IPv4 and IPv6. Fortunately, each can be represented as an integer: 32 bits for IPv4 and 128 bits for IPv6.

You can find example code for converting an IP address to a long (or an array of longs for IPv6) in this question.

Once the IPs are converted to numbers, it is fairly simple to build an evenly distributed function of the values. The simplest approach is to take the remainder of division by some number (e.g. the number of regions).
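As a minimal sketch of the conversion-plus-remainder idea for the IPv4 case, something like the following could work. The function names `ipv4_to_u32` and `ip_bucket` are hypothetical, and the code assumes plain dotted-quad input; IPv6 would need a 128-bit representation and is not covered here.

```c
#include <stdint.h>
#include <stdio.h>

/* Parse dotted-quad IPv4 text into a 32-bit integer, first octet in the
 * most significant byte. Returns 0 on success, -1 on malformed input.
 * Hypothetical helper, not taken from any library. */
int ipv4_to_u32(const char *text, uint32_t *out) {
    unsigned a, b, c, d;
    char extra;
    /* The trailing %c catches garbage after the fourth octet. */
    if (sscanf(text, "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
    return 0;
}

/* Map the numeric address to a bucket by taking the remainder of
 * division by the bucket count (e.g. the number of regions). */
uint32_t ip_bucket(uint32_t ip, uint32_t num_buckets) {
    return num_buckets ? ip % num_buckets : 0;
}
```

For example, `192.168.1.10` becomes `0xC0A8010A`, and with 16 buckets it falls into bucket 10. Note that the remainder only looks at the low-order part of the address, so how even the result is depends on how the addresses themselves are spread.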

Community
rnov

I worked on this problem a long time ago. An interesting fact: simple hash functions do not provide a good pseudo-random distribution. A good distribution can be obtained only with non-linear or cryptographic hashes, such as MD5 or SHA-1. In our solution we used a custom non-linear hash like the following:

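#include <stdint.h>

// Substitution box: the non-linear transform.
// Must be filled with random values before use.
uint32_t s_box[256];

uint32_t ip_hash(const uint8_t *ip, uint8_t len) {
  uint32_t rc = 0x1f351f35;
  // Post-decrement so that all len bytes are processed.
  while (len--) {
    uint8_t x = *ip++;
    // Rotate the state left by 7 bits, then add an S-box entry selected
    // by the input byte and the low byte of the current state.
    rc = ((rc << 7) | (rc >> (32 - 7))) + (s_box[x ^ (uint8_t)rc] ^ x);
  }
  // Fold the high half into the low half.
  return rc ^ (rc >> 16);
}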
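To make the S-box requirement concrete, here is a hedged sketch of one way to fill the table and map an address to a region. It repeats the hash with a loop that consumes all `len` bytes. The helpers `sbox_init` and `ip_region`, the use of an xorshift32 generator, and the fixed seed are assumptions for illustration, not part of the original solution. A fixed seed is used so that every process computes the same hash for the same IP, which matters if the result ends up in a row key.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t s_box[256];

/* One step of a xorshift32 generator; the state must be nonzero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill the S-box deterministically from a seed (hypothetical helper). */
void sbox_init(uint32_t seed) {
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < 256; i++)
        s_box[i] = xorshift32(&state);
}

/* The S-box hash over the raw address bytes (4 for IPv4, 16 for IPv6). */
uint32_t ip_hash(const uint8_t *ip, uint8_t len) {
    uint32_t rc = 0x1f351f35;
    while (len--) {
        uint8_t x = *ip++;
        rc = ((rc << 7) | (rc >> (32 - 7))) + (s_box[x ^ (uint8_t)rc] ^ x);
    }
    return rc ^ (rc >> 16);
}

/* Reduce the hash to a region index (hypothetical helper). */
uint32_t ip_region(const uint8_t *ip, uint8_t len, uint32_t num_regions) {
    return num_regions ? ip_hash(ip, len) % num_regions : 0;
}
```

A caller would run `sbox_init` once at startup with an agreed seed, then call `ip_region(addr_bytes, 4, num_regions)` for an IPv4 address or pass 16 bytes for IPv6.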
olegarch