2

I want to hash strings of variable length (6-60 characters long) to 32-bit signed integers in order to save disk space in PostgreSQL.

I don't want to encrypt any data, and the hashing function needs to be reproducible and callable from Python. The problem is that I can only find Algorithms that produce unsigned integers (like CityHash), which therefore produce values of up to 2^32 instead of 2^31.

This is what I have thus far:

import math
from cityhash import CityHash32

string_ = "ALPDAKQKWTGDR"
hashed_string = CityHash32(string_)
print(hashed_string, len(str(hashed_string)))
max_ = int(math.pow(2, 31) - 1)
print(hashed_string > max_)
Asclepius
  • 57,944
  • 17
  • 167
  • 143
tryptofame
  • 352
  • 2
  • 7
  • 18
  • 3
    Just interpret the unsigned integer as a signed integer? – Ry- Jul 24 '17 at 11:37
  • @Ryan The result has to be between -2147483648 and +2147483647. How will your suggestion help? Can you please give me an example. – tryptofame Jul 24 '17 at 12:07
  • 2
    Subtract 2147483648. – Ry- Jul 24 '17 at 12:16
  • @Ryan Looks like a simple solution. I guess it would lead to the same amount of collisions as the original value and less than truncating a longer hash? – tryptofame Jul 24 '17 at 12:59
  • Well, it gets you 32 bits. There are other points to consider, though. What do you do with the strings that lets you throw them away and only store their hashes? – Ry- Jul 24 '17 at 22:01
  • @Ryan You solved it. Thank you!! But didn't post it as an answer. I could close the question if you did. The idea is to save disk space by storing integers and not strings, and hash strings I want to query on the fly. – tryptofame Jul 25 '17 at 08:42
  • Though I realised 32 bit will not be sufficient, 64 should be. – tryptofame Jul 25 '17 at 11:29
  • Erm, how are you handling collisions, then? If that was the reason for changing to 64-bit, you will probably want to both (a) use a cryptographic hash and (b) read up on https://en.wikipedia.org/wiki/Birthday_attack. – Ry- Jul 25 '17 at 15:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150107/discussion-between-tryptofame-and-ryan). – tryptofame Jul 25 '17 at 16:03

3 Answers3

2

Ryan answered the question in the comments. Simply subtract 2147483648 (= 2^31) from the hash result.

CityHash32(string_) - math.pow(2, 31)

or

CityHash64(string_) - math.pow(2, 63)

Ryan also mentioned that using SHA-512 and truncating the result to the desired number of digits will lead to less collisions than the method above.

tryptofame
  • 352
  • 2
  • 7
  • 18
0
create or replace function int_hash(s text)
returns int as $$

    select ('x' || left(md5(s), 8))::bit(32)::int
    ;
$$ language sql immutable;

select int_hash('1');
  int_hash  
------------
 -993377736
Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • 1
    @Scoots The way it is broken has nothing to do with the question other than that with 4 bytes only the collision rate will be high. – Clodoaldo Neto Jul 24 '17 at 15:06
0

I typically wouldn't use a 32-bit hash except for very low cardinality because it of course risks collisions a lot more than a 64-bit hash would. Databases readily support bigint 8-byte (64-bit) integers. Consider this table for some hash collision probabilities.

If you're using Python ≥3.6, you absolutely don't need to use a third-party package for this, and you don't need to subtract an offset either, since you can directly generate a signed 64-bit or variable bit-length hash utilizing shake_128:

import hashlib
from typing import Dict, List


class Int8Hash:

    BYTES = 8
    BITS = BYTES * 8
    BITS_MINUS1 = BITS - 1
    MIN = -(2**BITS_MINUS1)
    MAX = 2**BITS_MINUS1 - 1

    @classmethod
    def as_dict(cls, texts: List[str]) -> Dict[int, str]:
        return {cls.as_int(text): text for text in texts}  # Intentionally reversed.

    @classmethod
    def as_int(cls, text: str) -> int:
        seed = text.encode()
        hash_digest = hashlib.shake_128(seed).digest(cls.BYTES)
        hash_int = int.from_bytes(hash_digest, byteorder='big', signed=True)
        assert cls.MIN <= hash_int <= cls.MAX
        return hash_int

    @classmethod
    def as_list(cls, texts: List[str]) -> List[int]:
        return [cls.as_int(text) for text in texts]

Usage:

>>> Int8Hash.as_int('abc')
6377388639837011804
>>> Int8Hash.as_int('xyz')
-1670574255735062145

>>> Int8Hash.as_list(['p', 'q'])
[-539261407052670282, -8666947431442270955]
>>> Int8Hash.as_dict(['i', 'j'])
{8695440610821005873: 'i', 6981288559557589494: 'j'}

To generate a 32-bit hash instead, set Int8Hash.BYTES to 4.

Disclaimer: I have not written a statistical unit test to verify that this implementation returns uniformly distributed integers.

Asclepius
  • 57,944
  • 17
  • 167
  • 143