Java: how to hash a string with low collision probability and specify the characters allowed in the output

Question

Is there any way to hash a string and specify the characters allowed in the output, or is there a better approach to avoid collisions when producing a hash 8 characters in length?

I am seeing a collision with my current hashing method (see the example implementation below). I am currently using crc32 from https://guava.dev/releases/20.0/api/docs/com/google/common/hash/Hashing.html

The hashes produced are alphanumeric and 8 characters in length. I need to keep the 8-character length (I am not storing passwords). Is there a way to specify an "alphabet" of allowed output characters for a hashing function?

For example, to allow a-z and 0-9 plus a set of extra characters such as _, $ and -. The added characters will need to be URI friendly.

This would allow me to decrease the possibility of collisions occurring.

The hash output will be stored in a cache for a maximum of 60 days, so collisions occurring after that period will have no effect.

Example code for the current approach:

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;

public class Test {
        private static final String SALT = "4767c3a6-73bc-11ec-90d6-0242ac120003";

        public static void main( String[] args )
        {
            // actual strings causing collisions removed as have to redact some data
            String string1 = "myStringOne";
            String string2 = "myStringTwo";

            System.out.println( "string1:" + string1);
            System.out.println( "string1 hashed:" + doHash(string1, SALT));
            System.out.println( "string2:" + string2);
            System.out.println( "string2 hash:" + doHash(string2, SALT));
        }

        private static String doHash(String keyValue, String salt){
            HashFunction func = Hashing.crc32();
            Hasher hasher = func.newHasher();
            hasher.putUnencodedChars(keyValue);
            hasher.putUnencodedChars(salt);
            return hasher.hash().toString();
        }
}

Functionality of the code / problem statement, using a key store DB: a user requests a resource, and a hash is made of the user details and the requested resource. If the resulting ID is already present, that item is returned from the DB.

Otherwise, processing is performed on the resource and the result is stored in the DB, with the hash output as its ID.

The cache is purged periodically.

Questions: Is there a way to specify the alphabet the hash is allowed to use in its output? I checked the docs but do not see an approach: https://guava.dev/releases/20.0/api/docs/com/google/common/hash/Hashing.html

Or is there an alternative approach that would be recommended, e.g. generating a longer hash and taking a subset?
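To make the "longer hash, custom alphabet" idea concrete, here is a minimal sketch of one way it could look. It does not configure the hash function itself; it computes a longer digest and re-encodes part of it in a chosen alphabet. Several things here are assumptions and not part of the original code: SHA-256 from the JDK is swapped in for Guava's crc32 to keep the example dependency-free, the class name `AlphabetHash` and the `hash` helper are hypothetical, and the 40-symbol `ALPHABET` (a-z, 0-9 and the four unreserved marks from RFC 3986) is only an example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class AlphabetHash {
    // Hypothetical alphabet: a-z, 0-9 and the URI "unreserved" marks - . _ ~
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Hashes keyValue + salt with SHA-256 (an assumption, not the original crc32)
    // and returns `length` characters drawn only from ALPHABET.
    static String hash(String keyValue, String salt, int length) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(keyValue.getBytes(StandardCharsets.UTF_8));
            md.update(salt.getBytes(StandardCharsets.UTF_8));

            // Treat the 256-bit digest as one non-negative integer.
            BigInteger value = new BigInteger(1, md.digest());
            BigInteger base = BigInteger.valueOf(ALPHABET.length());

            // Peel off `length` digits in the alphabet's base (least significant first).
            StringBuilder out = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
                out.append(ALPHABET.charAt(quotientAndRemainder[1].intValue()));
                value = quotientAndRemainder[0];
            }
            return out.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory for Java platform implementations.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String salt = "4767c3a6-73bc-11ec-90d6-0242ac120003";
        System.out.println("string1 hashed: " + hash("myStringOne", salt, 8));
        System.out.println("string2 hashed: " + hash("myStringTwo", salt, 8));
    }
}
```

With 40 symbols and 8 characters there are 40^8 (about 6.6 × 10^12) possible outputs, roughly 42.6 bits, compared with 32 bits for the 8 hex characters of a CRC32. A 64-symbol alphabet would give 48 bits. Either way collisions become less likely but remain possible, so a lookup probably still needs to handle the case where two different inputs map to the same ID.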

You could try using the printable ASCII character set, from space to ~, which will give you the maximum range of printable characters within an 8 bit limit. — rossum, Jan 12 '22 at 18:21
@rossum thanks for the reply. The output of the hash would need to be URI safe, so I cannot include all of the characters from the printable ASCII character set. I will update the question to reflect this. — tpngr999, Jan 12 '22 at 18:29
Your premise seems weird and wrong, and I'm not understanding the context. If you want to eliminate collisions and store keys for some index (like unique keys for a web cookie), use a [Universal Unique Identifier.](https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/UUID.html) If you want hashing with no collisions, use a [perfect hash.](https://en.wikipedia.org/wiki/Perfect_hash_function) But your problem is better suited to the UUID. — markspace, Jan 12 '22 at 18:31
@markspace thanks for your reply. I cannot use a UUID, as I need to be able to reproduce the value to avoid excess compute being done (I added a problem statement above). My understanding is that perfect hashing requires the data to be close to static. Part of the input is a file path, and files are continually added, so I do not think this will work for me. — tpngr999, Jan 12 '22 at 18:57
I don't quite understand the use case either. In general, a string of 8 characters from a 64 symbol alphabet like urlsafe base64 can encode 6 bits per character for a total of 48 bits or 6 bytes. There are any number of ways to generate a 6 byte hash depending on the quality and speed trade-offs you need to make. Look at the Hashing class: you could take the low-order 6 bytes of `Hashing.sipHash24()` and then base64 encode those 6 bytes. — President James K. Polk, Jan 12 '22 at 22:44
... using base64url which is of course URL safe. But regardless, you'd still only have 48 bits of output, and for collisions you need to take the birthday bound into account. Generally I'd say that 1 in 2^64 is negligible, but you are already below that. It depends on the number of values how much. Generally you can approx. divide by the number of entries, e.g. if you've got a million entries then you'd have one in 2^48 / 2^20 = 2^28 chance of collision. Look [here](https://en.wikipedia.org/wiki/Birthday_attack#Simple_approximation) for more precise calculations. — Maarten Bodewes, Jan 13 '22 at 01:26
Are you allowed to use [percent encoding](https://en.wikipedia.org/wiki/Percent-encoding) in the URL? It would free up a few characters or it could even encode [binary data](https://en.wikipedia.org/wiki/Percent-encoding#Binary_data). — Maarten Bodewes, Jan 13 '22 at 01:32
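The birthday bound mentioned in the comments can be made concrete with a short calculation. The sketch below uses the simple approximation from the linked Wikipedia section, p ≈ 1 - exp(-n(n-1) / (2N)), for the chance that at least two of n entries collide in a space of N = 2^bits values. It assumes the hash output is uniformly distributed, and the class and method names are hypothetical. This estimates the chance of any collision among all entries, which is a different quantity from the per-entry estimate given in the comment.

```java
class BirthdayBound {
    // Approximate probability that at least two of n uniformly random values
    // from a space of 2^bits values collide: 1 - exp(-n(n-1) / 2^(bits+1)).
    static double collisionProbability(long n, double bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (n - 1)) / (2.0 * space);
        // -expm1(x) equals 1 - e^x and stays accurate when x is close to zero.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        long entries = 1_000_000L; // example entry count taken from the comment
        for (double bits : new double[] {32, 48, 64}) {
            System.out.printf("%.0f bits, %d entries: %.3e%n",
                    bits, entries, collisionProbability(entries, bits));
        }
    }
}
```

Under these assumptions, a million live entries make a collision almost certain at 32 bits and give roughly a 0.2% chance at 48 bits, so the number of entries held in the cache during the 60-day window matters as much as the output alphabet.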
