How to generate a string hash, with a custom alphabet and length, while minimizing collisions

Question

Assume I need to generate a hash for a string, where the hash can be at most N characters long and must use a given alphabet, e.g. all alphanumeric characters [a-zA-Z0-9] plus the symbols !?-=.

One trivial approach would be to use a well-known hash algorithm, such as SHA-1, and then truncate the output. Assuming N is 10 and the alphabet is a superset of hex, here is a trivial solution in JavaScript:

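var crypto = require('crypto')
var shasum = crypto.createHash('sha1')
shasum.update('foo')
// Full 160-bit digest as 40 hex characters
var hash = shasum.digest('hex') // => "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33"
// Keep the first 10 characters (slice replaces the deprecated substr)
var shortHash = hash.slice(0, 10) // => "0beec7b5ea"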

While this respects the conditions of length and alphabet, it is clearly not optimal because it does not exploit the (much) larger hashing space that the full alphabet above could allow.
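
To make the size difference concrete, here is a small sketch that compares the two hashing spaces in bits. It assumes the alphabet has 66 symbols (the 62 alphanumerics plus !?-=, reading the final period as punctuation), and hashSpaceBits is an illustrative helper name, not a library function.

```javascript
// Size of the space of all strings of `length` characters over an
// alphabet of `alphabetSize` symbols, expressed in bits.
function hashSpaceBits(alphabetSize, length) {
  return length * Math.log2(alphabetSize);
}

// 10 hex characters: 4 bits per character, 40 bits in total.
const hexBits = hashSpaceBits(16, 10);

// 10 characters over the assumed 66-symbol alphabet: a little over 60 bits.
const fullBits = hashSpaceBits(66, 10);

console.log(hexBits, fullBits);
```

Under that assumption, the truncated hex hash uses only about two thirds of the bits that a 10-character string over the full alphabet could carry.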

Moreover, is the increase in collision probability of a truncated SHA-1 hash actually proportional to the reduction of the hashing space, or is it larger than that (e.g. caused by internal correlations between bits)?
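
For reference, the "proportional" baseline can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2S)), where n is the number of hashed items and S is the size of the hashing space. The sketch below assumes the truncated hash behaves like a uniformly random value, which is exactly the idealization this question asks about. collisionProbability is a hypothetical helper name, and the alphabet size of 66 is an assumption.

```javascript
// Birthday approximation for the probability of at least one collision
// among `itemCount` hashes drawn uniformly from alphabetSize^length values.
// Assumption: the truncated hash is uniformly distributed over that space.
function collisionProbability(itemCount, alphabetSize, length) {
  const space = Math.pow(alphabetSize, length);
  const exponent = (itemCount * (itemCount - 1)) / (2 * space);
  // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
  return -Math.expm1(-exponent);
}

// One million items, 10 characters each.
const pHex = collisionProbability(1e6, 16, 10);  // hex alphabet
const pFull = collisionProbability(1e6, 66, 10); // assumed 66-symbol alphabet

console.log(pHex, pFull);
```

Any measured collision rate noticeably above these estimates would point to structure in the truncated bits rather than to the smaller space alone.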

Disclaimer: This is not intended for security-critical applications, and I am aware of the increased collision probability. The goal of the question is purely to understand whether there is a theoretically optimal way of achieving what is described above.

You are conflating two distinct issues: hashing and encoding/decoding. The hash functions we are most familiar with operate on variable-length byte (or bit) sequences and return a fixed number of bytes (or bits) as output. You can then encode the output using whatever alphabet is appropriate for your specific requirements. — President James K. Polk, Jun 09 '23 at 18:25
As you said, well-known algorithms output a fixed number of bytes. So the question is how to get a different number of output bytes while still relying on a strong algorithm. Are there other hashing algorithms that are designed for this? — Marco Ancona, Jun 09 '23 at 19:50

score 1 · Accepted Answer · answered Jun 29 '23 at 12:29

After some research, here is the solution I've landed on in Node, using SHA-256 and Base-x.

import crypto from "crypto";
import basex from "base-x";

const base62 = basex(
  "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
);

const DEFAULT_LENGTH = 15;

function shortHash(input: string, precision = DEFAULT_LENGTH) {
  return base62
    .encode(crypto.createHash("sha256").update(input).digest())
    .slice(0, precision);
}

How it works and assumptions

First, the input is hashed using the Node crypto module. Here I use sha256 but other algorithms are possible too.
Then, the resulting buffer is encoded to base 62. Here the assumption is that the desired alphabet is the alphanumeric characters (62 symbols). For an alphabet with a different number of symbols, a different base would need to be chosen.
Finally, the result is sliced to the desired length. Here we rely on the assumption that any substring of the encoded sha256 output has the same entropy. However, I could not find theoretical results showing that this is optimal.
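
As a possible variation on the steps above, the digest can be reduced to exactly N characters without base-x by treating it as one big integer and taking it modulo alphabetSize^N. This is only a sketch under stated assumptions: fixedLengthHash is a hypothetical helper, the 66-symbol alphabet is assumed from the question, and the near-uniformity argument relies on SHA-256 output behaving like a uniformly random 256-bit value.

```javascript
const crypto = require("node:crypto");

// Assumed alphabet: alphanumerics plus the four symbols from the question.
const ALPHABET =
  "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?-=";

// Hypothetical helper, not part of Node or base-x.
function fixedLengthHash(input, length = 10, alphabet = ALPHABET) {
  const base = BigInt(alphabet.length);
  // Interpret the 256-bit digest as a single unsigned integer.
  const digestHex = crypto.createHash("sha256").update(input).digest("hex");
  // Reduce into the target space of base^length possible outputs.
  let value = BigInt("0x" + digestHex) % (base ** BigInt(length));
  // Extract digits from least significant to most significant, prepending
  // each one, so the result always has exactly `length` characters.
  let out = "";
  for (let i = 0; i < length; i++) {
    out = alphabet[Number(value % base)] + out;
    value /= base;
  }
  return out;
}

console.log(fixedLengthHash("foo"));
```

As long as alphabetSize^N is far smaller than 2^256 (66^10 is roughly 2^60), the bias introduced by the modulo step is negligible, and every output has the same length, which is not guaranteed when slicing a variable-length encoding.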
