I will have C* tables that will be very wide. To prevent them to become too wide I have encountered a strategy that could suit me well. It was presented in this video. Bucket Your Partitions Wisely
The good thing with this strategy is that there is no need for a "look-up-table" (it is fast), the bad part is that one needs to know the max amount of buckets and eventually end up with no more buckets to use (not scalable). I know my max bucket size so I will try this.
By calculating a hash from the tables primary keys this can be used as a bucket part together with the rest of the primary keys.
I have come up with the following method to be sure (I think?) that the hash always will be the same for a specific primary key.
Using Guava Hashing:
public static String bucket(List<String> primKeyParts, int maxBuckets) {
StringBuilder combinedHashString = new StringBuilder();
primKeyParts.forEach(part ->{
combinedHashString.append(
String.valueOf(
Hashing.consistentHash(Hashing.sha512()
.hashBytes(part.getBytes()), maxBuckets)
)
);
});
return combinedHashString.toString();
}
The reason I use sha512 is to be able to have strings with max characters of 256 (512 bit) otherwise the result will never be the same (as it seems according to my tests).
I am far from being a hashing guru, hence I'm asking the following questions.
Requirement: Between different JVM executions on different nodes/machines the result should always be the same for a given Cassandra primary key?
- Can I rely on the mentioned method to do the job?
- Is there a better solution of hashing large strings so they always will produce the same result for a given string?
- Do I always need to hash from string or could there be a better way of doing this for a C* primary key and always produce same result?
Please, I don't want to discuss data modeling for a specific table, I just want to have a bucket strategy.
EDIT:
Elaborated further and came up with this so the length of string can be arbitrary. What do you say about this one?
public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {
List<HashCode> hashCodes = new ArrayList();
for(String part : primKeyParts) {
hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
};
return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}