Guid substring collision probability?

Question

How would one calculate the probability that two Guids start with the same N characters?

Situation:

We are considering using the first n characters from a guid as a cosmosdb collection partition key. We don't want to use the entire guid because we don't want every document to be in its own logical partition, but we also probably don't want to just use the first character of a guid as the partition key because we might then store too many documents in a partition and overflow the partition limit.

Example:

So if we use the first 4 (a number pulled out of thin air) characters of a guid as the partition key, how can we calculate roughly how many documents will be stored in each partition per month? For this example let's assume we're talking about partitioning 4 million documents a month.

Update

It sounds like every guid character has 16 possible values: 0-9 and a-f (the hex character set). Assuming guid characters are random (I'm not sure this is true), there should be 16^4 = 65,536 possible four-character guid prefixes (~65k combinations). Therefore, we'd have at most ~65k partitions. And if we assume a uniform random distribution, 4,000,000 documents spread over ~65,000 partitions should be roughly 61 documents per partition, right?
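The arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope check, assuming the prefix characters are uniformly random hex digits (which should hold for the leading characters of version-4 guids); `prefix_stats` is a hypothetical helper name, not part of any library. Under that assumption, the chance that two guids share the same first N characters is 1 / 16^N.

```python
def prefix_stats(n_chars: int, n_docs: int) -> dict:
    """Estimate partition statistics for an n_chars-long hex prefix.

    Assumes every prefix character is an independent, uniformly
    random hex digit (16 possible values each).
    """
    buckets = 16 ** n_chars  # number of distinct prefixes
    return {
        "buckets": buckets,
        # probability that two independent guids share the same prefix
        "p_same_prefix": 1 / buckets,
        # average number of documents landing in each partition
        "expected_docs_per_bucket": n_docs / buckets,
    }


stats = prefix_stats(4, 4_000_000)
print(stats)
```

For 4 characters and 4 million documents this gives 65,536 partitions and an average of about 61 documents each. Individual partitions will vary around that average, since the counts are random.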

Spreading data out evenly across partitions is only half of the equation. You also need to consider how you will be accessing the data. If you are partitioning with a random value and if you frequently run queries that return multiple documents it's going to seriously hinder performance because you'll be doing a lot of cross-partition queries. — Paul, Jan 10 '20 at 02:43
@Paul I understand and think we'll be fine. What we're doing is storing a kind of index into collectionB in collectionA. CollectionA is what I'm referring to in my question. We will always be able to pass the shard key in our find predicate against collectionA, because our queries will always have the full guid, which we can substring down to the shard key value. — cobolstinks, Jan 10 '20 at 14:08

score 0 · Accepted Answer · answered Jan 10 '20 at 02:11

Actually, you can get the collection's partition usage through the REST API, which shows how the data is distributed.

Cosmos DB has no built-in preview feature that shows partition usage before the data is stored. If you are concerned about the distribution beforehand, you can calculate it yourself first. For example, use GROUP BY to group the data by the first 4 characters of the guid.
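If the data does not exist yet, the same grouping can be approximated offline. The following is a minimal sketch that generates random version-4 guids with Python's standard `uuid` module and counts them by prefix. It stands in for the GROUP BY idea and does not call Cosmos DB at all; `prefix_counts` is a hypothetical helper, and the sample size and prefix length are kept small here only so it runs quickly.

```python
import uuid
from collections import Counter


def prefix_counts(n_docs: int, n_chars: int) -> Counter:
    """Count randomly generated guids by their first n_chars hex characters."""
    # uuid4().hex is the 32-character lowercase hex form without dashes
    return Counter(uuid.uuid4().hex[:n_chars] for _ in range(n_docs))


# Smaller than the 4 million / 4 character case in the question, for speed
counts = prefix_counts(200_000, 3)
sizes = list(counts.values())

print("partitions used:", len(counts))
print("smallest partition:", min(sizes))
print("largest partition:", max(sizes))
print("average per partition:", sum(sizes) / len(sizes))
```

Comparing the smallest and largest partition against the average gives a rough feel for how uneven the spread is likely to be before any data is stored.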

Jay Gong

Sure, I can use the API to get the distribution after the fact, so I'll mark this as the answer. I believe that the probability is roughly what I outlined in my update. – cobolstinks Jan 16 '20 at 14:37
