I'm using a partitioned Cosmos DB, but I don't know the partition key value each time I want to get a resource by its id. Using the id as the partition key is not a solution for me, since it would take too long and take up too much space (I heard the maximum for partition keys is 10 GB, but I have way more than that).

My idea is to manipulate 2 bytes of the GUID in order to map my partition key value to each GUID. That way I don't have to use a cross-partition query, but can easily get the partition key value from my GUID. Are there any reserved bytes in a GUID that I cannot change? Are there any best practices for this problem?

Carmen
  • GUIDs are guaranteed unique via their *construction*. If you take a GUID and alter some bytes, you can no longer rely on their construction to guarantee uniqueness. – Damien_The_Unbeliever Mar 18 '19 at 15:30
  • @Damien_The_Unbeliever even if I only change 2 bytes? Clearly it's not as good as the original GUID but does it make that much of a difference? – Carmen Mar 18 '19 at 15:33
  • What do you mean by "it takes too long"? Using the id as the partition key is fine as long as you know the id every time. The only 10 GB limit is per partition key value. You will never reach that because you only have one item per partition key value. – Nick Chapsas Mar 18 '19 at 20:17
  • You could just add your partition key to id and keep the GUIDs clean and by spec. Ex: "MyPK12_92FC117F-B8EB-4CB0-920D-73AC3711958D". – Imre Pühvel Mar 19 '19 at 08:38
  • @NickChapsas what I meant by 'it takes too long': searching within a logical partition is way more efficient than searching over all resources. Also, you can carry out transactions within a partition using stored procedures. Therefore it's better to have logical partitions. – Carmen Mar 21 '19 at 07:53

1 Answer

Version 1 GUIDs contain a time-based component. You can change it without causing a collision, provided you understand the consequences. You can read some thoughts on this here:

https://blog.stephencleary.com/2010/11/few-words-on-guids.html

The relevant section reads:

Time-Based GUIDs (Version 1)

Time-based GUIDs are Variant 2, Version 1 RFC 4122 GUIDs, also known as “sequential GUIDs” because they can be generated with values very close to each other. They consist of three fields in addition to Variant and Version: a 60-bit UTC Timestamp, a 14-bit Clock Sequence, and a 48-bit Node Identifier.
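
(Not part of the quoted post.) To make the field layout concrete, here is a minimal sketch that reads those fields back out of a Version 1 GUID using Python's standard `uuid` module. The helper name `describe_v1` is made up for illustration; the sample value is the DNS namespace UUID listed in RFC 4122, which happens to be a Version 1 GUID.

```python
import uuid


def describe_v1(u: uuid.UUID) -> dict:
    """Return the logical fields of a version 1 (time-based) GUID."""
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) GUID")
    return {
        "version": u.version,      # 4 bits, fixed to 1
        "variant": u.variant,      # reserved bits identifying the layout
        "timestamp": u.time,       # 60-bit count of 100 ns intervals
        "clock_seq": u.clock_seq,  # 14-bit clock sequence
        "node": u.node,            # 48-bit node identifier
    }


# DNS namespace UUID from RFC 4122 (a version 1 GUID).
sample = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
fields = describe_v1(sample)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The version and variant bits are the only ones whose values are fixed by the layout; the remaining 122 bits belong to the timestamp, clock sequence, and node fields.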

The Node Identifier is normally the MAC address of the computer generating the time-based GUID (which is guaranteed to be unique, since MAC addresses use a registration system). However, it may also be a 47-bit random value (with the broadcast bit set). In this case, there is no danger of collision with real MAC addresses because the broadcast bit of a physical MAC address is always 0. There is a danger of collision with other random node identifiers, though; specifically, there is a 50% chance of collision once 13.97 million random nodes enter the network.
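
(Not part of the quoted post.) The 13.97 million figure can be reproduced with the usual birthday-problem approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible values. The sketch below assumes that approximation is what the author used; the function name is hypothetical.

```python
import math


def nodes_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate how many random values of `bits` bits give collision probability p."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


# 47 random bits are available once the broadcast bit is fixed.
n = nodes_for_collision_probability(47)
print(f"{n / 1e6:.2f} million")  # prints "13.97 million"
```

The same function shows why shrinking the random portion of a GUID matters: every bit you take away reduces the number of values you can generate before collisions become likely by a factor of about 1.41.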

Note: using a random value instead of the MAC address is not currently supported by Microsoft’s Win32 API. This means that any GUID generation done using UuidCreateSequential will expose the MAC address.

The Clock Sequence field is initialized to a random value and incremented whenever the system clock has moved backward since the last generated GUID (e.g., if the computer corrects its time with a time server, or if it lost its date and thinks it’s 1980). This allows 16,384 clock resets without any danger of a collision. If the GUIDs are being generated so quickly that the system clock has not moved forward since the last GUID’s timestamp, then the GUID generation algorithm will generally stall until the system clock increments the timestamp.

Sequential GUIDs are not actually sequential. In normal circumstances, GUIDs being generated by the same computer will have gradually increasing Timestamp fields (with the other fields remaining constant). However, the Timestamp field is not in the least-significant bit positions of the GUID, so if the GUID is treated as a 128-bit number, it does not actually increment.
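
(Not part of the quoted post.) A small sketch of why a later timestamp does not always mean a larger 128-bit value: the low 32 bits of the timestamp are stored first, so when they wrap around, the carry goes into a field further to the right. The helper `make_v1` is hypothetical and assembles a Version 1 GUID by hand, following the RFC 4122 field order.

```python
import uuid


def make_v1(timestamp: int, clock_seq: int, node: int) -> uuid.UUID:
    """Assemble a version 1 GUID from its logical fields (illustrative helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000       # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80       # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


earlier = make_v1(0xFFFFFFFF, clock_seq=0, node=1)
later = make_v1(0x100000000, clock_seq=0, node=1)  # one tick (100 ns) later

print(earlier)  # ffffffff-0000-1000-8000-000000000001
print(later)    # 00000000-0001-1000-8000-000000000001
print(later.time > earlier.time, later.int > earlier.int)  # True False
```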

It’s important to note that the likelihood of collisions of sequential GUIDs is extremely small. The Clock Sequence and Timestamp almost certainly uniquely identify a point in time, and the Node Identifier almost certainly identifies a unique source.

Sequential GUIDs can be created by the Win32 function UuidCreateSequential or by using uuidgen.exe from the Windows SDK passing the -x parameter.

You can also review the relevant specification here:

https://www.rfc-editor.org/rfc/rfc4122#section-4.1.1
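
As for which bits are reserved: in the RFC 4122 layout, the version sits in the high 4 bits of byte 6 and the variant in the top bits of byte 8. The sketch below is only one possible way to carry a 2-byte partition value, assuming a random (Version 4) GUID and overwriting its last two bytes so that the version and variant stay intact. The function names are hypothetical, and this leaves fewer random bits, so collisions become somewhat more likely than with an untouched GUID.

```python
import uuid
from typing import Optional


def embed_partition(pk: int, base: Optional[uuid.UUID] = None) -> uuid.UUID:
    """Overwrite the last two bytes of a GUID with a 16-bit partition value."""
    if not 0 <= pk < 2 ** 16:
        raise ValueError("partition value must fit in 2 bytes")
    raw = bytearray((base or uuid.uuid4()).bytes)
    # Bytes 14 and 15 are random in a version 4 GUID; bytes 6 and 8
    # (version and variant) are deliberately left alone.
    raw[14:16] = pk.to_bytes(2, "big")
    return uuid.UUID(bytes=bytes(raw))


def extract_partition(u: uuid.UUID) -> int:
    """Read the 16-bit partition value back from the last two bytes."""
    return int.from_bytes(u.bytes[14:16], "big")


doc_id = embed_partition(0x1234)
print(doc_id, hex(extract_partition(doc_id)))
```

Prefixing the partition key to the id as a string, as suggested in the comments, avoids touching the GUID at all and may be the simpler option.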

jwh20