I'm working on large scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed more than 2 billion individual items (probably less than 500 million).
The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by using a contained but reasonably complex combination of transactional writes with embedded condition expressions AND standalone condition check write items.
The tokens themselves are UUIDs, and 'being efficient' are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes), however the downside is that the data doesn't visualise nicely in query consoles making support hard if we encounter any bugs and/or broken data. Note there is no extra code complexity since we implement attributevalue.Marshaler
interface to bind UUID
(language) types to DynamoDB Binary attributes, and similarly do the same for any composite attributes.
My question relates to (mostly) data size/saving. Since the tokens are the partition keys, and some mapping columns are [token] -> [other token composite attributes]
, for example two UUIDs concatenated together into 32 bytes.
I wanted to keep really tight control over storage costs knowing that, over time, we will be spending ~$0.25/GB per month for this. My question is really three parts:
- Are the PK/SK index size 'reserved' (i.e. padded) so it would make no difference at all to storage cost if we compress the overall field sizes down to the minimum possible size? (... I read somewhere that 100 bytes is typically reserved.
If they ARE padded, the cost savings for the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once hashed PK has routed the query to the right server node/disk etc.)
- Is there any observable query time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? i.e. if Dynamo has to read fewer pages it'll work faster, but in practice are we talking microseconds at best?
This is a secondary concern, but is worth considering if there is a lot of concurrent access to the data. UUIDs will distribute partitions but inevitably sometimes we will have some more active partitions than others.
- Are there any tools that can parse bytes back into human-readable UUIDs (or that we customise to inject behaviour to do this)?
This is concern, because making things small and efficient is ok, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, DynamoDB IntelliJ plugin and AWS NoSQL Workbench all garble the binary into unreadable characters.