Random Sampling of size N in Dynamo DB without full Table scan

Question

I am new to DynamoDB and am having trouble finding a way to randomly get items without a full table scan. Most of the algorithms I found rely on full table scans. I am also considering the case where we have no additional information about the table (columns and column types are unknown). Is there a way to do this?

cementblocks · Answer 1 · 2021-08-23T00:46:04.130

1

You can randomly sample by using a randomly generated exclusive start key for the scan or query operation. The exclusive start key does not have to match a record in the table. It just needs to follow the key structure of the table/index.
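A minimal sketch of this idea in Python, assuming a boto3-style low-level client is passed in. The helper names (`key_schema`, `random_key_value`, `sample_items`) are hypothetical; the sketch reads the key schema with `describe_table` so that it also covers the case where the columns are not known in advance, and the random value ranges are arbitrary choices:

```python
import random
import uuid


def key_schema(client, table_name):
    """Return [(attribute_name, attribute_type), ...] for the table's primary key."""
    table = client.describe_table(TableName=table_name)["Table"]
    types = {d["AttributeName"]: d["AttributeType"]
             for d in table["AttributeDefinitions"]}
    return [(k["AttributeName"], types[k["AttributeName"]])
            for k in table["KeySchema"]]


def random_key_value(attr_type):
    """Return a random attribute value for a key type (S, N or B)."""
    if attr_type == "S":
        return {"S": str(uuid.uuid4())}
    if attr_type == "N":
        # DynamoDB numbers are sent as strings; the range is an arbitrary assumption.
        return {"N": str(random.randint(-10**9, 10**9))}
    return {"B": uuid.uuid4().bytes}


def sample_items(client, table_name, n, max_attempts=None):
    """Collect up to n items, each from a one-item Scan started at a random key."""
    schema = key_schema(client, table_name)
    items = []
    for _ in range(max_attempts or 10 * n):
        if len(items) >= n:
            break
        start_key = {name: random_key_value(attr_type)
                     for name, attr_type in schema}
        response = client.scan(
            TableName=table_name,
            Limit=1,
            ExclusiveStartKey=start_key,
        )
        # An empty page means the random key landed past the last item; retry.
        items.extend(response.get("Items", []))
    return items
```

With boto3 this could be called as `sample_items(boto3.client("dynamodb"), "my-table", 5)`. The result can contain duplicates and is not guaranteed to be uniformly distributed, so deduplicate by key if that matters.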

edited Aug 23 '21 at 00:46

answered Aug 18 '21 at 15:08


score -1 · Answer 2 · answered Aug 18 '21 at 12:18

As with most questions about queries in DynamoDB, how you structure your data depends on how you want to query it.

For something like random sampling, you have to make it conform to the following core constraint of DynamoDB:

You have to provide a partition key
You can provide a sort key

So with a "single table" type design, you could structure your data something like this:

PK	SK	myVal
my_dict	6caaf1e3-eb8d-404a-a2ae-97d6682b0224	foo
my_dict	1c5496e8-c660-4b4e-980f-4abfb1942863	bar
my_dict	56551340-fff8-4824-a5be-70fcaece2e1a	baz
my_other_dict	520a7b37-233c-49dd-87da-77d871d98c92	test1
my_other_dict	65ccd54e-72c3-499d-a3a7-0cd989252607	test2

The PK is the identifier for your collection of random things to look up. The SK is a random UUID. And myVal contains the value you want to be returned.

You can query this db the following way:

SELECT * FROM "my-table" WHERE PK = 'my_dict' AND SK < '06a04e20-b239-48f2-a205-552eb61fef35'

By querying with a UUID as the SK bound, you get the items whose UUIDs sort below it, and you can take the one closest to the bound. By using a fresh random UUID each time you query, you'll get a random result back.

The particular query above actually returns nothing, so you need to retry until you get a result.
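The query-and-retry loop could look roughly like this in Python, assuming a boto3-style low-level client and the `PK`/`SK` attribute names from the table above. The function name `random_item` and the `max_attempts` default are hypothetical. Note one choice that goes beyond the PartiQL statement: the sketch reads in descending order with `Limit=1`, so that the single item returned is the one nearest below the random bound rather than always the smallest sort key.

```python
import uuid


def random_item(client, table_name, pk_value, max_attempts=20):
    """Return one item from the collection `pk_value`, or None if every attempt missed."""
    for _ in range(max_attempts):
        response = client.query(
            TableName=table_name,
            KeyConditionExpression="PK = :pk AND SK < :sk",
            ExpressionAttributeValues={
                ":pk": {"S": pk_value},
                ":sk": {"S": str(uuid.uuid4())},  # fresh random bound per attempt
            },
            ScanIndexForward=False,  # descending: first hit is nearest below the bound
            Limit=1,
        )
        if response["Items"]:
            return response["Items"][0]
        # The random bound sorted below every stored SK; try again.
    return None
```

With boto3 this could be called as `random_item(boto3.client("dynamodb"), "my-table", "my_dict")`.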

Also, I haven't done the math (who has?), but I'd imagine that periodic queries like this won't generate perfectly random distributions, especially for small data sets.
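For a small data set the math can be sketched directly. The snippet below is an illustration under simplifying assumptions: the random bound is treated as uniform over all 128-bit values (real version-4 UUIDs fix a few bits), and the nearest key below the bound is the one returned. Under those assumptions, each item's chance of being picked is proportional to the gap between it and the next key above it. The function name `selection_probabilities` is hypothetical.

```python
from fractions import Fraction
import uuid

# Sort keys of the "my_dict" collection from the example table.
SORT_KEYS = [
    "6caaf1e3-eb8d-404a-a2ae-97d6682b0224",
    "1c5496e8-c660-4b4e-980f-4abfb1942863",
    "56551340-fff8-4824-a5be-70fcaece2e1a",
]


def selection_probabilities(sort_keys):
    """Map each key to its chance of being the nearest key below a uniform bound.

    The entry under None is the chance that the bound is not above any key,
    so the query returns nothing.
    """
    space = 2 ** 128
    ordered = sorted(sort_keys, key=lambda s: uuid.UUID(s).int)
    probabilities = {}
    for i, key in enumerate(ordered):
        low = uuid.UUID(key).int
        # Bounds in (low, high] select this key; the last key owns the rest of the space.
        high = uuid.UUID(ordered[i + 1]).int if i + 1 < len(ordered) else space - 1
        probabilities[key] = Fraction(high - low, space)
    probabilities[None] = Fraction(uuid.UUID(ordered[0]).int + 1, space)
    return probabilities


if __name__ == "__main__":
    for key, p in selection_probabilities(SORT_KEYS).items():
        print(key, round(float(p), 4))
```

For the three keys above the split is far from even: the largest key is picked in roughly 58% of queries, the smallest in roughly 23%, the middle one in roughly 9%, and about 11% of queries return nothing. The skew shrinks as more randomly generated keys fill the space, but it does not vanish.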

Primary key has to be unique, you cannot put multiple items having the same PK in dynamoDB. DynamoDB is a key-value store, not a SQL database. — aherve, Aug 22 '21 at 07:40
@aherve I thought the primary key can be either of 1) partition key or 2) partition key + sort key. In the second case, partition key does not have to be unique. https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/ — nullforce, Aug 23 '21 at 00:57
@nullforce Fair point. However doing this would force dynamo to write an entire "table" on the same shard. While being technically possible, this approach will eventually lead to terrible performances as the data won't be distributed enough. Almost all of the benefits of _dynamoDB_ comes from the fact that the data is supposed to be highly distributed — aherve, Aug 23 '21 at 06:23
@aherve that's true, but also not true :) DynamoDB has lots of mechanisms to deal with hot partitions etc. See this article from 2019: https://aws.amazon.com/blogs/database/choosing-the-right-number-of-shards-for-your-large-scale-amazon-dynamodb-table/ — August Lilleaas, Aug 23 '21 at 08:07
