46

Any ideas how to pick an item/record randomly from a DynamoDB table? I don't believe there are any provisions for this in the API.

I thought about maintaining a table of NumericId|MyOtherKey ("NumericIdTable") and then generating a random number between 0 and the total number of records I have, then getting that item from NumericIdTable but it's not going to work in the long-run.

Thoughts/ideas welcome.

ben
  • 1,448
  • 1
  • 17
  • 12

6 Answers6

36

One approach I came up with to pick a random item from a DynamoDB Table:

  1. Generate a random RangeKey over all possible RangeKeys in your Table
  2. Query the Table with this RangeKey and the RangeKeyCondition GreaterThan and a Limit of 1

For example if you use a UUID as Identifier for your RangeKey you could get your random Item like the following

RandomRangeKey = new UUID
RandomItem = Query( "HashKeyValue": "KeyOfRandomItems",
                    "RangeKeyCondition": { "AttributeValueList":
                                "RandomRangeKey",
                                "ComparisonOperator":"GT"}, 
                    "Limit": 1 )

This way you get a random Item and only consume 1 read capacity.

There is a chance to miss the first query for a random variable by generating a smaller UUID than the smallest one used in the table. This chance scales down with the table scaling up and you can easily send another request using the SmallerThan Comparison on the same random key, which then ensures a hit for a random item.


If your Tabledesign doesn't allow randomizable RangeKeys you could follow your approach and create a separate RandomItem table and store the ID under a randomizable RangeKey. A possible table structure for this would be

*RandomItemTable
   TableName - HashKey
   UUID - Rangekey
   ItemId

Keep in mind, for this approach you need to manage the redundancy between the original table and the randomization table.

nenTi
  • 770
  • 5
  • 8
  • 2
    Thanks for this nenTi - I will have a look at implementing this. I must admit I didn't think about using a GT comparison operator on a UUID - nice idea :) – ben May 22 '12 at 08:21
  • 5
    DynamoDB query needs to specify a hash key. the above answer will work if you want to get a random row for a specific hash key. if you want to get a 'global' random item then it won't work :( – Erben Mo Jun 20 '14 at 03:39
  • I'm not sure if this is because it's old, but `RangeKey == SortKey` and `HashKey == PartitionKey` from what I could grok. You can't use `>`/`GT` on a `PartitionKey unfortunately and need to specifically have a sort key... – Cobertos Feb 01 '19 at 03:53
  • 4
    It is 2019 now. Is there any better solution for this? – t_sologub Jul 21 '19 at 04:44
35

If you're using GUID as your Hash Key for the table, you can do something like this:

var client = new AmazonDynamoDBClient();

var lastKeyEvaluated = new Dictionary<string, AttributeValue>() 
{ 
    { "YOUR_HASH_KEY", new AttributeValue(Guid.NewGuid().ToString()) } 
};

var request = new ScanRequest()
{
    TableName = YOUR_TABLE_NAME,
    ExclusiveStartKey = lastKeyEvaluated,
    Limit = 1
};
var response = client.Scan(request);

This will give you a random record every time since it generates a random GUID as the lastKeyEvaluated.

Automatico
  • 12,420
  • 9
  • 82
  • 110
cmilam
  • 800
  • 7
  • 13
  • 3
    So by setting the limit to 1, you'll only get one item. but how does setting ExclusiveStartKey to a random UUID that doesn't exist yet give you another random row?? Doesn't that generated UUID have to exist in the table already? And by its uniqueness, that'll never happen – azizj Mar 18 '17 at 14:58
  • 15
    The UUID doesn't need to exist in the table. For a given key, DynamoDB knows where it "should" exist, if it did. When you choose a random one, DynamoDB starts at that location, and moves to the next item, and returns it. It's similar to finding a random house on a street: choose a house number, go to where that house number should be, then move up the street until you find an actual house. – Trenton Jan 30 '18 at 22:11
  • Also, the value for YOUR_HASH_KEY can be generalized to choosing 2048 random bits, treating that like a string, and using that as the starting point. See https://stackoverflow.com/questions/5351277/ for how to do that in Java. – Trenton Jan 30 '18 at 22:22
  • I was skeptical but you really can use a random value for your partition key. If you have a sort key, you ALSO need to provide a value or you get an error about it not matching the schema: https://stackoverflow.com/questions/39989567/aws-dynamodb-scan-using-exclusivestartkey-option – Cobertos Feb 01 '19 at 03:51
  • 1
    Presumably if no items are returned, you should perform a follow-up query to grab the first item. If you don't, and you have a set containing only 1 item, and that item happens to have a very low UUID, it's possible your approach will take an untenably large number of requests until a single item is returned. – Lawrence Wagerfield Mar 20 '21 at 21:40
4

The naive way would be 1) use describe table call to get N (the total number of rows) in this table 2) select a random number i between 1 and N 3) scan. stop until you have seen i rows

I am thinking about a better way to do this. I will update when I have a good answer.

Erben Mo
  • 3,528
  • 3
  • 19
  • 32
0

One easy and effective way of doing this:

  1. Retrieve all the items in your table. Since DynamoDB has a limit of fetching only 1MB of data, use ProjectionExpression to retrieve only your Primary partition key in this operation.
  2. From the above result, you will get the total number of items. Now, just generate a random number between 0 and the item count. Let's call this random number, n.
  3. From the result in 1, take the n-th item which would be the Primary partition key value of that random item.
  4. Do another DynamoDB query with the Primary partition key value of the random item you just calculated, to get all the necessary column values.
Amit
  • 1,620
  • 1
  • 15
  • 24
  • 3
    Problems with this: 1. What if your data, even just the keys, is greater than 1MB? You're not selecting randomly over your entire data, then, unless you can guarantee that the items dynamo returns each time is truly random. 2. If your scan for just the keys is near 1MB then that's a huge operation every time you need a random row. – fivedogit Dec 20 '19 at 15:20
0

I have had the same issue, and solved it in a similar way to nenTi's suggestion, with some reservations. As stated in the comments, the method only works if selecting a range key at random over a set of items with the same partition key - so using it with the table's keys interferes with the best practice of having the partition key as distributed as possible.

I solved this by adding a GSI (global secondary index) with a hash key representing a large category of items to choose from (the category doesn't need to be distributed!) and a sort key with random values.

querying this GSI produced efficient random result (0.5 RCU per request out of a ~90KB database, only a single item scanned).

The query i used below:

response = table.query(
        Limit=1,
        IndexName="my_index",
        KeyConditionExpression=Key("type").eq("Q") & Key(random_property).gt(Decimal(str(Random().random()))),
        ReturnConsumedCapacity='TOTAL'
    )
Elad Levy
  • 23
  • 1
  • 7
-1

I would recommend the next approach for some use cases:

  1. Create a script that scans all dynamodb table and save "pk" and "sk" to separate file.

  2. Save all "pk" and "sk" pairs in the lambda layer as a local dependency (search for "Local dependency in package.json" if you use nodejs).

  3. Import local dependency in lambda and choose a random item.

  4. Run script regularly for a refreshing set of "pk" and "sk" in a local dependency.

I think this approach is good enough if you don't have an extremely huge dynamodb table and there is no requirement to include new "pk" and "sk" very often.

SAndriy
  • 670
  • 2
  • 15
  • 25