
I have a datastore with around 1,000,000 entities in a model. I want to fetch 10 random entities from this.

I am not sure how to do this. Can someone help?

Dan McGrath
demos
  • possible duplicate of [Querying for N random records on Appengine datastore](http://stackoverflow.com/questions/1105004/querying-for-n-random-records-on-appengine-datastore) – Jader Dias Aug 31 '11 at 19:32

2 Answers


Assign each entity a random number and store it in the entity. Then query for ten records whose random number is greater than (or less than) some other random number.

You'll also need to sort on your random number column; otherwise, Google App Engine will pick 10 entries that are greater than (or less than) your number, but it will pick them in a non-random way. So, if you are picking records whose random number is greater than a random number, sort ascending on the column; otherwise, sort descending.

This isn't totally random, however, since entities with nearby random numbers will tend to show up together. If you want to beat this, do ten queries based around ten random numbers, but this will be less efficient.
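A minimal sketch of this approach in plain Python, using an in-memory list as a stand-in for the datastore (the entity shape and field names here are illustrative, not the GAE API):

```python
import random

# Stand-in for the datastore: each entity gets a random number in [0, 1)
# at creation time, stored alongside its data.
entities = [{"key": i, "rand_num": random.random()} for i in range(1000)]

def fetch_random(entities, n=10):
    """Pick a fresh random pivot, then take the n entities with the
    smallest rand_num >= pivot (mirrors an ascending '>=' query)."""
    pivot = random.random()
    matches = sorted((e for e in entities if e["rand_num"] >= pivot),
                     key=lambda e: e["rand_num"])[:n]
    if len(matches) < n:
        # Pivot landed near 1.0: wrap around to the smallest rand_nums.
        smallest = sorted(entities, key=lambda e: e["rand_num"])
        matches += smallest[:n - len(matches)]
    return matches
```

Note how the results are always consecutive in `rand_num` order, which is exactly the clustering effect described above.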

Craigo
Jason Hall
  • Exactly right. Might want to mention the range (0..1 is standard) for the random numbers. – Nick Johnson Jun 09 '10 at 09:16
  • One possibility to increase randomness without hurting read-time efficiency would be to enqueue a task to assign new random numbers to the entities you fetched, so if you hit one of them again you won't get the same neighbors with it. – Wooble Jun 09 '10 at 11:35
  • @NickJohnson could you clarify about the standard range? Sorry, I didn't understand what you meant by (0..1)? Also, to both of y'all: I'm worried about using up my one inequality filter for this operation (because in some queries I need it to be random but at the same time run an equality filter on another property). How bad is it to do 10 queries, is it basically 10x the cost? – iceanfire Nov 01 '13 at 00:04

Jason Hall's answer and the one here aren't horrible, but as he mentions, they are not really random either. Even doing ten queries will not be random if, for example, the random numbers are all grouped together. To keep things truly random, here are two possible solutions:

Solution 1

Assign an index to each datastore object, keep track of the maximum index, and randomly select an index every time you want to get a random record:

MyObject.all().filter('index =', random.randrange(0, maxindex + 1)).get()

Upside: Truly random. Fast.

Down-side: You have to properly maintain indices when adding and deleting objects, which can make both operations an O(N) operation.
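A sketch of this index approach in plain Python, with a dict standing in for the datastore and a manually tracked `maxindex` (both are assumptions of this illustration):

```python
import random

# Stand-in datastore keyed by a gap-free index 0..maxindex.
store = {i: {"index": i, "payload": "entity-%d" % i} for i in range(1000)}
maxindex = len(store) - 1

def fetch_random_entities(n=10):
    """Draw n distinct indices uniformly at random and look each one up
    directly -- one cheap get per record, no inequality filter needed."""
    chosen = random.sample(range(maxindex + 1), n)
    return [store[i] for i in chosen]
```

The hard part, as noted above, is keeping the index range gap-free when entities are deleted.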

Solution 2

Assign a random number to each datastore entity when it is created. Then, to get a random record the first time, query for a record whose random number is greater than or equal to some other random number, ordered by the random numbers (i.e. MyObject.all().filter('rand_num >=', random.random()).order('rand_num')). Save that query's cursor in memcache. To get a random record after the first time, load the cursor from memcache and fetch the next item. If there is no next item, run the query again with a fresh random number.

To prevent the sequence of objects from repeating, on every datastore read, give the entity you just read a new random number and save it back to the datastore.

Upside: Truly random. No complex indices to maintain.

Down-side: Need to keep track of a cursor. Need to do a put every time you get a random record.
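A rough sketch of this solution, again with in-memory stand-ins for the datastore and memcache; a real implementation would store an opaque query cursor rather than the last `rand_num` value, as described above:

```python
import random

datastore = [{"key": i, "rand_num": random.random()} for i in range(1000)]
memcache = {}  # stand-in for App Engine memcache

def next_random_entity():
    """Serve entities in rand_num order from a saved position; restart
    at a fresh random pivot when the walk runs off the end."""
    ordered = sorted(datastore, key=lambda e: e["rand_num"])
    last = memcache.get("last_rand")
    entity = None
    if last is not None:
        entity = next((e for e in ordered if e["rand_num"] > last), None)
    if entity is None:  # first call, or cursor exhausted: re-query
        pivot = random.random()
        entity = next((e for e in ordered if e["rand_num"] >= pivot),
                      ordered[0])
    memcache["last_rand"] = entity["rand_num"]
    # Give the served entity a new random number (the "put" step) so a
    # future hit on it won't return the same neighbors.
    entity["rand_num"] = random.random()
    return entity
```

Each call does one "put" (the `rand_num` reassignment), matching the down-side noted above.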

speedplane
  • "Even doing ten queries will not be random if, for example, the random numbers are all grouped together" - I presume you're talking about the random numbers that were assigned to the datastore rows. This is only an issue for small numbers of records - the standard deviation of gaps between values shrinks as the number of values increases, to the point where it's statistically insignificant. Your solution 1 requires a monotonic counter, which is a slow and expensive operation on App Engine. Solution 2 uses selection without replacement, which is different to what the OP was asking for. – Nick Johnson Jul 11 '12 at 00:46
  • Right, the naive approach breaks down if there are not many records or if you are retrieving them at a high rate. Also, once the rand_num values are set, their distribution is fixed. You won't get a good uniform distribution and there will be certain records that will only rarely be selected. – speedplane Jul 11 '12 at 03:36
  • No, that was my point - the larger the number of records you have, the smaller the standard deviation in interval. That is, there will be proportionally fewer entities that have abnormally small intervals assigned to them. Wooble's suggestion of reassigning numbers once you select a record would also help counteract this. – Nick Johnson Jul 11 '12 at 04:10