9

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.

I currently assign a random number to every entity that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.

This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
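For reference, the approach described above can be sketched in plain Python, with an in-memory list standing in for the datastore index (the property name `rand_value` is illustrative):

```python
import random

# Simulate a datastore kind: each record gets a random float at write time.
records = [{"name": "rec%d" % i, "rand_value": random.random()} for i in range(10)]

def pick_one_random(records):
    """Mimics: WHERE rand_value > :r ORDER BY rand_value ASC LIMIT 1."""
    r = random.random()
    candidates = sorted(
        (rec for rec in records if rec["rand_value"] > r),
        key=lambda rec: rec["rand_value"],
    )
    # If r landed above every stored value, the query returns nothing.
    return candidates[0] if candidates else None

picked = pick_one_random(records)
```

Each call to `pick_one_random` stands in for one datastore query, which is why N records cost N round trips.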

David Underhill
  • 15,896
  • 7
  • 53
  • 61
aloo
  • 5,331
  • 7
  • 55
  • 94
  • I've created an issue for this, you can star it to help get it fixed: https://code.google.com/p/googleappengine/issues/detail?id=9044 – Sean Fujiwara Mar 31 '13 at 07:14

6 Answers

5

"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.

N independent uniform random selections are not (in general) consecutive in any index.

QED.

You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
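A minimal sketch of the block idea, with a sorted in-memory list standing in for the key-ordered index (one slice stands in for the single range query; `BLOCK_SIZE` and the wrap-around at the end of the index are assumptions):

```python
import random

BLOCK_SIZE = 100  # size of the contiguous block fetched in one query

def pick_n_from_block(index, n, block_size=BLOCK_SIZE):
    """index: entities sorted by their randomised property (standing in
    for the datastore index). One slice = one range query."""
    start = random.randrange(len(index))
    block = index[start:start + block_size]      # single contiguous fetch
    if len(block) < block_size:                  # near the end: wrap to the front
        block += index[:block_size - len(block)]
    return random.sample(block, min(n, len(block)))

index = sorted(random.random() for _ in range(1000))
picks = pick_n_from_block(index, 5)
```

The N results are contiguous in the index, but because the index order is itself randomised, they don't look obviously related.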

Steve Jessop
  • 273,490
  • 39
  • 460
  • 699
  • Thanks - I do really need randomized results so I guess I will have to use multiple datastore calls. I'll try to minimize N as much as I can I guess. – aloo Jul 09 '09 at 18:24
  • this isn't true. both [batch operations](https://developers.google.com/appengine/docs/python/datastore/entities?hl=en#Batch_Operations) and the [`IN`](https://developers.google.com/appengine/docs/python/datastore/queries#Property_Filters) query operator can return non-consecutive entities. – ryan Oct 02 '12 at 20:07
  • @ryan: same with `!=`. Both that and `IN` are implemented as a limited number of sub-queries. Batch ops aren't really relevant to the question, but yes, it's true that certain operations act on entities that aren't contiguous in any index. Just not searches. – Steve Jessop Oct 03 '12 at 08:29
  • good point! sorry, i missed the "query" part of your answer. you're definitely right in general. there is one exception: [merge join queries](https://developers.google.com/appengine/articles/indexselection#Improved_Query_Planner) do return non-contiguous entities. they were only used for a small class of queries at launch, but they've grown fast recently with the advanced query capabilities Alfred describes in the [article](https://developers.google.com/appengine/articles/indexselection). – ryan Oct 05 '12 at 00:43
  • @ryan: Actually I edited my answer in response to your comment, so I don't think you missed anything first time. Before, I wasn't clear that the "datastore calls" I'm referring to are queries. Interesting about the zigzag merge join -- I think I could probably rescue my "proof" with some lemma that the queries it performs are equivalent to queries that *are* contiguous in an index that could have been built but actually wasn't. But I'm not sure, so I won't try. I've commented before that every time I try to answer a GAE question in the negative, you guys add functionality :-) – Steve Jessop Oct 05 '12 at 08:54
5

What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.

Here's what you need to do:

When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)

Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.

This method does require dealing with a few annoying details:

  1. Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
  2. You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")

So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
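Id-generation details aside, the read path can be sketched like this, with a plain dict standing in for the datastore and `get_n_random` mimicking one batch get by key plus retries for ids that hit a hole (all names are illustrative):

```python
import random

# Simulated datastore: entities keyed by the integer ids an IdGenerator handed out.
datastore = {i: "entity-%d" % i for i in range(1, 1001)}
del datastore[17], datastore[400]     # holes: deleted entities / failed put()s
last_id = 1000                        # what the IdGenerator would report

def get_n_random(n):
    """One batch get by key, plus retry batches for ids with no entity."""
    picked = {}
    while len(picked) < n:
        ids = random.sample(range(1, last_id + 1), n - len(picked))
        batch = {i: datastore.get(i) for i in ids}   # one db.get(keys) round trip
        picked.update({i: e for i, e in batch.items()
                       if e is not None and i not in picked})
    return list(picked.values())

entities = get_n_random(5)
```

As long as the id space stays dense, the retry loop almost always finishes in one or two batches.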

Peter Recore
  • 14,037
  • 4
  • 42
  • 62
  • 1
    You can more or less implement this using App Engine's built in numbering for IDs - if you know the maximum ID, you can pick some uniformly at random. Some won't exist, so retry them with new random ids, and so forth. If your ID space is dense, this will work fine. – Nick Johnson Jul 10 '09 at 09:44
  • sweet. I didn't know we could rely on the built in numbering to start at 1 and go up 1 by 1 from there. – Peter Recore Jul 10 '09 at 15:33
  • You can't - but it will allocate in blocks, and as long as the blocks get _mostly_ used, your retries should be small enough to be manageable. – Nick Johnson Jul 10 '09 at 22:16
3

Looks like the only method is storing a random integer value in a special property on each entity and querying on that. This can be done quite automatically if you just add an automatically initialized property.

Unfortunately this will require processing of all entities once if your datastore is already filled in.

It's weird, I know.
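A rough sketch of the idea, with plain dicts standing in for entities (the property name `rand` and the one-off backfill loop are illustrative assumptions):

```python
import random

def new_entity(name):
    # `rand` is the auto-initialised property: set once, at write time.
    return {"name": name, "rand": random.random()}

entities = [new_entity("e%d" % i) for i in range(50)]

# One-off backfill pass for pre-existing entities that lack the property:
legacy = [{"name": "old%d" % i} for i in range(10)]
for ent in legacy:
    ent.setdefault("rand", random.random())   # process each entity once

everything = entities + legacy
```

After the backfill, every entity can be queried on `rand` the same way.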

  • I think this is a great approach, and fits the NoSQL model of doing the work on write rather than read. Of course, this wouldn't be entirely random -- if you always got N sequential entries back then a user would occasionally see the same records next to each other. But that might be random enough for the OP. (You could also create several -- hundreds even -- properties with different random numbers and rotate which index you draw from.) – npdoty Feb 05 '11 at 22:02
1

I agree with the answer from Steve: there is no way to retrieve N random rows in one query.

However, even the method of retrieving a single entity does not usually return results with an evenly distributed probability. The probability of returning a given entity depends on the gap between its randomly assigned number and the next lower random number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "10" 8 times more often than "2".

I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share.
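Dirk doesn't show his fix here, but one common mitigation along these lines (an assumption on my part, not necessarily his method) is to give the returned entity a fresh random value after each read, so no gap stays fixed forever:

```python
import random

entities = [{"id": i, "rand": random.random()} for i in range(100)]

def pick_and_reshuffle(entities):
    """Pick via the usual rand > r query, then assign the winner a new
    random value so its gap in the distribution gets reshuffled."""
    r = random.random()
    above = [e for e in entities if e["rand"] > r]
    winner = (min(above, key=lambda e: e["rand"]) if above
              else min(entities, key=lambda e: e["rand"]))  # wrap around
    winner["rand"] = random.random()  # the extra put() per read: the expensive part
    return winner

picked = pick_and_reshuffle(entities)
```

This doesn't make any single read uniform, but repeated reads stop favouring the same wide-gap entities.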

Dirk
  • 11
  • 1
0

I just had the same problem. I decided not to assign IDs to my already existing entries in datastore and did this, as I already had the totalcount from a sharded counter.

This selects "count" entries from "totalcount" entries, sorted by key.

    # select $count random offsets from the complete set of $totalcount entries
    numberlist = random.sample(range(0, totalcount), count)
    numberlist.sort()

    pagesize = 1000

    # init buckets: one bucket per page of the key-ordered index
    buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
    for k in numberlist:
        thisb = int(k / pagesize)
        buckets[thisb].append(k - (thisb * pagesize))
    logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

    # page through the results
    result = []
    baseq = db.Query(MyEntries, keys_only=True).order("__key__")
    wq = baseq  # query for the current page; starts at the beginning of the index
    for b, l in enumerate(buckets):
        if len(l) > 0:
            result += [wq.fetch(limit=1, offset=e)[0] for e in l]

        if b < len(buckets) - 1:  # not the last bucket
            lastkey = wq.fetch(1, pagesize - 1)[0]  # last key of this page
            wq = baseq.filter("__key__ >", lastkey)

Beware that this is somewhat complex to me, and I'm still not convinced that I don't have off-by-one or off-by-x errors.

Also beware that if count is close to totalcount this can be very expensive, and that on millions of rows it might not be possible to finish within App Engine's time boundaries.

svrist
  • 7,042
  • 7
  • 44
  • 67
-1

If I understand correctly, you need to retrieve N random instances.

It's easy. Just do a keys-only query, call random.choice on the resulting list of keys N times, then fetch the results by key.

    keys = MyModel.all(keys_only=True)

    n = 5  # 5 random instances

    all_keys = list(keys)
    result_keys = []

    for _ in range(0, n):
        key = random.choice(all_keys)
        all_keys.remove(key)
        result_keys.append(key)

    # result_keys now contains 5 random keys; fetch the entities in one call:
    entities = db.get(result_keys)
Arthur G
  • 117
  • 5
  • And if you have a million entities in your data store? Load all the keys from the data store - seems bad... – aloo Oct 03 '12 at 14:05
  • @aloo if you have that many instances, you can keep track of the total number of them in the datastore and memcache, then just do `random.choice` on a range of numbers between 0 and the total, and iterate on keys with the indexes you generated. Or just use limit and offset. – Arthur G Oct 03 '12 at 15:41