Are keys-only Datastore queries that return thousands of results unavoidably slow?

Question

I have an App Engine app using the go115 runtime that interacts with Cloud Datastore via the google.golang.org/appengine/v2/datastore package. The app inserts and queries Song entities with automatically-assigned integer IDs and a Rating field (along with other fields):

type Song struct {
    // ...
    Rating float64 `json:"rating"`
    // ...
}

With 19,900 Song entities, https://console.cloud.google.com/datastore/stats reports a data size of 15.53 MB, a built-in index size of 102.27 MB, and a composite index size of 86.31 MB. The app is running in us-central and Datastore is in nam5 (us-central).

The following keys-only query consistently takes 2-4 seconds to return 13,571 keys when run from App Engine:

q := datastore.NewQuery("Song").KeysOnly().Filter("Rating >=", 0.75)
keys, err := q.GetAll(ctx, nil)

Using Run instead of GetAll and collecting the IDs while iterating seems to be a tiny bit faster. I also tried using the cloud.google.com/go/datastore package from my laptop just to make sure that the older appengine package isn't particularly slow, but the results were far worse (7-9 seconds).

My understanding (based on things like the Indexes document and the I/O 2010 "Next gen queries" talk) is that Datastore automatically maintains an index for the Rating field. I would expect it to be trivial to satisfy this query using that index. The integer IDs in the result set seem like they should consume about 100 KB of space.

I haven't found any official guidance from Google about what type of performance to expect from Datastore, but this is far slower than I'd hoped for. When I perform a brute-force approximation of this query on my laptop by running the jq tool and grep against a JSON dump of the data, it only takes 400 ms.

Is it expected that a trivial keys-only Datastore query will take multiple seconds to return thousands of keys?

When I run keys-only queries that return smaller result sets, the performance is better: queries that return 300-400 keys often take 100-200 ms, while queries that return fewer than 200 keys complete in 40-70 ms.

I'm unable to modify many of my queries to limit the number of results they return:

I sometimes perform multiple queries in parallel and then manually intersect or subtract their results to get around the Datastore limitation that inequality filters are limited to at most one property.
I sometimes get the full result set in order to return a random subset of it. I've read about approaches people take to simulate randomized results when using Datastore, but I don't think they'll work here since my queries already use inequality filters.

I can think of hacks that may help in some cases (e.g. sharding/partitioning my entities so I can run multiple queries in parallel), but I'd prefer to avoid that if possible.

score 1 · Answer 1 · answered Feb 01 '22 at 12:21

1

I would say that a keys-only query that returns 13,571 keys in 2-4 seconds is reasonably fast. It is still a query after all.

If the total number of possible queries is not that large, maybe you could store the query results in advance?

answered Feb 01 '22 at 12:21

new name

15,861
19
68
114

I currently cache query results in a Datastore entity that maps from query hash to result IDs, but the number of potential queries is large and their results are frequently invalidated by updates. It feels to me like Datastore imposes many restrictions in order to guarantee good performance, so I'm surprised that this query doesn't take, say, less than 100 ms -- aren't all of the keys that are being returned already present in the index? – derat Feb 01 '22 at 16:30

Are keys-only Datastore queries that return thousands of results unavoidably slow?

1 Answers1