
I'm seeing very poor performance when fetching multiple keys from Memcache using ndb.get_multi() in App Engine (Python).

I am fetching ~500 small objects, all of which are in memcache. If I do this using ndb.get_multi(keys), it takes 1500ms or more. Here is typical output from App Stats:

App Stats and RPC Stats

As you can see, all the data is served from memcache. Most of the time is reported as being outside of RPC calls. However, my code is about as minimal as you can get, so if the time is spent on CPU it must be somewhere inside ndb:

# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get ~500 small items from memcache. This is very slow (~1500ms).
items = ndb.get_multi(item_keys)

The first memcache.get you see in App Stats is the single fetch to get a set of keys. The second memcache.get is the ndb.get_multi call.

The items I am fetching are super-simple:

class Item(ndb.Model):
    name = ndb.StringProperty(indexed=False)
    image_url = ndb.StringProperty(indexed=False)
    image_width = ndb.IntegerProperty(indexed=False)
    image_height = ndb.IntegerProperty(indexed=False)

Is this some kind of known ndb performance issue? Something to do with deserialization cost? Or is it a memcache issue?

I found that if instead of fetching 500 objects, I instead aggregate all the data into a single blob, my function runs in 20ms instead of >1500ms:

# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get individual item data.
# If we get all the data from memcache as a single blob it is very fast (~20ms).
item_data = memcache.get(items_data_key)
if not item_data:
    items = ndb.get_multi(item_keys)
    flat_data = json.dumps([{'name': item.name} for item in items])
    memcache.add(items_data_key, flat_data)
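
(When the blob is already cached, the fast path is just a decode, with no ndb involved, e.g. roughly:)

# Fast path: the blob decodes straight back to a list of plain dicts.
items = json.loads(item_data)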

This is interesting, but isn't really a solution for me since the set of items I need to fetch isn't static.

Is the performance I'm seeing typical/expected? All these measurements are on the default App Engine production config (F1 instance, shared memcache). Is it deserialization cost? Or is it due to fetching multiple keys from memcache? I don't think the issue is instance ramp-up time. I profiled the code line by line using time.clock() calls and I see the same pattern (absolute numbers roughly 3x lower than in App Stats, but still very slow). Here's a typical profile:

# Fetch keys: 20 ms
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
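
For reference, these numbers come from simple deltas between time.clock() calls around each step, along the lines of the rough sketch below (variable names as above). Note that on Unix-like platforms time.clock() measures CPU time rather than wall time, which may be part of why it undershoots the App Stats wall-clock numbers:

import logging
import time

from google.appengine.api import memcache
from google.appengine.ext import ndb

t0 = time.clock()
item_keys = memcache.get(items_memcache_key)
t1 = time.clock()
items = ndb.get_multi(item_keys)
t2 = time.clock()

logging.info('Fetch keys: %.0f ms', (t1 - t0) * 1000)
logging.info('ndb.get_multi: %.0f ms', (t2 - t1) * 1000)
logging.info('Number of keys is %d, fetch time per key is %.2f ms',
             len(item_keys), (t2 - t1) * 1000 / len(item_keys))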

Update: Out of interest I also profiled this with all the App Engine performance settings increased to maximum (F4 instance, 2400 MHz, dedicated memcache). The performance wasn't much better. On the faster instance the App Stats timings now match my time.clock() profile (so 500 ms to fetch 500 small objects instead of 1500 ms). However, it still seems extremely slow.

  • Are you testing the production environment or the SDK? – Tim Hoffman Jun 04 '14 at 06:55
  • All the tests shown are from production. – alzy Jun 04 '14 at 06:57
  • 500 is a lot of objects. Each needs to be deserialized and re-constituted, and that will take time. Is the single get that bad? Are you likely to get a really good cache hit ratio on items? What happens when you don't have all 500 in memcache but only half or a quarter? How do you even identify the set of 500 keys if it isn't a static set of keys? – Tim Hoffman Jun 04 '14 at 07:15
  • I believe the objects are serialized/deserialized using protobufs, and they aren't particularly quick (at least that used to be the case). – Tim Hoffman Jun 04 '14 at 07:16
  • I'm mostly interested to understand the basic reason why this is slow, so it can inform my future design choices. I can probably solve this specific issue by redesigning things to use a single memcache get. Deserialization sounds like the likely cause, but I wondered why AppStats shows a large blank space both before and after the memcache RPC. I would have expected it only after the RPC if it was all deserialization. And if it is the cause, is there any reasonable way to avoid the deserialization penalty, e.g. somehow deal with protobufs directly? – alzy Jun 04 '14 at 07:27
  • Your appstats graph shows two RPCs and two long periods. The way I read it, the gaps are after each RPC, not before and after. I assume that two memcache operations are being performed, and the long periods are the deserialization of the two sets. Have you compared the same operation with an empty memcache? – Tim Hoffman Jun 04 '14 at 07:30
  • You can drill down and see the detail of the RPC transactions. – Tim Hoffman Jun 04 '14 at 07:30
  • I don't see any way of avoiding the deserialization cost if you want to do anything with the entities, other than flicking them to a remote system that could decode them there (which I think would be problematic). – Tim Hoffman Jun 04 '14 at 07:34
  • Does your code do anything with the data? For instance, templating/loops over the objects? – Tim Hoffman Jun 04 '14 at 07:35
  • No, if I comment out the `ndb.get_multi(item_keys)`, the first memcache.get runs in <20ms and the function returns immediately. So the long periods both seem to be associated with `ndb.get_multi(item_keys)`. And I'm not doing anything else. The code shown here is literally what I'm profiling. – alzy Jun 04 '14 at 07:36
  • All I can suggest is you profile a bit deeper and follow the code in more detail. I do feel that what you are testing is quite static in nature, but you want to support a more dynamic solution. For instance, what are you going to do if the set of keys from your first memcache get is missing? – Tim Hoffman Jun 04 '14 at 07:41
  • So I looked at the RPC drill-down, and it seems like the problem is just general slowness of ndb. The first long gap seems to be caused by ndb setting up future objects for each key. The second long gap seems to be due to handling the response (presumably deserialization). I'm shocked it is so slow, given I'm only fetching about 100 kB of data. – alzy Jun 04 '14 at 07:51
  • As you have discovered a lot is going on under the hood. It might be worth trying async versions too. But it is also quite possible what you are doing doesn't suit ndb or appengine for that matter. Try the same thing with db, and do the cache management yourself. – Tim Hoffman Jun 04 '14 at 07:54
  • Yeah, I agree my overall design here might not be optimal. I was more interested to understand the general performance characteristics of App Engine so I can get a feel for what I'm working with. Basically it seems lots of gets are a no-no, whereas I expected it would be OK to have lots of gets if everything was in memcache. – alzy Jun 04 '14 at 07:56
  • Think a bit more about what is going on. ndb.get_multi is going to try to get each key from, potentially, the instance-level cache, then memcache, and then fail over to the datastore. It's not just "fetch a bunch of items with this list of keys". For 500 keys that is a lot of work to set up to guarantee you get all the keys in your list of 500. – Tim Hoffman Jun 04 '14 at 07:56

1 Answer


I investigated this in a bit of detail, and the problem is ndb and Python, not memcache. The reason things are so incredibly slow is partly deserialization (explains about 30% of the time), and the rest seems to be overhead in ndb's task queue implementation.

This means that, if you really want to, you can avoid ndb and instead fetch and deserialize from memcache directly. In my test case with 500 small entities, this gives a massive 2.5x speedup (650ms vs 1600ms on an F1 instance in production, or 200ms vs 500ms on an F4 instance). This gist shows how to do it: https://gist.github.com/mcummins/600fa8852b4741fb2bb1
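
The core idea, sketched below (not the gist verbatim, and not a drop-in replacement), is to read ndb's own memcache entries directly and deserialize the protobufs yourself. It leans on ndb internals: by default ndb caches each entity in memcache under a key of the form 'NDB9:' + key.urlsafe(), with the value being a partially serialized EntityProto (the entity's key is stripped, and a small lock sentinel can appear in place of a blob). These are SDK implementation details and could change, so treat this as illustrative only:

from google.appengine.api import memcache
from google.appengine.datastore import entity_pb
from google.appengine.ext import ndb

# ndb's internal memcache key prefix (an implementation detail).
_NDB_MEMCACHE_PREFIX = 'NDB9:'

def fast_get_multi(keys):
    """Fetch cached entities straight from memcache, bypassing ndb's machinery.

    Returns a list parallel to `keys`; an entry is None on a cache miss (or if
    ndb has stored its lock sentinel), in which case the caller must fall back
    to a normal ndb/datastore fetch.
    """
    mc_keys = [_NDB_MEMCACHE_PREFIX + k.urlsafe() for k in keys]
    blobs = memcache.get_multi(mc_keys)
    entities = []
    for key, mc_key in zip(keys, mc_keys):
        blob = blobs.get(mc_key)
        if not isinstance(blob, str):
            entities.append(None)
            continue
        # ndb caches a partially serialized EntityProto without its key.
        pb = entity_pb.EntityProto()
        pb.MergePartialFromString(blob)
        cls = ndb.Model._lookup_model(key.kind())
        entity = cls._from_pb(pb)
        entity._key = key  # reattach the key that ndb stripped before caching
        entities.append(entity)
    return entities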

Here is the App Stats output for the manual memcache fetch and deserialization:

App Stats: manual memcache fetch and deserialization

Now compare this to fetching exactly the same entities using ndb.get_multi(keys):

App Stats: ndb fetch of the same items

Almost 3x difference!!

Profiling of each step is shown below. Note the timings don't match App Stats because they come from time.clock() on an F1 instance, where real (wall) time is roughly 3x the reported clock time.

Manual version:

# memcache.get_multi: 50.0 ms
# Deserialization:  140.0 ms
# Number of keys is 521, fetch time per key is 0.364683301344 ms

vs ndb version:

# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms

So ndb takes about 1 ms per entity fetched, even if the entity has a single property and is sitting in memcache. That's on an F4 instance. On an F1 instance it takes 3 ms. This is a serious practical limitation: if you want to maintain reasonable latency, you can't fetch more than ~100 entities of any kind when handling a user request on an F1 instance.

Clearly ndb is doing something really expensive and (at least in this case) unnecessary. I think it has something to do with its task queue and all the futures it sets up. Whether it is worth going around ndb and doing things manually depends on your app. If you have some memcache misses then you will have to go do the datastore fetches. So you essentially end up partly reimplementing ndb. However, since ndb seems to have such massive overhead, this may be worth doing. At least it seems so based on my use case of a lot of get_multi calls for small objects, with a high expected memcache hit rate.
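
If you do go this route, the miss handling can be a thin wrapper around a helper like the fast_get_multi sketch above: fetch the gaps with a normal ndb.get_multi, which also re-populates memcache for the next request. Again just a sketch:

def get_multi_with_fallback(keys):
    # Try memcache directly first; fall back to ndb only for the misses.
    entities = fast_get_multi(keys)
    missing = [k for k, e in zip(keys, entities) if e is None]
    if missing:
        # ndb.get_multi writes fetched entities back to memcache (under the
        # default caching policy), so later requests take the fast path.
        fetched = dict(zip(missing, ndb.get_multi(missing)))
        entities = [e if e is not None else fetched[k]
                    for k, e in zip(keys, entities)]
    return entities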

It also seems to suggest that if Google were to implement some key bits of ndb and/or deserialization as C modules, Python App Engine could be massively faster.
