
I need to perform a nightly update in the datastore on a relatively large dataset (syncing a subset of corporate data with GAE). I've been using the bulkloader, and it does the job, but the write costs are really adding up. Since I'm specifying key strings for each entity, the bulkloader is essentially rewriting the ENTIRE entity for every record it loads, which in my case is about 90 writes PER ENTITY. (It's a large, flat dataset with a lot of indexes.) But only six of my 50 properties actually change overnight, so I'm doing a lot of redundant writing.

My first thought was to keep a cache of the prior night's build, loop through it for changes, get each changed entity, update only the properties that differ, and put() it back. This does reduce writes, but it takes a LONG time, even when I batch the put() calls. It only takes ~3 minutes to load the ENTIRE dataset with the bulkloader, and 16-18 minutes just to run the updates! (I'm using the remote API, BTW.) This won't work when I scale up.
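For reference, the diff step of that approach looks roughly like the sketch below. It is a minimal illustration, not my production code: the snapshots are assumed to be plain dicts keyed by entity key string, and the names `changed_properties` and `batches` are hypothetical helpers made up for this example.

```python
_MISSING = object()  # sentinel so a property that is new tonight counts as a change


def changed_properties(previous, current):
    """Return {key: {prop: new_value}} for records that differ from last night.

    Both arguments are assumed to be dicts of the form
    {key_string: {property_name: value}}.
    """
    changes = {}
    for key, row in current.items():
        old = previous.get(key)
        if old is None:
            # Record did not exist last night: every property is new.
            changes[key] = dict(row)
            continue
        delta = {}
        for prop, value in row.items():
            if old.get(prop, _MISSING) != value:
                delta[prop] = value
        if delta:
            changes[key] = delta
    return changes


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


previous = {
    "emp-1": {"name": "Ann", "dept": "Ops"},
    "emp-2": {"name": "Bob", "dept": "IT"},
}
current = {
    "emp-1": {"name": "Ann", "dept": "Sales"},
    "emp-2": {"name": "Bob", "dept": "IT"},
    "emp-3": {"name": "Cy", "dept": "IT"},
}

changes = changed_properties(previous, current)
key_batches = list(batches(sorted(changes), 1))
```

Each batch of keys would then go through one `ndb.get_multi()` call, the changed properties would be set on the returned entities, and the batch would be written back with one `ndb.put_multi()` call. Unchanged records such as `emp-2` never reach the datastore at all.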

I tried using ndb.KeyProperty in my model and only updating the changed fields via the bulkloader, but then I lose the ability to query/sort on the KeyProperty, which I need.

I also tried StructuredProperties, which DOES let you query/sort, but the structured property doesn't allow you to set an ID for it, so I can't load just the structured property.

So...is there a way for me to reduce these writes and keep the functionality I need? Can I use the bulkloader to update changes only? Do I need to restructure my dataset??

  • There is no way to do partial writes. Even if you change just a single property, you rewrite the whole entity and the indexes. If it's the same subset of properties that you update every time, you might consider storing these in a sub-entity whose parent is the main record, and only updating those sub-records. – Tim Hoffman May 22 '14 at 14:24
  • Damn, you're right: the Datastore API does not distinguish between creating a new entity and updating an existing one. If the object's key represents an entity that already exists, the put() method overwrites the existing entity. You can use a transaction to test whether an entity with a given key exists before creating one. See also the Model.get_or_insert() method.

    The problem with using the child/parent approach is that I sacrifice query/sort capabilities...
    – Simon Phoenix May 22 '14 at 15:21
