How to model geography for datastore inserts?

Question

I have a primitive data model with users that can make inserts that should have a city and a region. To prepare for articles that are connected with several cities (so that for instance a product offering that is the same in two or more cities will become only one article with a list of cities instead of duplicated articles for every single city that the article is connected to).

class Region(db.Model):
    name = db.StringProperty()
    countrycode = db.StringProperty()
    vieworder = db.IntegerProperty() # custom ORDER BY variable to order by population 
    areacode = db.IntegerProperty()
    areacodes = db.ListProperty(int)


class City(db.Model):
    region = db.ReferenceProperty()
    name = db.StringProperty()
    vieworder = db.IntegerProperty()
    areacode = db.IntegerProperty()

So I could manage to make the storage and the views but the data model is not good.

class Article(db.Model):
    cities = db.ListProperty(db.Key)
    regions = db.ListProperty(db.Key)

At the insert it is coded:

if self.request.get('area'):
    city = model.City.get_by_id(long(self.request.get('area')))
    region = model.Region.get(city.region.key())
    article.cities.append(city.key())
    article.regions.append(region.key())
    article.city = unicode(city.name)
    article.region = unicode(region.name)
    article.put()

This generates redundancy and is not very pretty (and not in 1NF since it it saved a list in a field).

enter image description here

When building the index for the search API I so far only use one city, but I plan to handle lists of cities and lists of regions (although a city can never be in two regions, so everything but a city list is actually redundand, but I save erdundancy to avoid lengthy lookups at searches and views). I wonder if I used referenceproperties and keys correctly and if I would be better off using the NDB models instead?

score 3 · Accepted Answer · answered Mar 20 '14 at 13:31

First of all, if you're caring about 1NF and normalization forget about the datastore, you need Google Cloud SQL (MySQL) or any other relational database. Wikipedia says :

First normal form (1NF) is a property of a relation in a relational database

If you build your app on this "NOSQL", hierarchical, key-value datastore, it means that you're in need of scalable performance. Joins are expensive and disk space is not : duplicate, denormalize your data so you can access it fast. Forget about the "right and wrongs" you learned about RDBMS modeling. You're in another world there.

In your case, if you want your app to scale, you need to duplicate your region/city/country/whatever on your articles so you don't have to make several additional queries. Your use case fits the datastore, since your city is not going to change names.

To make it simple : model your data so you can get everything you need for your use case with one query -- without breaking the datastore limitations (entity sizes, write throughput, quotas...).

score 3 · Answer 2 · answered Mar 20 '14 at 15:10

A big plus 1 to the other answer from @Michael.

In a addition I would suggest you move over to ndb, db.ReferenceProperty is problematic in that you have to jump through hoops to ensure you fetch the references efficiently. Looping and dereferencing with mycity.region is very expensive because of the multiple roundtrips to the datastore.

If you want to use db, look at Nick Johnsons article on prefetching reference properties.

http://blog.notdot.net/2010/01/ReferenceProperty-prefetching-in-App-Engine

But as I said I would just move to ndb is you haven't invested too much in db code.

So on to modeling your data.

From your data model and comments we know that a city can only belong to a single region,

So consider making the region an ancestor of the city when creating the City rather than holding a reference or key property. Then the region is part of the key, and you can just fetch the parent (region of the city) when required with city.key.parent() if your using ndb.

I don't see any problem with storing regions/cities in the articles. However if you always query by city then having a region as an ancestor means you could just store the city.

The only potential downside would be if you have large numbers of cities and regions assigned to a single article. Remember entity size is limited to 1MB.

I am not sure why you are storing the city name, and region name in your article because you have potentially many cities and regions. You could easily make the id of each city and region the name, then if you need the name in a summary listing of cities and regions for the article you have it already in the key and can potentially skip fetching the entity at that point.

I suppose specific approaches will be very much dictated by the anticipated number of cities/regions associated with each article and how your queries are actually structured.

I would also consider collapsing cities and regions into a single list property (or repeated property) because if you want to list all articles related to citya or regionb you have to perform two queries. If they are both stored as keys in a single list (e.g. locations) , then you can just do a single query like

Article.filter(Article.locations.IN([citya_key, rebionb_key])

This query searches for entities whose locations value (regarded as a list) contains at least one of those values.

I would work through your specific query use cases and then refine your model from there.

How to model geography for datastore inserts?

2 Answers2