MongoDB: Cluster documents by geographic location given area and max points?

Question

I'm trying to develop a map-based visualization which includes a "heat map" of subpopulations, based on a MongoDB collection that contains documents like this:

{
    "PlaceName" : "Boston",
    "Location" : {
        "type" : "Point",
        "coordinates" : [ 42.358056, -71.063611 ]
    },
    "Subpopulations": {
        "Age": { 
                "0_4" : 37122,
                "6_11" : 33167,
                "12_17" : 35464,
                "18_24" : 130885,
                "25_34" : 127058,
                "34_44" : 79092,
                "45_54" : 72076,
                "55_64" : 59766,
                "65_74" : 33997,
                "75_84" : 20219,
                "85_" : 9057
        }
    }
}

There are hundreds of thousands of individual locations in the database. They do not overlap -- i.e. there wouldn't be two individual entries for "New York City" and "Manhattan".

The goal is to use Leaflet.js and some plugins to render various visualizations of this data. Leaflet's quite good at clustering data client-side -- so if I passed it a thousand locations with density values, it could render a heat map of the relevant area just by crunching all the individual values.

The problem is, say I zoom out in the map to show the whole world. It would be horribly inefficient, if not impossible, to send all that data to the client and have it process that info quickly enough to make for a smooth visualization.

So what I need to do is automatically cluster the data server-side, which I'm hoping can be done in a MongoDB query. I've read that geohashing may be a good starting point to determine which points belong in which clusters, but I'm sure someone has done this exact thing before and might have better insight than just that. Ideally I'd like to send off a query to my node.js script that looks like this:

http://myserver.com/popdata?top=42.48&left=-80.57&bottom=37.42&right=-62.55&stat=Age&value=6_11

which would determine how granular the clustering needs to be based on how many individual points are within that specified geographic area, given a maximum number of data points to return, or something along those lines; and it would return the data like this:

[
    { "clusterlocation": [ 42.304, -72.622 ], "total_age_6_11": 59042 },
    { "clusterlocation": [ 36.255, -64.124 ], "total_age_6_11": 7941 },
    { "clusterlocation": [ 40.425, -70.693 ], "total_age_6_11": 90257 },
    { "clusterlocation": [ 39.773, -67.992 ], "total_age_6_11": 102752 },
    ...
]

...where "clusterlocation" is something like the mean of all locations of documents in the cluster, and "total_age_6_11" is the sum of those documents' values for "Subpopulations.Age.6_11".

Is this something I can do purely in a Mongo query? Is there a "tried and tested" way to do it well?

This would be difficult on raw data alone, without some preallocated concept of "clustering", either through additional attributes or by pre-aggregating into other collections at a granularity matched to the zoom level. The basic issue for single-query handling is that while you could use `$geoNear` to determine proximity to a central point (say, the center of the selected area), that only gives you each point's distance from that center to cluster on; it does not account for the points' proximity to one another. So you would basically need to iterate over the point data to find the nearest neighbors of each point. — Neil Lunn, Apr 15 '16 at 23:49
TLDR of above is, *"without precalculated cluster assignment, this is not very performant"*. — Neil Lunn, Apr 15 '16 at 23:50

DhruvPathak · Accepted Answer · 2016-04-21T17:15:27.440

Even if you do this clustering at query time, it is going to be inefficient and not fast enough to make for a good user interface. I would suggest you pregenerate clusters of specific sizes and store them in your current collection alongside your original documents. Here is how:

Each document will store an additional field (let's call it geolevel), which denotes how small or large an entity it is. Your base documents will have geolevel=1:

{
    "PlaceName" : "Boston",
    "Location" : {
        "type" : "Point",
        "coordinates" : [ 42.358056, -71.063611 ]
    },
    "Subpopulations": {
        "Age": { 
                "0_4" : 37122,
                "6_11" : 33167,
                "12_17" : 35464,
                "18_24" : 130885,
                "25_34" : 127058,
                "34_44" : 79092,
                "45_54" : 72076,
                "55_64" : 59766,
                "65_74" : 33997,
                "75_84" : 20219,
                "85_" : 9057
        }
    },
    "geolevel":1  // added geolevel
}

You can run a processing job on your database to pre-generate similar documents for clusters, at multiple levels. For example, geolevel:2 would be a cluster of a few cities within a 250 km radius, and geolevel:3 would be a cluster of geolevel:2 clusters.
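The answer does not prescribe a clustering algorithm, so the following is only a minimal sketch of one possible pre-generation step, swapping in a fixed latitude/longitude grid for the "250 km radius" idea. The function name buildClusters, the cellSizeDegrees parameter, and the generated PlaceName format are all hypothetical. It assumes each input document has an _id and that coordinates are ordered [latitude, longitude] as in the sample document (GeoJSON itself specifies [longitude, latitude]). Because every document falls into exactly one grid cell, no document is counted in two clusters.

```javascript
// Sketch only: bins documents into fixed-size lat/lng grid cells and
// produces one cluster document per non-empty cell.
// Assumes coordinates are [latitude, longitude], as in the sample document.
function buildClusters(docs, cellSizeDegrees, geolevel) {
  const cells = new Map();
  for (const doc of docs) {
    const [lat, lng] = doc.Location.coordinates;
    // The cell key identifies the single grid square containing this point.
    const key =
      Math.floor(lat / cellSizeDegrees) + ":" + Math.floor(lng / cellSizeDegrees);
    let cell = cells.get(key);
    if (!cell) {
      cell = { latSum: 0, lngSum: 0, count: 0, age: {}, childs: [] };
      cells.set(key, cell);
    }
    cell.latSum += lat;
    cell.lngSum += lng;
    cell.count += 1;
    cell.childs.push(doc._id);
    // Sum every age bucket across the members of the cell.
    for (const [bucket, value] of Object.entries(doc.Subpopulations.Age)) {
      cell.age[bucket] = (cell.age[bucket] || 0) + value;
    }
  }
  return Array.from(cells.entries(), ([key, cell]) => ({
    PlaceName: "cluster_" + geolevel + "_" + key, // hypothetical naming scheme
    Location: {
      type: "Point",
      // Unweighted mean of the member locations.
      coordinates: [cell.latSum / cell.count, cell.lngSum / cell.count],
    },
    Subpopulations: { Age: cell.age },
    geolevel: geolevel,
    childs: cell.childs, // ids of the member documents
  }));
}
```

Running this over the geolevel:1 documents with a small cell size, and then over its own output with a larger cell size, would give the geolevel:2 and geolevel:3 layers. A simple mean of coordinates misbehaves for cells that straddle the antimeridian, so a real job would need to handle that case.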

You can also store a field such as memberids (called childs in the example below) to hold the ids of the children in each cluster. This can be necessary to keep an entity from going into two adjacent clusters; it can be assigned to either one of them and your visualization will still work fine. A geolevel:2 cluster doc would look like:

{
    "PlaceName" : "cluster_sdfs34535",  // The id can be generated from a hash (e.g. SHA) of the list of all child ids.
    "Location" : {  // center of the cluster
        "type" : "Point",
        "coordinates" : [ 42.358056, -71.063611 ]
    },
    "Subpopulations": { // total population of the cluster
        "Age": { 
                "0_4" : 371220,
                "6_11" : 331670,
                "12_17" : 354640,
                "18_24" : 1308850,
                "25_34" : 1270580,
                "34_44" : 790920,
                "45_54" : 720760,
                "55_64" : 597660,
                "65_74" : 339970,
                "75_84" : 202190,
                "85_" : 90570
        }
    },
    "geolevel":2 ,
    "childs":[4,5,6,7] // ids of child documents
}
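As a rough illustration of the hash-based id mentioned in the PlaceName comment, the sketch below derives a name from the child ids using Node's built-in crypto module. The function name clusterName, the choice of SHA-256, the "cluster_" prefix, and the 12-character truncation are all assumptions made for the example, not something the answer specifies.

```javascript
const { createHash } = require("crypto");

// Sketch: derive a stable cluster name from the ids of its children.
// Sorting first makes the name independent of the order in which the
// children were collected.
function clusterName(childIds) {
  const canonical = childIds.map(String).sort().join(",");
  const digest = createHash("sha256").update(canonical).digest("hex");
  // Truncated for readability; keep the full digest if collisions matter.
  return "cluster_" + digest.slice(0, 12);
}
```

With this approach, regenerating the clusters from the same set of children yields the same name, which makes the pre-generation job safe to re-run.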

Now your visualization app needs to map the zoom level to a geolevel and select documents based on that. For a city-level visualization, you can query for geolevel:1 documents, and as you zoom out to state, country, etc., you can increase the geolevel to 2, 3...
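The zoom-to-geolevel mapping and the resulting query could be sketched roughly as follows. The zoom thresholds, the function names, and the extra zoom parameter are hypothetical (the URL in the question carries no zoom value), and the thresholds would need tuning against real data. The filter uses $geoWithin with a GeoJSON Polygon, which assumes Location.coordinates is stored in GeoJSON order, [longitude, latitude]; the sample documents appear to list latitude first, so they would need to be swapped for this to work.

```javascript
// Hypothetical thresholds on the Leaflet zoom level (0 = whole world).
function geolevelForZoom(zoom) {
  if (zoom >= 9) return 1; // city level: original documents
  if (zoom >= 5) return 2; // state level: clusters of cities
  return 3;                // country level and beyond: clusters of clusters
}

// Builds a filter and projection for the visible map area.
// Assumes coordinates are stored as [longitude, latitude].
function buildPopQuery({ top, left, bottom, right, stat, value, zoom }) {
  const field = "Subpopulations." + stat + "." + value;
  return {
    filter: {
      geolevel: geolevelForZoom(zoom),
      Location: {
        $geoWithin: {
          $geometry: {
            type: "Polygon",
            // Closed ring: the first and last positions are identical.
            coordinates: [[
              [left, bottom],
              [right, bottom],
              [right, top],
              [left, top],
              [left, bottom],
            ]],
          },
        },
      },
    },
    // Return only the location and the requested statistic.
    projection: { Location: 1, [field]: 1 },
  };
}
```

With the official Node.js driver, the result could be passed along as collection.find(q.filter, { projection: q.projection }). Polygon edges on a sphere are geodesics rather than lines of constant latitude, so for large areas this only approximates the rectangular viewport.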

I really like the approach of building a hierarchy this way. Seems like a fairly straightforward task to build a mechanism for generating these documents. Much appreciated. — DanM, Apr 21 '16 at 16:52
