3

I have two models in my MySQL database: Users and Posts

users have geolocation attributes (lat/long)

posts simply have a body of text.

I want to use Elasticsearch to find all posts that match a string of text plus use the user's location as a filter. The problem is -- the user's location always changes (as people walk around the city). I will be frequently updating the lat/long of each user.

This is my current solution:

Index the posts and have a geolocation attribute in each document. When a user changes location, run an elasticsearch batch update on all that user's posts, and modify the geolocation attribute on those documents.

Obviously this is not a scalable solution -- what if the user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.

Is there a way to "relationally map" the posts to the user's object and use it as a filter, so that when the location changes, I only need to update that user's object instead of all his posts?

TIMEX
  • 259,804
  • 351
  • 777
  • 1,080
  • Can you clarify whether you search only the posts from a user around a given location or any posts by any users who are around a given location? – Val Mar 12 '16 at 04:29
  • @Val I want to search posts (matching body text), but filter by the user's location at any given time (real-time). The user's location is always changing. – TIMEX Mar 12 '16 at 05:24
  • Ok, and all posts by a user X will always be co-located to where that user is currently now and not where he was when the post was created, right? – Val Mar 12 '16 at 05:30
  • @val - right - where he is not (not when the post was created) – TIMEX Mar 12 '16 at 21:59
  • Thanks, and posts have a `user_id` field containing the id of the user who posted, right? Another question is whether your query might return posts from different users who are all around a given location (+ filtered by body text)? – Val Mar 14 '16 at 06:18

1 Answers1

6

Updating 2000 posts per minute is not a big deal either with the update by query plugin or with the upcoming reindex API. However, if you have many users with many posts and you need to update them in short intervals (e.g. 1 min), it might not be that scalable, indeed. Say if it takes 500 milliseconds to update all posts from a user, you'd start to lag behind at around 120 users.

Clearly, since the users' posts need to "follow" the user and don't keep the location the user had when she posted them, I would first query the users around a given location and get their IDs, and then run a second query on posts filtered by those user IDs and the matching body text.

It is perfectly OK to keep both of your indices simple and only update the location in a single user's document every minute. Those two queries I'm suggesting should be quite fast and you should not be worried of running them. People are often worried when they need to run two or more queries in order to find their results. Sometimes, trying to tie the documents to tight together is not the solution and simply running two queries over two indices is the key and works perfectly well.

The query to retrieve users would look similar to the first one below, where you only retrieve the _id property of the user. I'm making the assumption that your user documents have the id of the user as their ES doc _id, so you do not have to retrieve the _source at all (i.e. "_source": false) which is even faster and you can simply return the _id with response filtering:

POST /users/_search?filter_path=hits.hits._id
{
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "100m",
            "location": {
              "lat": 32.5362723,
              "lon": -80.3654783
            }
          }
        }
      ]
    }
  }
}

You'll get all the _id values of the users who are currently 100 meters around the desired geographic location. Then the next query consists of filtering the posts by those ids while matching their body text.

POST /posts/_search
{
  "size": 50,
  "query": {
    "bool": {
      "must": {
        "match": {
          "body": "some text"
        }
      },
      "filter": [
        {
          "terms": {
            "user_id": [ 1, 2, 3, 4 ]
          }
        }
      ]
    }
  }
}
Val
  • 207,596
  • 13
  • 358
  • 360
  • What if the radius is 300 miles, and there are simply too many users within this radius? – TIMEX Mar 17 '16 at 17:54
  • Can you define "too many" ([hint as to how many that might be](http://stackoverflow.com/a/26719362/4604579))? If you have the need to specify 300 miles, then maybe the location filtering should not be that relevant anymore anyway. – Val Mar 17 '16 at 22:39