RethinkDB - Find documents with missing field

Question

I'm trying to write the most optimal query to find all of the documents that do not have a specific field. Is there any better way to do this than the examples I have listed below?

// Get the ids of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).pluck("id")

// Get a count of all documents missing "location"
r.db("mydb").table("mytable").filter({location: null},{default: true}).count()

Right now, these queries take about 300-400ms on a table with ~40k documents, which seems rather slow. Furthermore, in this specific case, the "location" attribute contains latitude/longitude and has a geospatial index.

Is there any way to accomplish this? Thanks!

Jorge Silva · Accepted Answer · 2015-04-21T18:45:52.177

19

A naive suggestion

You could use the hasFields method along with the not method on to filter out unwanted documents:

r.db("mydb").table("mytable")
  .filter(function (row) {
    return row.hasFields({ location: true }).not()
  })

This might or might not be faster, but it's worth trying.

Using a secondary index

Ideally, you'd want a way to make location a secondary index and then use getAll or between since queries using indexes are always faster. A way you could work around that is making all rows in your table have a value false value for their location, if they don't have a location. Then, you would create a secondary index for location. Finally, you can then query the table using getAll as much as you want!

Adding a location property to all fields without a location

For that, you'd need to first insert location: false into all rows without a location. You could do this as follows:

r.db("mydb").table("mytable")
  .filter(function (row) {
    return row.hasFields({ location: true }).not()
  })
  .update({
    location: false
  })

After this, you would need to find a way to insert location: false every time you add a document without a location.

Create secondary index for the table

Now that all documents have a location field, we can create a secondary index for location.

r.db("mydb").table("mytable")
 .indexCreate('location')

Keep in mind that you only have to add the { location: false } and create the index only once.

Use getAll

Now we can just use getAll to query documents using the location index.

r.db("mydb").table("mytable")
 .getAll(false, { index: 'location' })

This will probably be faster than the query above.

Using a secondary index (function)

You can also create a secondary index as a function. Basically, you create a function and then query the results of that function using getAll. This is probably easier and more straight-forward than what I proposed before.

Create the index

Here it is:

r.db("mydb").table("mytable")
 .indexCreate('has_location', 
   function(x) { return x.hasFields('location'); 
 })

Use getAll.

Here it is:

r.db("mydb").table("mytable")
 .getAll(false, { index: 'has_location' })

edited Apr 21 '15 at 18:45

answered Apr 19 '15 at 03:34

Jorge Silva

4,574
1
23
42

I forgot to mention that 'location' does have an index - but it's a geospatial index, so it would never itself be false. It either contains the geo data or is not present at all. – vimfluencer Apr 19 '15 at 08:31
If you really want to make those queries faster, how about this: Create a new field on every document called `hasLocation`. Make it a secondary index. Every field that doesn't have a location will have `hasLocation: false` and then you can use `getAll` with a secondary index. There might be a more elegant way of doing this, but this will solve your problem in the meantime. Would that work? – Jorge Silva Apr 19 '15 at 17:03
Also, quick question: Why do some rows have locations and some don't? What type of data are you working with? It seems that you might need some normalization to distinguish between entries that have a location and places that don't. Maybe there's a better way to structure your data? Maybe you should create two tables depending on the existence of a location? Just a thought. – Jorge Silva Apr 19 '15 at 17:05
1

I'm working with a table full of businesses and their addresses. In theory, all of the businesses should have geocode data - but I'm working against the limit of 2,500 geocoding requests per day on Google's geocoding API. Also, the amount of records is growing every day, so I'm building an incremental process to geocode new addresses and addresses without coordinates daily.Right now the data is pretty sparse, but I actually like the idea of breaking the geodata into it's own table (id, location). Then I'm sure I could run this operation and most geospatial operations much faster. – vimfluencer Apr 19 '15 at 18:51
Ok, that makes total sense. I think creating a different table would definitely work, although you might run into slower queries doing joins later. If that happens, then maybe the approach of having `location` and `hasLocation` might make more sense (since it doesn't require any joins). You could add the `hasLocation: false` as a default and let your Google geocode API handler switch it to `true` once it has updated the location. – Jorge Silva Apr 19 '15 at 21:52
@ambient Just added a new solution, in case you're interested. – Jorge Silva Apr 21 '15 at 18:46
Jorge, the new solution is great! It's very intuitive and optimal in terms of execution time. Since the index is based off of a function, does it still rebuild itself automatically? For example, if I add location data to a document that previously had none, will it automatically fall out of the has_location index? I suppose I could test this pretty easily myself and find out, but I'd like to hear your thoughts too. – vimfluencer Apr 22 '15 at 07:31
Yes, the index will rebuild itself automatically. If you add a new document, it will execute that function on it and save it for future querying. – Jorge Silva Apr 22 '15 at 16:23
How can I do this with multiple conditions? for example 'hasLocation: false' , isCurrent: true' ? – Andrew Oct 15 '16 at 23:19
SO question: http://stackoverflow.com/questions/40065264/rethinkdb-hasfields-to-find-all-documents-with-multiple-multiple-missing-condi – Andrew Oct 15 '16 at 23:27

RethinkDB - Find documents with missing field

1 Answers1

Linked