How can I perform a join/merge of a nested attribute across tables?

Question

I have a table (db name is libstats, table name is flowcells) full of documents, all of which have a structure like this:

{
        "barcode": "C3W9UACXX",
        "id": "0021732f-2c42-4e9a-90fd-c68bb0d998dc",
        "libraries": [
            {
                "bases": 2431000000,
                "library_id": "SL58263",
                "perc_raw_clusters": 5.5,
                "pf_reads": 24312986,
                "q30": 92.23,
                "qscore": 35.82,
                "reads": 25834646,
                "lane": 1
             },...
         ]
}

The objects in the 'libraries' array will always have the keys shown. I need to add another key library_name.

I have another table libraries in another database libraries which has this library_name information. Here's an example document from that table:

{
   library_id: 'SL123456',
   library_name: 'my_library_name'
}

How can I use ReQL to accomplish this? I've gotten this far:

r.db('libstats').table('flowcells').merge(function(flowcell){
  return {'libraries': flowcell('libraries').map(function(library){
    return library.merge(
       {'library_name': 'foo'}
      )
  })
  }
}).limit(1)

which gives output in exactly the structure I want, but all of my attempts to acquire the library_name attribute using getField, eqJoin(), and merge() have thus far proved fruitless:

{
        "barcode": "C6841ANXX",
        "id": "007cae10-de3c-44df-9aee-1de9c88c1c21",
        "libraries": [
            {
                "bases": 194000000,
                "lane": "1",
                "library_id": "SL91807",
                "library_name": "foo",
                "perc_raw_clusters": 0.9,
                "pf_reads": 1942910,
                "q30": 96.55,
                "qscore": 36.06,
                "reads": 2045599
            },
        ]
    }

Jorge Silva · Accepted Answer · 2015-07-08T20:06:07.050

2

Naive Implementation

You can do the following:

r.db('libstats').table('flowcells').merge(function(flowcell){
  return {'libraries': flowcell('libraries').map(function(library){
    return library.merge({ 
      // Query the `libraries` table
      'library_name': r.db('libraries').table('libraries')
        // Filter out certain elements
        .filter(function (row) {
          // Return all elements where the `library_id` is equal to 
          // the `library_id` in the `libstats` table
          return row('library_id').eq(library('library_id'))
        })
        // Return the `library_name` for the first element
        (0)('library_name')
    })
  })
  }
})

Keep in mind that you can also use a secondary index for this and make this a bit simpler and more performant.

Better Solution

If you have a lot of documents (10K+), you'll want to create an index on library_id and use the following query:

r.table('libstats').merge(function(flowcell){
  return {'libraries': flowcell('libraries').map(function(library){
    return library.merge({ 
      // Query the `libraries` table
      'library_name': r.table('libraries')
        // Filter out certain elements
      .getAll(library('library_id'), { index: 'library_id' })(0)('library_name')
    })
  })
  }
})

edited Jul 08 '15 at 20:06

answered Jul 07 '15 at 22:57

Jorge Silva

4,574
1
23
42

1

I'm slapping my forehead because I didn't think of that. Bravo! – DeeDee Jul 07 '15 at 23:29
Though I don't see any reason why this query shouldn't work, it times out after 300 seconds. I have an index set on `library_id` in the `libraries` table of the `libraries` DB. Any idea what I could be doing incorrectly? – DeeDee Jul 08 '15 at 03:20
I tested it out and it was working for me. What happens if you test it in the data explorer? Are you logging errors to make sure everything is working correctly? What happens if you start removing parts of the query and seeing when it works? – Jorge Silva Jul 08 '15 at 16:45
The Data Explorer was where I tested it and got the timeout error. The query works fine up until the filter part. For example, the query I show in my question works fine (and very quickly) I see exactly why the `filter()` function *should* work. The 'libraries' table has ~16K entries, but it's indexed on `library_id`. Not sure if the number of entries could be the reason for the delay. – DeeDee Jul 08 '15 at 19:20
Yes, the query was probably too slow (because it need to go through the table 10K+ times). I updated my answer. Keep in mind that indexes have to be used explicitly through `get`, `getAll`, or `between`. – Jorge Silva Jul 08 '15 at 20:08
Very nice. Newbie me keeps thinking there should be some magical direct way, but your updated solution makes perfect sense. Thanks so much for following up! – DeeDee Jul 08 '15 at 21:23
There's a GitHub issue for that if you want to weigh in: https://github.com/rethinkdb/rethinkdb/issues/2356 – Jorge Silva Jul 08 '15 at 21:37
its weird that you need to use round brackets in stead of square brackets to get the first element but okay it works thanks ;) – webmaster Oct 09 '16 at 19:46
@webmaster Think if of it this way. You're not writiing, you're writing ReQL and ReQL turns the code you write into a TCP request that it sends to the server. Hence, it doesn't ACTUALLY have the date it's manipulating at that point. – Jorge Silva Oct 11 '16 at 20:15

score 0 · Answer 2 · answered Dec 16 '16 at 20:59

Here is a shorter version of a NAIVE IMPLEMENTATION in case someone has a similar problem:

r.db('libstats').table('flowcells').merge(function (flowcell) {
    return { libraries: flowcell("libraries").eqJoin("library_id", r.db("libraries").table("libraries") ).zip() };
})

(If there are other fields in libraries beside id and name, they will be joined, too)

How can I perform a join/merge of a nested attribute across tables?

2 Answers2