
UPDATE: Looking at this more, I'm thinking I just need to denormalize the data further and simply add the relevant location information (city/state/etc.) to each row of the referrals table, rather than doing a join against a large set of ZIP codes. Am I on the right track?

Original question:

I have two tables: one containing ZIP code information, and a referrals table with about 4 million rows, each holding a ZIP code and some other info. Given a particular ZIP code, I want to get all the ZIPs in the same state and/or city (just the state in the example below), and then pull every row from the referrals table that matches both those ZIPs and some other category data, which comes from other parts of the system.

I'm using the query below, which works but takes about 10 seconds to run.

r.table('all_geocodes')
  // Look up the state for the given ZIP code...
  .getAll('77019', {index: 'postal_code'})
  .pluck('state_name')
  // ...then expand to every geocode row in that state.
  .eqJoin(r.row('state_name'), r.table('all_geocodes'), {index: 'state_name'})
  .zip()
  .pluck('location_id')
  .distinct()
  // Join the distinct ZIPs against the referrals table on the
  // compound index [location_id, taxonomy, entity_type_code].
  .eqJoin(function(v) {
      return [v('location_id'), '207P00000X', 1];
    },
    r.table('fact_referrals_aggregates'),
    {index: 'location_and_taxonomy_and_entity_type_code'})
  .zip()
  .count()

A few relevant numbers:

  • The first part of the query returns about 2,700 ZIP codes.
  • getAll(['207P00000X', 1]) on the referrals table returns about 100,000 rows, and doing an inner join of those against the ZIP codes is much slower than what I have (see the sketch after this list).
  • The referrals table also has about 100,000 rows that match on just the ZIP codes.
  • About 8,000 rows match on both the ZIP codes and the ['207P00000X', 1] parameters.
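
For reference, the inverted approach from the second bullet looked roughly like the sketch below. It's not my exact code: the 'taxonomy_and_entity_type_code' index name is a placeholder, and I'm assuming referral rows store their ZIP in location_id (which the compound index above implies). Since innerJoin can't use an index, it falls back to comparing every pair of rows:

    // Sketch of the slower inverted join. The index name
    // 'taxonomy_and_entity_type_code' is a placeholder, not a real index.
    var stateZips = r.table('all_geocodes')
        .getAll('77019', {index: 'postal_code'})
        .pluck('state_name')
        .eqJoin(r.row('state_name'), r.table('all_geocodes'),
                {index: 'state_name'})
        .zip().pluck('location_id').distinct();

    r.table('fact_referrals_aggregates')
        // ~100,000 referral rows for this taxonomy / entity type
        .getAll(['207P00000X', 1], {index: 'taxonomy_and_entity_type_code'})
        // Non-indexed join against the ~2,700 ZIP rows
        .innerJoin(stateZips, function(ref, geo) {
            return ref('location_id').eq(geo('location_id'));
        })
        .count()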

As you can see from the query analyzer output below, RethinkDB is inserting a concatMap which is taking up nearly all of the time. This is my first day using RethinkDB, so I'm sure there's something I'm missing, but I don't know what it is. Any suggestions on how to improve the performance of this query? Is it possible to avoid the concatMap by structuring things differently?

    {
    "description": "Evaluating count.",
    "duration(ms)": 9428.348954,
    "sub_tasks": [
      {
        "description": "Evaluating zip.",
        "duration(ms)": 9394.828064,
        "sub_tasks": [
          {
            "description": "Evaluating eq_join.",
            "duration(ms)": 9198.099333,
            "sub_tasks": [
              {
                "description": "Evaluating concatmap.",
                "duration(ms)": 9198.095406,
                "sub_tasks": [
                  {
                    "description": "Evaluating distinct.", // Distinct ZIP codes
                    "duration(ms)": 114.880663,
                    "sub_tasks": [
                      { *snipped for brevity* },
                      {
                        "description": "Evaluating elements in distinct.",
                        "mean_duration(ms)": 0.001039,
                        "n_samples": 2743
                      }
                    ]
                  },
                  {
                    "description": "Evaluating func.",
                    "duration(ms)": 0.004119,
                    "sub_tasks": []
                  },
                  {
                    "description": "Evaluating stream eagerly.",
                    "mean_duration(ms)": 1.0934,
                    "n_samples": 8285
                  }
                ]
              }
            ]
          },
          {
            "description": "Evaluating stream eagerly.",
            "mean_duration(ms)": 0.001005,
            "n_samples": 8285
          }
        ]
      }
    ]
    }
  • I'd agree with your conclusion. Denormalization is probably the way to go! I would just add a state field to all rows in the `fact_referrals_aggregates` table and then query through that. I'd add an answer with the queries you'd need to write, but you seem like you can totally handle that :) – Jorge Silva Apr 19 '15 at 03:58
  • Thanks Jorge. Yes, it did help a lot! – Daniel Imfeld Apr 19 '15 at 07:29

1 Answer


Denormalization ended up being the way to go here. I added state and city fields to the referrals table, eliminating the massive join against the ZIP table, and speed improved greatly.
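
Concretely, the migration and the new query looked roughly like the sketch below. The field names 'taxonomy' and 'entity_type_code' are stand-ins inferred from the old index name, and the backfill assumes each referral's location_id is a ZIP with exactly one matching row in all_geocodes:

    // One-time backfill: copy state_name from the geocode table onto each
    // referral row. nonAtomic is required because the update performs a
    // cross-table read.
    r.table('fact_referrals_aggregates').update(function(ref) {
        return {
            state_name: r.table('all_geocodes')
                .getAll(ref('location_id'), {index: 'postal_code'})
                .nth(0)('state_name')
        };
    }, {nonAtomic: true});

    // Compound index over the denormalized field plus the category data.
    r.table('fact_referrals_aggregates').indexCreate(
        'state_and_taxonomy_and_entity_type_code',
        [r.row('state_name'), r.row('taxonomy'), r.row('entity_type_code')]);

    // The original ten-second query then collapses to one indexed getAll:
    r.table('all_geocodes')
        .getAll('77019', {index: 'postal_code'})
        .nth(0)('state_name')
        .do(function(state) {
            return r.table('fact_referrals_aggregates')
                .getAll([state, '207P00000X', 1],
                        {index: 'state_and_taxonomy_and_entity_type_code'})
                .count();
        })

With the state stored on every row, the query never has to expand a state into thousands of ZIPs, so the concatMap disappears entirely.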

  • Glad you figured it out! That's one of the added advantages of NoSQL + joins: you can do a lot of optimization by going back and forth on normalization. This is a perfect example. – Jorge Silva Apr 19 '15 at 16:54