UPDATE: Looking at this more, I'm thinking that I just need to denormalize the data more, and simply add the relevant location information (city/state/etc.) to each row of the referrals table, rather than doing a join on a large set of ZIP codes. Am I on the right track?
Original question:
I have two tables, one containing zip code information and a referrals table which contains about 4 million rows, each with a ZIP code and some other info. Given a particular ZIP code, I want to get all the ZIPs in the same state and/or city (just state in the below example), and then pull all the rows from the referrals table that both match those ZIPs and some other category data, which comes from some other parts of the system.
I'm using this query below, which works, but takes about 10 seconds to run.
r.table('all_geocodes').getAll('77019',
{index:'postal_code'}).pluck('state_name')
.eqJoin(r.row('state_name'), r.table('all_geocodes'),
{index: 'state_name'}).zip().pluck('location_id').distinct()
.eqJoin(function(v) {
return [v('location_id'), '207P00000X', 1];
},
r.table('fact_referrals_aggregates'),
{index: 'location_and_taxonomy_and_entity_type_code'})
.zip().count()
A few relevant numbers:
- The first part of the query returns about 2700 ZIP codes.
getAll(['207P00000X', 1])
on the referrals returns about 100000 rows, and doing an inner join on that vs. the ZIP codes is much slower than what I have.- It also has about 100000 rows that match on just the ZIP codes.
- There are about 8000 rows that match on both the ZIP code and the
['207P00000X', 1]
parameters.
As you can see from the query analyzer output below, ReDB is inserting a concatmap which is taking up all of the time. This is my first day using RethinkDB so I'm sure there's something I'm missing but don't know what it is. Any suggestions on how to improve the performance on this query? Is it possible to avoid the concatMap by structuring things differently?
{
"description": "Evaluating count.",
"duration(ms)": 9428.348954,
"sub_tasks": [
{
"description": "Evaluating zip.",
"duration(ms)": 9394.828064,
"sub_tasks": [
{
"description": "Evaluating eq_join.",
"duration(ms)": 9198.099333,
"sub_tasks": [
{
"description": "Evaluating concatmap.",
"duration(ms)": 9198.095406,
"sub_tasks": [
{
"description": "Evaluating distinct.", // Distinct ZIP codes
"duration(ms)": 114.880663,
"sub_tasks": [
{ *snipped for brevity* },
{
"description": "Evaluating elements in distinct.",
"mean_duration(ms)": 0.001039,
"n_samples": 2743
}
]
},
{
"description": "Evaluating func.",
"duration(ms)": 0.004119,
"sub_tasks": []
},
{
"description": "Evaluating stream eagerly.",
"mean_duration(ms)": 1.0934,
"n_samples": 8285
}
]
}
]
},
{
"description": "Evaluating stream eagerly.",
"mean_duration(ms)": 0.001005,
"n_samples": 8285
}
]
}
]