
I have a ~2TB fully vacuumed Redshift table with distkey phash (high cardinality, hundreds of millions of values) and a compound sortkey of (phash, last_seen).

When I do a query like:

SELECT
    DISTINCT ret_field
FROM
    table
WHERE
    phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae'
    )
AND
    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It returns very quickly. However, when I increase the number of hashes beyond 10, Redshift converts the IN condition from a series of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists

The problem is that with a couple dozen phash values, the "optimized" query goes from sub-second response time to over half an hour. In other words, it stops using the sortkey and does a full table scan.

Any idea how I can prevent this behavior and retain the use of sortkeys to keep the query quick?

Here is the EXPLAIN difference between <10 hashes and >10 hashes:

Less than 10 (0.4 seconds):

XN Unique  (cost=0.00..157253450.20 rows=43 width=27)
    ->  XN Seq Scan on table  (cost=0.00..157253393.92 rows=22510 width=27)
                Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))

More than 10 (45-60 minutes):

XN Unique  (cost=0.00..181985241.25 rows=1717530 width=27)
    ->  XN Seq Scan on table  (cost=0.00..179718164.48 rows=906830708 width=27)
                Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))
Harry
  • I'm not understanding when you say "it stops using the sortkey and does a full table scan." Redshift always does a full table scan, but it might use the sortkey to skip blocks. Can you provide the exact explain of the query? – Mark Hildreth Nov 17 '15 at 18:58
  • No problem @MarkHildreth - I just edited the main post to include the `EXPLAIN` queries. – Harry Nov 17 '15 at 19:11
  • Remark, not very fair to SO readers and users (but you can post the solution here): there is a dedicated mailing list for postgresql performance questions. – Str. Nov 20 '15 at 18:21
  • Show us tables structure – Muhammad Muazzam Dec 02 '15 at 04:47
  • The actual table definition showing data types and constraints is ***essential*** for a performance question like this. Preferably a complete `CREATE TABLE` statement, and all relevant index definitions. – Erwin Brandstetter Dec 02 '15 at 22:03

5 Answers


You can try creating a temporary table or subquery:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (
   SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
   UNION ALL 
   SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
   UNION ALL
   SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
   -- UNION ALL
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

Alternatively, do the search in chunks (if the query optimizer merges them back into one scan, use an auxiliary table to store the intermediate results):

SELECT ret_field
FROM table
WHERE phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN (...) -- more hashes
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION 
-- ...

If the query optimizer merges the chunks back into one scan, you can try using a temp table for the intermediate results.

EDIT:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
      FROM ...
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
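
A rough sketch of the temp-table variant of that EDIT; source_of_hashes below is just a placeholder for whatever query actually produces the hash list:

CREATE TEMP TABLE tmp_hashes AS
SELECT DISTINCT phash
FROM source_of_hashes;   -- placeholder for the query that yields the hashes

SELECT DISTINCT t.ret_field
FROM table t
JOIN tmp_hashes h ON t.phash = h.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';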
Lukasz Szozda
  • That actually works to continue using the index (thanks!), but I need to return the list of `phash` values from another subquery... it's not manual/coded. Is there a way to use/abuse the UNION ALL from another subquery's result? :( – Harry Nov 17 '15 at 18:53
  • @Harry You can change `UNION ALL` with anything that returns phash – Lukasz Szozda Nov 17 '15 at 18:56
  • I've tried that `EDIT` before, and it has the same table scanning effect. And I can't break it into chunks because the hashes all come back from Redshift in one big batch. – Harry Nov 17 '15 at 19:05
  • @lad2025, your second variant that splits hashes in small chunks using `SELECT DISTINCT ... UNION ALL SELECT DISTINCT ... UNION ALL ...` is not equivalent to original query in the question. Original query has `DISTINCT` over all values of `ret_field`. Your variant can return duplicates. It seems that you'd need to use `UNION`, not `UNION ALL`. And with `UNION` there is no need for `DISTINCT`s. – Vladimir Baranov Nov 21 '15 at 11:59
  • @lad2025 This is an interesting idea, but again I can't control the number or manual iterations. I need to construct a query that can handle anywhere from 2-3 rows to tens of thousands of rows. Any ideas? – Harry Nov 21 '15 at 20:12

Do you really need DISTINCT? This operator can be expensive.

I'd try to use LATERAL JOIN. In the query below, the table Hashes has a column phash - this is your big batch of hashes. It could be a temp table, a (sub)query, anything.

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN LATERAL
    (
        SELECT table.ret_field
        FROM table
        WHERE
            table.phash = Hashes.phash
            AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
    ) AS T ON true

It is quite likely that the optimizer implements LATERAL JOIN as a nested loop. It would loop through all rows in Hashes and, for each row, run the inner SELECT against table. The inner SELECT should use the index that you have on (phash, last_seen). To play it safe, include ret_field in the index as well to make it a covering index: (phash, last_seen, ret_field).


There is a very valid point in the answer by @Diego: instead of putting constant phash values into the query, put them in a temporary or permanent table.

I'd like to extend the answer by @Diego and add that it is important that this table with hashes has an index, a unique index.

So, create a table Hashes with one column phash that has exactly the same type as your main table.phash. It is important that the types match. Make that column a primary key with a unique clustered index. Dump your dozens of phash values into the Hashes table.
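
Redshift itself has no clustered indexes, so a rough Redshift equivalent of the same idea is a small table distributed and sorted on phash. A sketch, with VARCHAR(32) assumed here (match whatever type table.phash really has):

CREATE TABLE Hashes (
    phash VARCHAR(32) NOT NULL   -- must match the type of table.phash exactly
)
DISTKEY (phash)
SORTKEY (phash);

INSERT INTO Hashes (phash) VALUES
    ('5c8615fa967576019f846b55f11b6e41'),
    ('8719c8caa9740bec10f914fc2434ccfd'),
    ('9b657c9f6bf7c5bbd04b5baf94e61dae');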

Then the query becomes a simple INNER JOIN, not lateral:

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN table ON table.phash = Hashes.phash
WHERE
    table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It is still important that table has index on (phash, last_seen, ret_field).

The optimizer should be able to take advantage of the fact that both joined tables are sorted by the phash column and that it is unique in the Hashes table.

Vladimir Baranov
  • I've tried every variation possible for lateral joins and I continually get a syntax error. Are you certain they are supported on Redshift? – Harry Nov 21 '15 at 17:28
  • @Harry, no, I'm not sure that Redshift has `LATERAL JOIN`. I saw the Postgres tag and didn't pay attention to Redshift tag. Bad luck. Does Redshift have stored procedures and cursors? Usually cursors are slower than declarative SQL when they do the same thing. But, in this case declarative SQL is not doing index seek for each `phash`, so explicit loop for each `phash` with appending results into a temporary table may be faster overall. – Vladimir Baranov Nov 21 '15 at 22:39

It's worth trying to set the sortkey to (last_seen, phash), putting last_seen first.

The slowness might be because the leading column of the sort key is phash, which is essentially a random string. As the AWS Redshift dev docs say, a timestamp column should be the leading column of the sort key when it is used in WHERE conditions.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift

With this order of the sort key, all rows will be sorted by last_seen, then phash. (What does it mean to have multiple sortkey columns?)

One note: you have to recreate your table to change the sort key. This will help you do that.
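
A rough sketch of the recreate-and-reload, with the column list and types made up purely for illustration:

CREATE TABLE table_new (
    phash     VARCHAR(32),
    last_seen TIMESTAMP,
    ret_field VARCHAR(64)
)
DISTKEY (phash)
COMPOUND SORTKEY (last_seen, phash);

INSERT INTO table_new SELECT phash, last_seen, ret_field FROM table;
-- VACUUM and ANALYZE the new table, then swap names:
-- ALTER TABLE table RENAME TO table_old;
-- ALTER TABLE table_new RENAME TO table;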

Masashi M
  • Simple solution, but this solved it! Still not blazing fast, but apparently sortkeys are horribly inefficient on random strings. – Harry Dec 08 '15 at 22:11

You can get rid of the "ORs" by inserting the data you want into a temp table and joining it with your actual table.

Here's an example (I'm using a CTE because, with the tool I'm using, it's hard to capture the plan when there is more than one SQL statement, but go with a temp table if you can):

select * 
from <my_table>
where checksum in 
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)

VERSUS

with foo as
(
    select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
    select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
    select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
    select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select  * 
from <my_table> r 
     join foo f on r.checksum = F.my_key

And here's the plan. As you can see it looks more complex, but that's because of the CTE; it wouldn't look that way with a temp table:

[query plan screenshot]
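
The temp-table version of the same idea would look roughly like this (same placeholder checksums as above):

create temp table foo (my_key varchar(40));

insert into foo values
    ('d7360f1b600ae9e895e8b38262cee47936fb6ced'),
    ('d1606f795152c73558513909cd59a8bc3ad865a8'),
    ('bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84'),
    ('b2cad5a92575ed3868ac6e405647c2213eea74a5');

select *
from <my_table> r
     join foo f on r.checksum = f.my_key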

Diego

Did you try using UNION for all the phash values?

Like this:

SELECT ret_field 
FROM   table 
WHERE  phash = '5c8615fa967576019f846b55f11b6e41' -- 1st phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = '8719c8caa9740bec10f914fc2434ccfd' -- 2nd phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = '9b657c9f6bf7c5bbd04b5baf94e61dae' -- 3rd phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

-- and so on...

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = 'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn' -- Nth phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
Christian