We have a query that takes a list of parameter values in its IN clause. Some time back this query failed to execute because the list of values grew so large that the resulting statement exceeded Redshift's 16 MB query size limit. We then processed the data in batches to keep each statement below that limit. My question is: what factors or pitfalls should be kept in mind when supplying such a large list to an IN clause, and is there an alternative way to handle this much data?
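For illustration, a minimal Python sketch of the batching approach described above. The table and column names, the quoting, and the 1,000-value batch size are assumptions; real code should use parameter binding or proper escaping rather than interpolating literals.

```python
# Split a large list of IN-clause values into fixed-size batches so each
# generated query stays well under Redshift's 16 MB statement limit.

def batch_values(values, batch_size=1000):
    """Yield successive batches of at most batch_size values."""
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]

def build_in_query(table, column, batch):
    # Quote each value as a string literal (illustrative only; prefer
    # parameter binding in production code).
    literals = ", ".join("'%s'" % v for v in batch)
    return "select * from %s where %s in (%s);" % (table, column, literals)

queries = [build_in_query("master_table", "some_value", b)
           for b in batch_values(["a", "b", "c", "d"], batch_size=2)]
```

Each generated query is then submitted separately and the results combined on the application side.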
- Is it possible to break your query into smaller parts? – timothyclifford Oct 28 '17 at 08:52
- Can you please clarify more on the part of breaking the query? – Gagan Maheshwari Oct 28 '17 at 08:56
- Are you providing the IN clause literally – i.e. generating the very large SQL select statement and then submitting it? And does this need to run as a single select statement, or can you run multiple SQL steps? – Jon Scott Oct 28 '17 at 10:42
- Hi – did my answer work for you? If so, mark it as such; if not, please let me know why not. – Jon Scott Oct 29 '17 at 15:26
2 Answers
If you have control over how you generate your SQL, you could split it up as follows.
First, drop and recreate the filter table:
drop table if exists myfilter;
create table myfilter (filter_text varchar(max));
Second, populate the filter table in batches of a suitable size, e.g. 1,000 values at a time:
insert into myfilter
values ({{myvalue1}}), ({{myvalue2}}), ({{myvalue3}}); -- etc., up to 1000 rows per statement
Repeat the above step until all of your values are inserted.
Then use the filter table as follows:
select * from master_table
where some_value in (select filter_text from myfilter);
drop table myfilter;
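The batched INSERT statements above could be generated from application code. A hypothetical Python sketch, assuming a single-column `myfilter` table as created above (quoting is illustrative; prefer parameter binding in production):

```python
# Generate one multi-row INSERT statement per batch of filter values,
# so no single statement grows too large.

def build_insert_batches(values, batch_size=1000):
    """Return one INSERT statement per batch of values."""
    stmts = []
    for i in range(0, len(values), batch_size):
        batch = values[i:i + batch_size]
        rows = ", ".join("('%s')" % v for v in batch)  # one row per value
        stmts.append("insert into myfilter values %s;" % rows)
    return stmts

stmts = build_insert_batches(["a", "b", "c"], batch_size=2)
```

Each statement is then executed in turn before running the final SELECT with the subquery filter.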

Jon Scott
- Hi Jon, isn't Redshift quite slow in terms of insertions? Won't this approach slow down the whole process? – Gagan Maheshwari Nov 01 '17 at 05:04
- Yes, it will be slower; however, I cannot think of a faster alternative for very large IN clauses. – Jon Scott Nov 01 '17 at 06:20
A large IN list is not best practice in itself; for large lists it's better to use a join:
- construct a virtual table as a subquery
- join your target table to the virtual table
like this:
with your_list as (
    select 'first_value' as search_value
    union select 'second_value'
    ...
)
select ...
from target_table t1
join your_list t2
  on t1.col = t2.search_value
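As the comments below suggest, the UNION subquery can be generated from application code instead of a literal IN list. A minimal Python sketch, assuming string values and illustrative quoting (the asker's app is in Java, but the idea is the same):

```python
# Build a UNION-based subquery from a list of values, to be embedded
# in a WITH clause instead of a large IN list.

def build_union_subquery(values, alias="search_value"):
    """Return 'select v as alias union select v2 ...' for the given values."""
    parts = ["select '%s' as %s" % (values[0], alias)]
    parts += ["union select '%s'" % v for v in values[1:]]
    return "\n".join(parts)

sub = build_union_subquery(["first_value", "second_value"])
query = ("with your_list as (\n%s\n)\n"
         "select * from target_table t1\n"
         "join your_list t2 on t1.col = t2.search_value" % sub)
```

The application composes `query` exactly as it previously composed the IN-list query, then sends it to Redshift for execution.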

AlexYes
- I am not sure whether I can go with this approach, as the data we pass in the IN clause is retrieved from another table and then has a couple of operations/validations performed on it. This processing is currently written in Java. The reason I am reluctant to move it to SQL is that it would increase the query's complexity immensely, which would hurt its understandability. – Gagan Maheshwari Nov 01 '17 at 05:08
- @GaganMaheshwari If I understand this correctly, there's no problem: you can generate this subquery in Java. Instead of just sending the list, send the subquery. – AlexYes Nov 01 '17 at 09:06
- @GaganMaheshwari If you say you have the processing part in Java, I assume you mean you have a Java app talking to Redshift back and forth, composing the IN query in Java and sending it to Redshift for execution. In that case you're free to use the list to construct the query as shown above, rather than just an IN with the list. – AlexYes Nov 06 '17 at 09:04