Why query1 runs for longer time compared to Query2.
Hive Source Table Details
Columns - 166
Partitions columns - 2 columns (all are int datatypes)
Number of Partitions - 3211 partitions
Total Records - 19374461064
File Format - ORC
-- Query1
create table temp1 as
with temp_table1 as
(select col1 from temp_table where col1 between <start_date> and <end_date>), -- these query would get me the dates and there are derived dynamically in the script
select f1.* from sourcetable f1 join temp_table1 f2 on f1.col1=f2.col1;
-- Above query runs for longer time, one of the steps takes ~1300 mappers and ~1000 reducers. -- Execution Time - ~120 Mins
-- Query2
-- if am passing the results of temp table1 in where clause as values, then query retreives results in under ~10mins.
select f1.* from source_table where f1.col1 in (value1, value2, value3, value3... value30);