1

Why query1 runs for longer time compared to Query2.

Hive Source Table Details

Columns - 166
Partitions columns - 2 columns (all are int datatypes)
Number of Partitions - 3211 partitions
Total Records - 19374461064
File Format - ORC

-- Query1

create table temp1 as 
with temp_table1 as 
(select col1 from temp_table where col1 between <start_date> and <end_date>), -- these query would get me the dates and there are derived dynamically in the script

select f1.* from sourcetable f1 join temp_table1 f2 on f1.col1=f2.col1;

-- Above query runs for longer time, one of the steps takes ~1300 mappers and ~1000 reducers. -- Execution Time - ~120 Mins

-- Query2

-- if am passing the results of temp table1 in where clause as values, then query retreives results in under ~10mins.

select f1.* from source_table where f1.col1 in (value1, value2, value3, value3... value30);
Vijiy
  • 1,187
  • 6
  • 21
  • I faced the same issue and as the workaround I prepared list of values in one query and passed it into second one as a parameter: https://stackoverflow.com/a/56963448/2700344 When nothing else works, this workaround can help – leftjoin Jun 15 '21 at 07:36
  • 1
    @leftjoin - Thank you, Yes for now even I doing the same. getting the filter values before and passing them as hivevar to HQL's. It is creating problem when I have multiple filters and filters specific for each client (we have more than 200 clients). – Vijiy Jun 15 '21 at 08:16

0 Answers0