2

Below is the query which I am trying to run on Hive with execution engine as tez.

SELECT A.CITY, A.NAME, B.PRICE
  ,(ROW_NUMBER() OVER (PARTITION BY A.NAME ORDER BY B.PRICE) ) AS RNUM
  FROM TABLE1 A
  LEFT JOIN TABLE2 B 
   ON A.NAME = B.NAME 
   WHERE ( A.COLUMN2 >= B.COLUMN3 AND A.COLUMN2 < B.COLUMN4)
GROUP BY A.CITY, A.NAME, B.PRICE;
  • When I run above query on my data in Hive it keeps running for hours without any results, but syntactically above query is correct.
  • Both TABLEA and TABLEB has millions of records.

I tried changing the data format, increasing the container size, changing the number of reducers and changing heap size. Whatever parameter I change the query is being stuck.

In my further investigation I noticed that the where condition and window function are causing the query to run infinitely.

Here is my question:

  • What is the change that will make the query run end to end on huge data?

Thanks for your help

hadooper
  • 726
  • 1
  • 6
  • 18

1 Answers1

0

I guess this is not caused by memory allocation or reducer count. Could be caused by data skew. Analyze from that point also. This link would help: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

The link below gives insights on skewed tables and list bucketing. Don't miss to read.

https://cwiki.apache.org/confluence/display/Hive/ListBucketing

Thanks!

Marco99
  • 1,639
  • 1
  • 19
  • 32