Are there any recommendations for speeding up a query that joins two very large Hive tables (> 2 TB)? The execution engine is Tez. Both tables are unpartitioned and stored in text format. The cluster has 64 nodes with 128 GB of RAM each.
The smaller the data volumes, the faster the query - therefore, columnar. – David דודו Markovitz Mar 17 '17 at 08:19
-
Thanks Dudu... are you suggesting using a columnar data format for the tables, like Parquet or ORC? – KBR Mar 17 '17 at 08:27
-
Definitely. In addition, see https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution. (Currently for ORC; I think there is a similar project for Parquet.) – David דודו Markovitz Mar 17 '17 at 08:34
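
To make the suggestion in the comments concrete, here is a minimal HiveQL sketch of copying the text tables into ORC and turning on vectorized execution before running the join. The table names (`big_a`, `big_b`) and column names (`id`, `col1`, `col2`) are hypothetical placeholders, not taken from the question, and the actual speedup depends on the data and the cluster.

```sql
-- Hypothetical names: big_a / big_b are the existing text-format tables.
-- Copy each one into a columnar ORC table with CREATE TABLE ... AS SELECT.
CREATE TABLE big_a_orc STORED AS ORC AS SELECT * FROM big_a;
CREATE TABLE big_b_orc STORED AS ORC AS SELECT * FROM big_b;

-- Enable vectorized query execution (applies to ORC-backed tables).
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Run the join against the ORC copies, selecting only the needed columns
-- so the columnar format can skip reading the rest.
SELECT a.id, a.col1, b.col2
FROM big_a_orc a
JOIN big_b_orc b
  ON a.id = b.id;
```

Listing only the columns that are actually needed, rather than `SELECT *`, is what lets a columnar format reduce the volume of data read, which is the point made in the first comment.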