0

Non-partitioned hive managed ORC format table load takes 2 hours whereas the same insert takes more than 5 hours to load the partitioned table. Why dynamic partition on a single column(date) make huge difference in performance? Any suggestion to improve the performance would be great help.

Its about 500 million rows per day and I am trying to load for 3 months. So 3 months to non-partition table takes 2 hours but same to partitioned table takes 5 hours. Both source and target tables are ORC format.

Syed
  • 11
  • 4
  • What takes more time is the map-reduce operation since Hive (Yarn) needs to extract and sort the partitioned column (date) for each row. But 2 hours is even extremely slow. How many input rows do you have and what is your input format? If you upload from HDFS big compressed files, use a splitable compression format like BZIP2 and not GZ or ZIP – Harold Jun 27 '18 at 12:54
  • @Harold Its about 500 million rows per day and I am trying to load for 3 months. So 3 months to non-partition table takes 2 hours but same to partitioned table takes 5 hours. Both source and target tables are ORC format. – Syed Jun 28 '18 at 08:12
  • Thanks so I guess 5 hours is not so bad then; Try with `set hive.vectorized.execution.enabled = true;` – Harold Jun 28 '18 at 08:15
  • @Harold Yes, vectorization is enabled. The partitioning takes 3 hours extra. I what to understand the process behind the extra hours. Could you share any thread or link! – Syed Jun 28 '18 at 10:24
  • There's a good explanation alrady here https://stackoverflow.com/questions/30921515/main-difference-between-dynamic-and-static-partitioning-in-hive – Harold Jun 28 '18 at 10:41

0 Answers0