1

I have a table with 4 columns with col4 as the partition column in Hive. This is a huge table with ~9M rows inserted every 5 hours. I have a restriction that I cannot change the design of this table as it is used for other reports as well.

CREATE EXTERNAL TABLE testdb.table1(col1 string, col2 int, col3 int)
PARTITIONED BY (col4 string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/input/';

For one of the use cases, I'm trying to create a lookup table to identify some data points for each value in col4 like

select col1, col4 from testdb.table1 group by col1, col4;

Questions:

  1. Will there be any performance hit by doing a Group By on partitioned column? Col4 in this case.

  2. Will it still be a full table scan when I do a Group By on col4?

  3. What is the best way to create a lookup on a partitioned column with minimal full table scan?

I came across this post but it was for Impala. I'm not sure if Impala and Hive use the same MR Engine internally for data processing. So posting this as a new question.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
underwood
  • 845
  • 2
  • 11
  • 22

1 Answers1

1
  1. Switch on mapper-side aggregation for the best performance:

set hive.map.aggr=true;

and compare performance with partitioned and non-partitioned column. In case of partitioned column the data is already partially grouped (files belong to single partition) and map-side aggregation will be performed a bit faster because mappers will create less groups each and less data will be passed to reducers. In other words, all the records needed to perform aggregation is inside the single partition and each file contains only one group (few groups if group by non-partition column also). But over-partitioning may result in too many files and performance degradation.

  1. Why shouldn't it be a full scan if your query requires full scan? Yes it will be full scan for sure because you have no WHERE clause and selecting not only partition column.

  2. Full table scan means no partition pruning. Add WHERE condition on partition column to restrict partitions. If you use only partition columns in the query, technically it can use metadata only, but this is rare case and not like your query.

You can benefit from partitioning best if you are using partition in the WHERE clause.

Using partitioning and DISTRIBUTE+SORT while loading compressed and ORC table you can significantly reduce the size of compressed files (2x or even more), but I have never noticed SIGNIFICANT performance gain from it.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thank you for this detailed explanation. Im already using the hive.map.aggr=true for my queries. Along with it, im as well increasing the number of reducers. – underwood Sep 12 '19 at 01:35
  • Can you please explain more on " all the records needed to perform aggregation is inside the single partition and each file contains only one group (few groups if group by non-partition column also)." – underwood Sep 12 '19 at 01:36
  • @underwood this means that mappers will process the same groups and generate smaller size partial pre-aggregated data which is copied to reducers faster and being processed also faster. – leftjoin Sep 12 '19 at 05:45
  • @leftjoin I am currently inspecting the performance advantage of partitioning used with "group by", and you state that: `In case of partitioned column the data is already partially grouped (files belong to single partition)` - but "belonging to single partition" does not mean belonging to the same node, since partition is just a virtual HDFS folder. The files inside of same partition might be scattered all across the cluster. What's the advantage of aggregating a partition then? – mangusta Apr 02 '20 at 13:40
  • @leftjoin I understand that `set hive.map.aggr=true;` acts as Hadoop combiner (probably even generates one) for decreasing mapper's output, but I can't understand why it should give a better performance for partitioned table compared to non-partitioned one? – mangusta Apr 02 '20 at 13:43
  • To give you better understanding of what I mean, let's say we have 2 files (file1 and file2) in the same partition. file1 has 3 blocks on node1, node2, node3. file2 has 3 blocks on node4, node5, node6. How may Hive take advantage of the fact that both files belong to the same partition? It will still need 6 mappers on 6 nodes, just like in case of non-partitioned table – mangusta Apr 02 '20 at 13:53
  • 1
    ah, I think I'm starting to understand it. In case of `col1, col4` example above, mapper will already know that `col4` has only one value across its block, so if we use aggregation, then mapper with simply find distinct values for `col1` and combine them with that single value of `col4`. In non-partitioned table, mapper would have to find distinct values of both `col1` and `col4`. So aggregation on partitioned data is quicker and produces less amount of input for reducers. – mangusta Apr 02 '20 at 14:57
  • @mangusta makes sense. Partition column is not stored in data files, values are taken from metadata. Location name is not necessarily is in key=value format, it can be altered. Partition columns and partition locations are taken from metadata – leftjoin Apr 02 '20 at 16:38
  • @leftjoin does my reasoning in the last comment make sense? (the one about col1,col4) – mangusta Apr 03 '20 at 11:50