I have a table with 4 columns with col4 as the partition column in Hive. This is a huge table with ~9M rows inserted every 5 hours. I have a restriction that I cannot change the design of this table as it is used for other reports as well.
CREATE EXTERNAL TABLE testdb.table1(col1 string, col2 int, col3 int)
PARTITIONED BY (col4 string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/input/';
For one of the use cases, I'm trying to create a lookup table to identify some data points for each value in col4 like
select col1, col4 from testdb.table1 group by col1, col4;
Questions:
Will there be any performance hit by doing a Group By on partitioned column? Col4 in this case.
Will it still be a full table scan when I do a Group By on col4?
What is the best way to create a lookup on a partitioned column with minimal full table scan?
I came across this post but it was for Impala. I'm not sure if Impala and Hive use the same MR Engine internally for data processing. So posting this as a new question.