
I have an input table intab:

create table intab (
  id1 string comment 'id1'
  , id2 string comment 'id2'
  , n int comment 'n'
) comment 'test'
partitioned by (ds string comment 'date partition field')
stored as rcfile;

I need to compute the output table outtab:

create table outtab as select
  id1, id2, sum(n) as sum_n
from intab group by id1, id2;

Both intab and outtab are stored as RCFILE tables in Hive.

When using Spark to compute outtab from this large input, I frequently encounter errors like "Error: TransportResponseHandler: Still have 1 requests outstanding when connection ..." and the Spark tasks fail.

Similar to the problem described in https://forums.databricks.com/questions/10872/error-transportresponsehandler-still-have-1-reques-1.html, the errors only appear on large input: intab has over 500 billion records and about 15 TB of storage.

I configured the following Spark parameters, but they did not help:

spark.sql.shuffle.partitions=100000
spark.blacklist.enabled=true
spark.network.timeout=600s
spark.sql.broadcastTimeout=1000
spark.driver.maxResultSize=2g
spark.executor.memoryOverhead=2048
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false
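
For completeness, these settings are passed on the command line; the spark-submit invocation looks roughly like this (the script name and driver/executor sizing below are placeholders, not my actual values):

```
spark-submit \
  --conf spark.sql.shuffle.partitions=100000 \
  --conf spark.blacklist.enabled=true \
  --conf spark.network.timeout=600s \
  --conf spark.sql.broadcastTimeout=1000 \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.executor.memoryOverhead=2048 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.shuffle.service.enabled=false \
  aggregate_outtab.py   # placeholder script that runs the group-by query
```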

I have also checked that there is no data skew on the key pairs (id1, id2) used in the group by.
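
(For reference, the skew check was a query along these lines, inspecting the record counts of the heaviest keys:)

```sql
-- count records per (id1, id2) and look at the heaviest keys;
-- a roughly flat distribution here means no single key dominates the shuffle
select id1, id2, count(*) as cnt
from intab
group by id1, id2
order by cnt desc
limit 100;
```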

Any help would be much appreciated, e.g. optimising the storage or partition structure, tuning Spark settings, etc.
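
For concreteness, one restructuring I could imagine (a sketch only; outtab_stage is a hypothetical intermediate table, not something I have run) is to pre-aggregate within each ds partition first, so the final shuffle moves partial sums instead of raw records:

```sql
-- stage 1: partial sums within each date partition (hypothetical intermediate table)
create table outtab_stage as
select ds, id1, id2, sum(n) as part_n
from intab
group by ds, id1, id2;

-- stage 2: merge the partial sums into the final result
create table outtab as
select id1, id2, sum(part_n) as sum_n
from outtab_stage
group by id1, id2;
```

Whether this actually reduces shuffle volume depends on how often the same (id1, id2) pair repeats within a single ds partition.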

Update:

There are over 1 billion distinct values for each of id1 and id2 in the input table intab.

Changwang Zhang