I'm using HDP 2.6.4 and am seeing huge differences in Spark SQL vs Hive on TeZ. Here's a simple query on a table of ~95 M rows
SELECT DT, Sum(1) from mydata GROUP BY DT
DT
is partition column, a string that marks date.
In spark shell, with 15 executors, 10G memory for driver and 15G for executor, query runs for 10-15 seconds.
When running on Hive (from beeline), the query runs (actually is still running) for 500+ seconds. (!!!) To make things worse, this application takes even more resources (significantly) than the spark shell session I ran the job in.
UPDATE: It finished 1 row selected (672.152 seconds)
More information about the environment:
Only one queue used, with capacity scheduler
User under which the job is running is my own user. We have Kerberos used with LDAP
AM Resource: 4096 MB
using tez.runtime.compress with Snappy
data is in Parquet format, no compression applied
tez.task.resource.memory 6134 MB
tez.counters.max 10000
tez.counters.max.groups 3000
tez.runtime.io.sort.mb 8110 MB
tez.runtime.pipelined.sorter.sort.threads 2
tez.runtime.shuffle.fetch.buffer.percent 0.6
tez.runtime.shuffle.memory.limit.percent 0.25
tez.runtime.unordered.output.buffer.size-mb 460 MB
Enable Vectorization and Map Vectorization true
Enable Reduce Vectorization false
hive.vectorized.groupby.checkinterval 4096
hive.vectorized.groupby.flush.percent 0.1
hive.tez.container.size 682
More Updates:
When checking about vectorization on this link, I noticed I don't see Vectorized execution: true anywhere when I used explain
. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Namely, when checking table itself: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
Any comparisons between spark and tez usually come to relatively same terms, but I'm seeing dramatic differences.
What shd be the first thing to check?
Thx