
I'm using HDP 2.6.4 and am seeing huge differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95 million rows:

SELECT DT, SUM(1) FROM mydata GROUP BY DT

DT is the partition column, a string that represents a date.
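For context, the table presumably looks something like this (a sketch only; the real columns aren't shown here, so some_col is a placeholder and EXTERNAL is a guess):

    -- Hypothetical reconstruction of the table described above: an uncompressed
    -- Parquet table partitioned by the string date column dt.
    CREATE EXTERNAL TABLE mydata (
      some_col STRING  -- placeholder; the actual columns are not given here
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET;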

In spark-shell, with 15 executors, 10 GB of driver memory, and 15 GB per executor, the query runs in 10-15 seconds.

When run on Hive (from beeline), the query has been running for 500+ seconds (!!!). To make things worse, the Hive application takes significantly more resources than the spark-shell session I ran the job in.

UPDATE: It finally finished: 1 row selected (672.152 seconds)

More information about the environment:

  • Only one queue is in use, with the Capacity Scheduler

  • The job runs under my own user; we use Kerberos with LDAP

  • AM Resource: 4096 MB

  • using tez.runtime.compress with Snappy

  • data is in Parquet format, no compression applied

  • tez.task.resource.memory 6134 MB

  • tez.counters.max 10000

  • tez.counters.max.groups 3000

  • tez.runtime.io.sort.mb 8110 MB

  • tez.runtime.pipelined.sorter.sort.threads 2

  • tez.runtime.shuffle.fetch.buffer.percent 0.6

  • tez.runtime.shuffle.memory.limit.percent 0.25

  • tez.runtime.unordered.output.buffer.size-mb 460 MB

  • Enable Vectorization and Map Vectorization true

  • Enable Reduce Vectorization false

  • hive.vectorized.groupby.checkinterval 4096

  • hive.vectorized.groupby.flush.percent 0.1

  • hive.tez.container.size 682
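To make sure those values are what the beeline session actually uses (some can be overridden at session level), the quickest check I know of is printing them from beeline itself: SET with just a property name echoes the effective value. I'm assuming the full name behind the "tez.task.resource.memory" entry above is tez.task.resource.memory.mb:

    -- Print the values actually in effect for this beeline session:
    SET hive.vectorized.execution.enabled;
    SET hive.vectorized.execution.reduce.enabled;
    SET hive.tez.container.size;
    SET tez.task.resource.memory.mb;  -- assumed full property name (see note above)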

More Updates:

While reading about vectorization via this link, I noticed that I never see Vectorized execution: true anywhere in the EXPLAIN output. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}

Namely, when checking the table itself, it shows STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'.
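For reference, this is roughly how I'm checking both things. As far as I know, a vectorized plan is marked with "Execution mode: vectorized" under the Map stage (exact wording may vary by Hive version), and the partition spec below is just an example date:

    -- 1) Check whether the map work is vectorized; look for
    --    "Execution mode: vectorized" in the Map Operator Tree.
    EXPLAIN SELECT dt, COUNT(*) FROM mydata GROUP BY dt;

    -- 2) Compare the table-level storage descriptor with a partition-level one;
    --    an old partition can still point at TextInputFormat even if the table
    --    DDL says Parquet. (The partition value here is just an example.)
    DESCRIBE FORMATTED mydata;
    DESCRIBE FORMATTED mydata PARTITION (dt='2018-11-19');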

Comparisons between Spark and Tez usually come out roughly even, but I'm seeing dramatic differences.

What should be the first thing to check?

Thanks.

hummingBird
  • Try to find out what exactly is running slow: mappers or reducers; check the logs of the slow containers. How many mappers and reducers are running? The Tez configuration is also important; as it stands, the question is too broad. It also seems you are not using partition statistics for query calculation; such a simple query should run fast. Better to use count(*) instead of sum(1) – leftjoin Nov 19 '18 at 11:20
  • Thank you, but this is just a sample query... it's pretty much like this for other query types, too. I'll dig into other things and update as I find anything out – hummingBird Nov 19 '18 at 11:29
  • What is the file format -- CSV, AVRO, ORC, Parquet? Compressed? Does Hive run the query as `hive` on a different queue than your personal Spark session uses? Tez container size? Etc... – Samson Scharfrichter Nov 19 '18 at 20:15
  • @SamsonScharfrichter added more info to question... What else could be important? – hummingBird Nov 20 '18 at 13:39
  • `hive.vectorized.execution.enabled`? – Samson Scharfrichter Nov 20 '18 at 17:20
  • My guess is that Spark detects that the query does not require reading any actual value, just scanning the metadata footer of each Parquet file; while TEZ does not support that kind of shortcut _(because it assumes that real users want to run real queries...)_ and actually scans the files – Samson Scharfrichter Nov 20 '18 at 17:26
  • `hive.tez.container.size`? if not set, `mapreduce.map.memory.mb`? – Samson Scharfrichter Nov 20 '18 at 17:29
  • Adding more to comments... – hummingBird Nov 22 '18 at 11:20
  • tez is *always* slow even for absolutely trivial **in memory** queries. This is true even in 2022 – WestCoastProjects Oct 23 '22 at 08:51
  • Thanks... this went south in the end and we ended up setting up LLAP, which helped, but at a cost: our cluster is on an old version and we're not able to upgrade, since it's in production and we no longer have support. Thanks anyway; good for someone to see, I guess. – hummingBird Nov 07 '22 at 09:33
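Following up on leftjoin's first comment and Samson's metadata-footer theory: if partition statistics are fresh, Hive can answer simple counts from the metastore instead of scanning the Parquet files, which is essentially the same shortcut. A sketch of what that looks like (whether the GROUP BY variant is served purely from stats depends on the Hive version, so treat this as something to try, not a guarantee):

    -- Refresh basic stats for all partitions (this does scan data when first run):
    ANALYZE TABLE mydata PARTITION (dt) COMPUTE STATISTICS;

    -- Let Hive answer simple aggregates from stats rather than file scans:
    SET hive.compute.query.using.stats=true;
    SELECT dt, COUNT(*) FROM mydata GROUP BY dt;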

1 Answer


In the end, we gave up and installed LLAP. I'm going to accept this as the answer, since I have a mild case of OCD and this unanswered question has been poking my eyes for long enough.

hummingBird
  • What is the execution time now? – Ashish Doneriya Jan 12 '21 at 12:19
  • Well, still not perfect, but much, much better; let's say it cut the execution time by a factor of 2-5. However, in the meantime we added sorted buckets to the story. Of course, this gave us a new problem: Spark cannot insert into bucketed Hive tables, at least not in the version we have :D... So we had to introduce one more job, an hourly load from "stage" to "production". Long story short, we recently gave up and decided to go with Google Cloud and BigQuery. The reasons were license costs, operational costs, and the SLAs we were asked to meet. By that point we had pretty much exhausted all the options – hummingBird Jan 13 '21 at 20:55
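For anyone following the same path: the sorted-bucket layout mentioned in the comment above would look roughly like this (illustrative only; the clustering column and bucket count are made up, not taken from the actual cluster):

    -- Illustrative sketch of a bucketed, sorted Hive table of the kind described
    -- above, which Spark (in the version discussed) could not insert into.
    CREATE TABLE mydata_bucketed (
      some_col STRING  -- placeholder column
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (some_col) SORTED BY (some_col) INTO 32 BUCKETS
    STORED AS PARQUET;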