Apache Tez tasks on hold at the Application Master

Question

I have a Tez problem: when I run about 14 queries at the same time, some of them are delayed by more than 5 minutes, even though cluster utilization is only 14%.

This is the message I am talking about:

INFO SessionState: [HiveServer2-Background-Pool: Thread-322319]: Get Query Coordinator (AM) 308.84s

My configuration is the following:

yarn.scheduler.maximum-allocation-mb = 188000
yarn.app.mapreduce.am.resource.mb = 16000
tez.am.resource.memory.mb = 8000
hive.tez.container.size = 8192
tez.runtime.io.sort.mb = 2048
tez.am.launch.cmd-opts default - .8
tez.runtime.unordered.output.buffer.size-mb = 800
hive.server2.tez.sessions.per.default.queue = 2
tez.session.am.dag.submit.timeout.secs = 900
tez.am.session.min.held.containers = 8
tez.am.resource.memory.mb = 8000
hive.prewarm.enabled = TRUE

This is a 15-node cluster with 254 GB of RAM and 32 cores per node.

Any clue what might be happening? Is the AM well sized? I don't get out-of-memory errors, just these long wait times when everything is running at once, even though the queries process only 35 million records all together.

Thanks

5 minutes is the default duration after which a Tez session is closed when there is no execution. Out of the `308.84` sec, I see that execution takes `8.84` sec and the session is idle for the rest. I think you need to specify `hive.server2.tez.default.queues` as well; this will allow you to run the queries concurrently. For each queue name, `hive.server2.tez.sessions.per.default.queue` sessions are created. Also set `hive.server2.tez.initialize.default.sessions=true`, which initializes the default sessions. Please see https://community.cloudera.com/t5/Community-Articles/Hive-Understanding-concurrent-sessions-queue-allocation/ta-p/247407 - Sercan, Dec 29 '21 at 15:33

score 0 · Accepted Answer · answered Jan 27 '22 at 14:44

There is a behavior that is not well explained in the documentation: in order to really utilize the cluster and all your additional memory configurations, you MUST set up default queues, and you need to specify them when you query, connect Spark, etc.

For example, when using Tez, you need to set tez.queue.name={your queue name} in order to fully utilize it; this enables parallelism in YARN.

For Spark, you need to specify --queue {your queue name} when launching pyspark or when submitting jobs with spark-submit.

For the above to work, the queues must exist in YARN, and hive.server2.tez.default.queues must be set to the list of default queues for Tez. Note that you can create queues without listing them as default; in that case you need to name the queue explicitly every time, and your queries will not go into any default queue.
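To make the sizing concrete, here is a rough Python sketch of the arithmetic behind the default session pool. It assumes that each pooled Tez session runs one query at a time and that queries which do not name a queue are served from the pool; the queue names `etl` and `adhoc` and the helper functions are hypothetical, used only for illustration.

```python
def pool_settings(queues, sessions_per_queue):
    """Build the HiveServer2 properties for a default Tez session pool."""
    return {
        "hive.server2.tez.default.queues": ",".join(queues),
        "hive.server2.tez.sessions.per.default.queue": str(sessions_per_queue),
        "hive.server2.tez.initialize.default.sessions": "true",
    }


def pooled_sessions(queues, sessions_per_queue):
    """Pool size: one group of sessions per listed queue."""
    return len(queues) * sessions_per_queue


def waiting_queries(concurrent_queries, queues, sessions_per_queue):
    """Queries left waiting for a session (assumes one query per session)."""
    return max(0, concurrent_queries - pooled_sessions(queues, sessions_per_queue))


if __name__ == "__main__":
    queues = ["etl", "adhoc"]  # hypothetical YARN queue names
    for key, value in pool_settings(queues, 4).items():
        print(f"{key} = {value}")
    # A single queue with 2 sessions, as in the question's settings:
    print(waiting_queries(14, ["default"], 2))
    # Two hypothetical queues with 4 sessions each:
    print(waiting_queries(14, queues, 4))
```

Under these assumptions, a single queue with 2 sessions per queue leaves 12 of 14 concurrent queries waiting for a session, which would be consistent with long "Get Query Coordinator (AM)" times on a mostly idle cluster.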
