This is the last message in the logs. I am using Spark version 3.1.2:
INFO BlockManagerInfo: Removed broadcast_2**_piece0 on *****:32789 in memory
I have 500 million strings in a single column of a large table, let's call it big_table, and big_table is stored in Parquet format.
When I run select * from big_table, the query appears to complete quickly according to the logs (I am assuming this).
However, CPU usage hits 100% and stays there for a long time. I suspect that because there are many repeated strings (only 7.7 million unique values), Spark has to deserialize and decompress those 7.7 million unique strings and expand them back into 500 million strings. I am hypothesizing that this causes the high memory and CPU usage observed in the image below.
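To make the hypothesis concrete, here is a minimal, hypothetical sketch (not the actual Parquet reader) of how a dictionary-encoded column is expanded at read time: the file stores one copy of each unique string plus a small index per row, and materializing the column turns the unique values into one full string object per row. The names and the scaled-down counts (5 unique values, 20 rows, stand-ins for 7.7 million and 500 million) are all illustrative assumptions:

```python
import random

random.seed(0)

# What the file effectively stores: a small dictionary of unique
# strings plus one integer index per row (illustrative, not real
# Parquet internals).
dictionary = [f"value_{i}" for i in range(5)]        # unique strings on disk
indices = [random.randrange(5) for _ in range(20)]   # one index per row

# Decoding expands the indices back into a full-size column, so the
# number of in-memory string references scales with row count, not
# with the number of unique values.
column = [dictionary[i] for i in indices]

print(len(dictionary), len(column))  # 5 20
```

If this model is right, the CPU and memory cost of a full select * would scale with the 500 million rows being materialized, even though only 7.7 million distinct strings are stored.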
I am submitting SQL queries to Spark via the Spark Thrift Server. Below is an htop view of the Spark master while the job is presumably stuck ("stuck" may not be the right word).