This is the last message in the logs. I am using Spark version 3.1.2:
INFO BlockManagerInfo: Removed broadcast_2**_piece0 on *****:32789 in memory
I have 500 million strings in a single column of a large table, let's call it big_table, and big_table is stored in Parquet format.
When I run select * from big_table, the query appears to complete quickly according to the logs (I am assuming this).
However, CPU usage hits 100% and stays there for a long time. I suspect that because there are many repeated strings (only 7.7 million unique values), Spark has to deserialize and decompress those 7.7 million unique strings and expand them back into 500 million strings. I am hypothesizing that this causes the high memory and CPU usage observed in the image below.
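To make the hypothesis concrete, here is a minimal, hypothetical sketch (not the actual Parquet reader) of how a dictionary-encoded column is expanded at read time: the file stores one copy of each unique string plus a small index per row, and materializing the column turns the unique values into one full string object per row. The names and the scaled-down counts (5 unique values, 20 rows, stand-ins for 7.7 million and 500 million) are all illustrative assumptions:

```python
import random

random.seed(0)

# What the file effectively stores: a small dictionary of unique
# strings plus one integer index per row (illustrative, not real
# Parquet internals).
dictionary = [f"value_{i}" for i in range(5)]        # unique strings on disk
indices = [random.randrange(5) for _ in range(20)]   # one index per row

# Decoding expands the indices back into a full-size column, so the
# number of in-memory string references scales with row count, not
# with the number of unique values.
column = [dictionary[i] for i in indices]

print(len(dictionary), len(column))  # 5 20
```

If this model is right, the CPU and memory cost of a full select * would scale with the 500 million rows being materialized, even though only 7.7 million distinct strings are stored.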
I am submitting SQL queries to Spark via the Spark Thrift Server. Below is an htop view of the Spark master while the job is presumably stuck ("stuck" may not be the right word).