I'm seeing very slow performance from my Spark jobs on Dataproc when reading Parquet data from Google Cloud Storage (GCS). I have run two experiments with different data sizes and observed significant latency in both. Here are the details:
Experiment 1: Total data size is 232GB, consisting of approximately 11,700 files. The job takes around 4 hours and 20 minutes to complete. I notice approximately 7.5TB of pending YARN memory at the start of the job.
Experiment 2: Total data size is 347GB, consisting of approximately 3,500 files. The job takes around 4 hours to complete. I observe approximately 11TB of pending YARN memory at the start of the job.
Both experiments use the same table data from BigQuery. In the second experiment, I applied clustering and partitioning to the table so that the exported files are larger, as recommended in the documentation.
I'm running the jobs on a Dataproc cluster with 1 master, 2 workers, and 1 spot worker; each worker has 4 vCPUs and 15 GB RAM. I have allocated 4 cores to both the driver and each executor, 12 GB RAM to the executor, and 7 GB RAM to the driver.
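In case it helps, this is roughly how the Spark session is configured (a minimal PySpark sketch; the app name is a placeholder, and the property values simply mirror the allocation described above):

    from pyspark.sql import SparkSession

    # Sketch of the session configuration used in both experiments.
    # The values mirror the allocation described above; the app name is hypothetical.
    spark = (
        SparkSession.builder
        .appName("gcs-parquet-read-test")
        .config("spark.driver.cores", "4")
        .config("spark.driver.memory", "7g")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "12g")
        .getOrCreate()
    )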
During the job execution, I noticed the following log messages indicating potential high latency:
23/07/17 14:52:28 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=359; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history
23/07/17 14:52:28 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
23/07/17 14:52:28 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_mkdirs. latencyMs=213; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history
23/07/17 14:52:29 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_create. latencyMs=129; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history/application_1689583112984_0011.inprogress
23/07/17 14:52:33 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=1730; previousMaxLatencyMs=359; operationCount=2; context=gs://my-bucket/test/
Based on this, I have the following questions:
Is it expected for a simple select * query over approximately 300 GB of data to take 4 hours or more to complete?
What can I do to improve the performance of these Spark jobs?
Is GCS inherently slower when used with Dataproc Spark jobs?
Could creating a temporary view affect the execution time of the job? In the Spark UI, the parquet reads/writes are the only bottleneck I see, with no issues related to the temporary view (a rough sketch of the job structure is included below).
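For context, the job is essentially the following (a minimal PySpark sketch; the view name and output path are placeholders, and the input path is the bucket shown in the logs above):

    # Minimal sketch of the job: read the parquet files from GCS, register a
    # temporary view, run a plain select *, and write the result back to GCS.
    # "source_table" and the output path are placeholders, not the real names.
    df = spark.read.parquet("gs://my-bucket/test/")

    df.createOrReplaceTempView("source_table")
    result = spark.sql("SELECT * FROM source_table")

    result.write.mode("overwrite").parquet("gs://my-bucket/output/")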
I initially considered offloading the query from BigQuery to Spark to reduce cost, but the Dataproc Spark jobs are running extremely slowly. Any advice or help would be highly appreciated.
Thank you.