I'm seeing very slow performance from my Spark jobs on Dataproc when reading Parquet data from Google Cloud Storage (GCS). I have run two experiments with different data sizes and observed significant latency in both. Here are the details:
Experiment 1: Total data size is 232GB, consisting of approximately 11,700 files. The job takes around 4 hours and 20 minutes to complete. I notice approximately 7.5TB of pending YARN memory at the start of the job.
Experiment 2: Total data size is 347GB, consisting of approximately 3,500 files. The job takes around 4 hours to complete. I observe approximately 11TB of pending YARN memory at the start of the job.
Both experiments use the same table data from BigQuery. In the second experiment, I applied clustering and partitioning to the table so that the exported files are larger, as recommended in the documentation.
I'm running the jobs on a Dataproc cluster with 1 master, 2 workers, and 1 spot worker; each worker has 4 vCPUs and 15 GB RAM. I have allocated 4 cores to both the driver and each executor, 12 GB RAM to the executor, and 7 GB RAM to the driver.
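In case it helps, this is roughly how the Spark session is configured (a minimal PySpark sketch; the app name is a placeholder, and the property values simply mirror the allocation described above):

    from pyspark.sql import SparkSession

    # Sketch of the session configuration used in both experiments.
    # The values mirror the allocation described above; the app name is hypothetical.
    spark = (
        SparkSession.builder
        .appName("gcs-parquet-read-test")
        .config("spark.driver.cores", "4")
        .config("spark.driver.memory", "7g")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "12g")
        .getOrCreate()
    )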
During the job execution, I noticed the following log messages indicating potential high latency:
23/07/17 14:52:28 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=359; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history
23/07/17 14:52:28 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
23/07/17 14:52:28 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_mkdirs. latencyMs=213; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history
23/07/17 14:52:29 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_create. latencyMs=129; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history/application_1689583112984_0011.inprogress
23/07/17 14:52:33 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=1730; previousMaxLatencyMs=359; operationCount=2; context=gs://my-bucket/test/
Based on this, I have the following questions:
Is it expected for a simple select * query over approximately 300 GB of data to take 4 hours or more to complete?
What can I do to improve the performance of these Spark jobs?
Is GCS inherently slower when used with Dataproc Spark jobs?
Could creating a temporary view affect the execution time of the job? In the Spark UI, the parquet reads/writes are the only bottleneck I see, with no issues related to the temporary view (a rough sketch of the job structure is included below).
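For context, the job is essentially the following (a minimal PySpark sketch; the view name and output path are placeholders, and the input path is the bucket shown in the logs above):

    # Minimal sketch of the job: read the parquet files from GCS, register a
    # temporary view, run a plain select *, and write the result back to GCS.
    # "source_table" and the output path are placeholders, not the real names.
    df = spark.read.parquet("gs://my-bucket/test/")

    df.createOrReplaceTempView("source_table")
    result = spark.sql("SELECT * FROM source_table")

    result.write.mode("overwrite").parquet("gs://my-bucket/output/")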
I initially considered offloading the query from BigQuery to Spark to reduce cost, but the Dataproc Spark jobs are running extremely slowly. Any advice or help would be highly appreciated.
Thank you.