
My Spark jobs on Dataproc are running very slowly when reading Parquet data from Google Cloud Storage (GCS). I have run two experiments with different data sizes and observed significant latency in both. Here are the details:

Experiment 1: Total data size is 232GB, consisting of approximately 11,700 files. The job takes around 4 hours and 20 minutes to complete. I notice approximately 7.5TB of pending YARN memory at the start of the job.

Experiment 2: Total data size is 347GB, consisting of approximately 3,500 files. The job takes around 4 hours to complete. I observe approximately 11TB of pending YARN memory at the start of the job.

Both experiments use the same table data from BigQuery. In the second experiment, I applied clustering and partitioning to produce larger files, as recommended in the documentation.

I'm running the jobs on a Dataproc cluster with a master, 2 workers, and 1 spot worker. Each worker has 4 vCPUs and 15GB RAM. I have allocated 4 cores to both the driver and executor, 12GB RAM to the executor, and 7GB RAM to the driver.
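For reference, a minimal sketch of the Spark session configuration with those allocations (the app name is just a placeholder, and on Dataproc these values are normally passed as job properties rather than hard-coded):

```python
from pyspark.sql import SparkSession

# Resource settings matching the allocation described above.
# "gcs-parquet-read" is a placeholder application name.
spark = (
    SparkSession.builder
    .appName("gcs-parquet-read")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "12g")
    .config("spark.driver.cores", "4")
    .config("spark.driver.memory", "7g")
    .getOrCreate()
)
```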

During the job execution, I noticed the following log messages indicating potential high latency:

23/07/17 14:52:28 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=359; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history
23/07/17 14:52:28 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
23/07/17 14:52:28 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_mkdirs. latencyMs=213; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history
23/07/17 14:52:29 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_create. latencyMs=129; previousMaxLatencyMs=0; operationCount=1; context=gs://dataproc-temp-us-east1-1049208745052-m7qadli0/31ac51c1-4713-479b-a575-bd6e93bc288a/spark-job-history/application_1689583112984_0011.inprogress
23/07/17 14:52:33 WARN com.google.cloud.hadoop.fs.gcs.GhfsStorageStatistics: Detected potential high latency for operation op_get_file_status. latencyMs=1730; previousMaxLatencyMs=359; operationCount=2; context=gs://my-bucket/test/

Based on this, I have the following questions:

  1. Is it expected for a simple `select *` query on approximately 300GB of data to take 4 hours or more to complete?

  2. What can I do to improve the performance of these Spark jobs?

  3. Is GCS inherently slow when used as the storage layer for Dataproc Spark jobs?

  4. Could creating a temporary view affect the execution time of the job? In the Spark UI, I only see the Parquet reads/writes as the bottleneck and nothing related to the temporary view. (A simplified sketch of the job flow is shown after these questions.)
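
To make question 4 concrete, the job essentially follows this pattern (a simplified sketch; the view name and output path below are placeholders, not the real ones from the job):

```python
# Read the Parquet files from GCS, expose them as a temporary view,
# run the simple select * query, and write the result back to GCS.
df = spark.read.parquet("gs://my-bucket/test/")
df.createOrReplaceTempView("my_table")
result = spark.sql("SELECT * FROM my_table")
result.write.mode("overwrite").parquet("gs://my-bucket/output/")
```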

I initially considered offloading the query from BigQuery to Spark to reduce cost, but the Dataproc Spark jobs are running extremely slowly. Any advice or help would be highly appreciated.

Thank you.

  • Did you get any error message when reading data from GCS? Do [link1](https://cloud.google.com/blog/topics/developers-practitioners/dataproc-best-practices-guide) and [link2](https://stackoverflow.com/questions/53260735/gcp-dataproc-slow-read-speed-from-gcs) help you? – kiran mathew Jul 19 '23 at 07:12
  • @kiranmathew No, I didn't get any error message, only the warnings about slowness that I shared in the question. Thanks for sharing these links though, I will check them to see if there is anything I can improve. – bha159 Jul 21 '23 at 08:04
