
Update

When loading the files with the DataFrame reader I achieved far superior performance. I haven't had a chance to look into why this is, but reading this way and then converting to an RDD is the best solution I've found so far.

sparkSession.read.text("gs://bucket/some/sub/directory/prefix*")
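Roughly, the full pattern looks like this (a minimal sketch; sparkSession is the active SparkSession, and the map back to plain strings is only there because my downstream processing is RDD-based):

df = sparkSession.read.text("gs://bucket/some/sub/directory/prefix*")

# read.text yields one row per input line, in a single "value" column;
# convert back to an RDD of plain strings for the RDD-based code.
rdd = df.rdd.map(lambda row: row.value)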

I'm trying to do a simple read of files from a GCS bucket into Spark on Dataproc and insert them into a Hive table. I'm getting very poor network bandwidth (max 1 MB/s) when downloading the files from the bucket.

Cluster: 3 x n1-standard-4 (one is master).

The bucket has 1440 gzipped files, approx. 4 MB each.

I am loading the data into Spark using

sc.textFile("gs://bucket/some/sub/directory/prefix*")
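For context, the rest of the job is roughly as follows (a sketch only; parse_line and my_hive_table are placeholders for my actual parsing logic and table name):

lines = sc.textFile("gs://bucket/some/sub/directory/prefix*")

# parse_line (placeholder) turns each text line into a tuple of column values.
rows = lines.map(parse_line)

# Convert to a DataFrame and append into the Hive table (placeholder name).
rows.toDF().write.mode("append").saveAsTable("my_hive_table")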

My dataproc cluster and GCS bucket are in the same region/zone. The bucket is regional only (not multi-regional).

I have observed that increasing the size of my cluster increases the maximum network bandwidth, but I don't want to run a massive cluster just to get decent network bandwidth.

If I download the same data using gsutil cp (running on the Dataproc master VM instance), it takes only ~30 seconds.

Is there some setting I am missing, or is the sc.textFile(...) approach simply a poor fit for GCS?

Thanks

Daniel Messias
  • How many Spark cores does that Spark application have? As far as I understand, your input will be converted into an RDD with 1440 partitions (because gzips are not splittable), and these 1440 tasks would be scheduled across your available Spark cores. With a cluster of your size that could be 8 cores, i.e. 8 parallel tasks, so the total time would be around 1440/8 = 180 times the duration of one task. – chemikadze Jun 03 '18 at 18:23
  • I've run it with all sorts of different cluster sizes, with a seemingly linear performance increase as the number of cores increases, yet the base speed was just too slow. See the edit above for the best solution I've found so far. – Daniel Messias Jun 04 '18 at 15:52
  • You may want to use the latest GCS connector, 1.9.6, which has significant improvements to IO performance (https://github.com/GoogleCloudPlatform/bigdata-interop/releases/tag/v1.9.6). You can update the GCS connector on a Dataproc cluster using this init action: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/connectors – Igor Dvorzhak Sep 06 '18 at 16:24

1 Answer


This blog post should answer a few questions about RDD vs DataFrame performance differences: https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash

More generally:

  • GCS IO performance can vary depending on load
  • IO performance can vary depending on which GCE zone the VMs are in
  • IO throughput depends on the number of CPUs and on disk size

In my own testing for this post, gsutil cp to local disk was the slowest, while various distributed reads were significantly faster on a dataset similar to yours (1440 text files of 4 MB of random data):

import timeit
i1 = sc.textFile("gs://my-bucket/input/*")

# Ordered from fastest to slowest:
timeit.timeit(lambda: spark.read.text("gs://.../input/*").count(), number=1)  # DataFrame read

timeit.timeit(lambda: i1.count(), number=1)  # RDD from sc.textFile

timeit.timeit(lambda: spark.read.text("gs://.../input/*").rdd.count(), number=1)  # DataFrame read, converted to RDD
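If you want to see how much parallelism each variant actually gets, one quick check (a sketch reusing the inputs above) is to compare partition counts, since they bound the number of concurrent read tasks:

# Number of partitions = upper bound on concurrent tasks for the read.
print(i1.getNumPartitions())
print(spark.read.text("gs://.../input/*").rdd.getNumPartitions())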
tix
  • The blog was a good read, thanks! I've also noticed that gsutil -m cp (multi-threaded copy) was actually quite fast. Thanks for the reply! – Daniel Messias Jun 08 '18 at 08:28