Update
When loading the files via the DataFrame API I get far superior performance. I haven't had a chance to look into why this is yet, but reading the files like this and then converting to an RDD is the best solution I've found so far:
sparkSession.read.text("gs://bucket/some/sub/directory/prefix*")
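For anyone wondering what the conversion looks like, here's a minimal sketch (assuming a Scala job; the only extra step is pulling the string out of each Row):

    val df = sparkSession.read.text("gs://bucket/some/sub/directory/prefix*")
    // spark.read.text yields a DataFrame with a single string column named "value"
    val linesRdd = df.rdd.map(row => row.getString(0))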
I'm trying to do a simple read of files in a GCS bucket into Dataproc Spark and insert them into a Hive table. I'm getting very poor network bandwidth (max 1 MB/s) when downloading the files from the bucket.
Cluster: 3 x n1-standard-4 (one is master).
The bucket has 1440 gzipped objects, approx. 4 MB each.
I am loading the data into Spark using
sc.textFile("gs://bucket/some/sub/directory/prefix*")
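For context, the rest of the job is roughly the following; the table name and single-column layout are placeholders, not my real schema:

    import sparkSession.implicits._
    // Hypothetical downstream step: turn the lines into a DataFrame and write it to a Hive table
    val lines = sc.textFile("gs://bucket/some/sub/directory/prefix*")
    lines.toDF("value").write.mode("overwrite").saveAsTable("my_db.my_table")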
My dataproc cluster and GCS bucket are in the same region/zone. The bucket is regional only (not multi-regional).
I have observed that increasing the size of my cluster will increase my maximum network bandwidth, but I don't want to use a massive cluster just to get decent network bandwidth.
If I download the same data using gsutil cp (running on the Dataproc master VM instance), it takes only ~30 seconds.
Is there some setting I am missing, or is the sc.textFile(...) approach highly suboptimal for GCS?
Thanks