
We run Spark in standalone mode with 3 nodes on a 240 GB "large" EC2 box. The job reads three CSV files into DataFrames, merges them into a JavaRDD, and writes the result out as CSV part files to S3 using s3a.

From the Spark UI we can see that the first stages, which read and merge the inputs to produce the final JavaRDD, run at 100% CPU as expected. The final stage, however, which writes the output as CSV files using saveAsTextFile at package.scala:179, gets "stuck" for many hours on 2 of the 3 nodes, with 2 of the 32 tasks taking hours (the box sits at 6% CPU, 86% memory, 15 kB/s network I/O and zero disk I/O for the entire period).

We read and write uncompressed CSV (we found uncompressed to be much faster than gzipped CSV), call repartition(16) on each of the three input DataFrames, and do not coalesce before the write.
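Roughly, the job has the following shape (a simplified sketch rather than our exact code: the paths, merge logic and column handling are placeholders, and the Spark 2.x DataFrame API is assumed):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvMergeJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("csv-merge").getOrCreate();

            // Read the three inputs as DataFrames and repartition each to 16 partitions.
            Dataset<Row> a = spark.read().option("header", "true").csv("s3a://bucket/in/a.csv").repartition(16);
            Dataset<Row> b = spark.read().option("header", "true").csv("s3a://bucket/in/b.csv").repartition(16);
            Dataset<Row> c = spark.read().option("header", "true").csv("s3a://bucket/in/c.csv").repartition(16);

            // Merge the inputs (the real merge logic is more involved) and drop to a JavaRDD of CSV lines.
            Dataset<Row> merged = a.union(b).union(c);
            JavaRDD<String> lines = merged.toJavaRDD().map(row -> row.mkString(","));

            // Final stage: write uncompressed CSV part files to S3 via s3a, without coalescing.
            lines.saveAsTextFile("s3a://bucket/out/merged");

            spark.stop();
        }
    }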

We would appreciate any hints on what to investigate: why does the final stage take so many hours while doing very little on 2 of the 3 nodes of our standalone cluster?

Many thanks

--- UPDATE ---

I tried writing to local disk rather than to s3a and the symptoms are the same: 2 of the 32 tasks in the final saveAsTextFile stage get "stuck" for hours:

[Spark UI screenshot of the stuck saveAsTextFile tasks]

twiz911

2 Answers


If you are writing to S3, via s3n, s3a or otherwise, do not set spark.speculation = true unless you want to run the risk of corrupted output. What I suspect is happening is that the final stage of the process is renaming the output files, which on an object store means copying a lot (many GB?) of data. The rename takes place on the server side, with the client just keeping an HTTPS connection open until it finishes. I'd estimate the S3A rename rate at about 6-8 MB/s... would that number tie in with your results?

Write to local HDFS, then upload the output to S3 afterwards.
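A rough illustration of that pattern (a sketch only: the staging path and destination are placeholders, and it assumes the s3a credentials are already present in the Hadoop configuration):

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaRDD;

    public class WriteViaHdfs {
        // 'lines' is the JavaRDD<String> the question is writing out.
        public static void writeThenUpload(JavaRDD<String> lines) throws IOException {
            // 1. Let Spark write and commit (rename) the part files on HDFS, where rename is cheap.
            String staging = "hdfs:///tmp/merged-output";          // placeholder staging path
            lines.saveAsTextFile(staging);

            // 2. Copy the finished output to S3 in a single pass, outside the job's commit path.
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(URI.create(staging), conf);
            Path dest = new Path("s3a://bucket/output/merged");    // placeholder destination
            FileSystem s3 = dest.getFileSystem(conf);
            FileUtil.copy(hdfs, new Path(staging), s3, dest, false /* keep the HDFS copy */, conf);
        }
    }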

  1. gzip compression can't be split, so Spark will not assign parts of a gzipped file to different executors. One file: one executor.
  2. Try to avoid CSV; it's an ugly format. Embrace Avro, Parquet or ORC instead. Avro is great for other apps to stream into; the others are better for downstream processing in later queries. Significantly better.
  3. And consider compressing the files with a codec such as LZO or Snappy, both of which can be split (a sketch follows this list).
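For instance, a Parquet write with Snappy compression is a one-liner (a sketch assuming the Spark 2.x DataFrame API; 'merged' and the output path stand in for whatever the job actually produces):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class ParquetWrite {
        // 'merged' stands for the combined DataFrame before it is flattened to CSV lines.
        public static void writeAsParquet(Dataset<Row> merged) {
            merged.write()
                  .option("compression", "snappy")   // Snappy blocks are splittable inside the Parquet container
                  .parquet("s3a://bucket/output/merged-parquet");  // placeholder output path
        }
    }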

see also slides 21-22 on: http://www.slideshare.net/steve_l/apache-spark-and-object-stores

stevel
  • Hi Steve. Many thanks for the reply. Yes, I've heard of the renaming issues, but our 4+ hours of idling would imply a 100 GB output file; we are expecting about 5 GB. Unfortunately the upstream and downstream processes are out of our control and both use CSV. We are using Spark 1.6. Perhaps we should try upgrading to 2.0 if writing to HDFS and then uploading doesn't solve the issue? – twiz911 Nov 21 '16 at 00:01
  • I also tried the advice in slides 21-22 of http://www.slideshare.net/steve_l/apache-spark-and-object-stores, which made no difference. – twiz911 Nov 21 '16 at 03:35
  • You can try the upgrade; it's worth it for other reasons (hint: DataFrames), but it doesn't change the low-level S3 connections or the file output committer. While the upload is stuck, do a kill -QUIT to show the stacks of all the blocked threads. – stevel Nov 21 '16 at 13:12
  • Hi Steve. Thanks for the advice. We changed to write to the local file system and still the last 2 tasks get stuck running for hours while the box does nothing. – twiz911 Nov 22 '16 at 00:22
  • Just wondered if you or anyone had any other ideas? Many thanks for your time and advice. – twiz911 Nov 25 '16 at 00:50
  • You are in diagnostics mode now. Turn up the debugging and see what's happening in the logs; use jstack to see what the threads are up to. – stevel Nov 25 '16 at 18:49
  • We have since upgraded to Spark 2.0.2 but still have the same issue with two tasks in the 32 task stage never completing and the box idling over for hours. – twiz911 Nov 30 '16 at 00:51

I have seen similar behavior. There is a bug fix in HEAD as of October 2016 that may be relevant. But for now you might enable

spark.speculation=true

in the SparkConf or in spark-defaults.conf.
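For example, setting it programmatically (a minimal sketch in Java; the app name is a placeholder, and note the other answer's warning about speculation when writing directly to S3):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SpeculativeConf {
        public static void main(String[] args) {
            // Enable speculative execution: slow (straggler) task attempts are re-launched
            // on other executors, and the losing attempt is killed when one finishes.
            SparkConf conf = new SparkConf()
                    .setAppName("csv-merge")              // placeholder app name
                    .set("spark.speculation", "true");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... run the job with sc ...
            sc.stop();
        }
    }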

Let us know if that mitigates the issue.

WestCoastProjects