I have a DataFrame (a Dataset&lt;Row&gt;) and want to save it to Redshift:

// spark-redshift stages the data as files in S3 (tempdir) and then
// issues a Redshift COPY to load them into dbTable
df.write()
    .format("com.databricks.spark.redshift")
    .option("url", url)
    .option("dbtable", dbTable)
    .option("tempdir", tempDir)
    .mode("append")
    .save();

Setup:

  • Spark (spark-core, spark-sql): 2.0.1/Scala: 2.11
  • JDBC driver to connect to Redshift (postgresql): 9.4.1208.jre7
  • AWS SDKs (aws-java-sdk-core, aws-java-sdk-s3): 1.11.48

Just before the write I create the target table in Redshift, which works just fine (using the Postgres JDBC driver). However, after the table creation the job basically stalls, and I can't extract any more helpful information from the logs. What could be the reason for that?

I have tried setting the auth credentials as part of the tempdir URI as well as via a Hadoop configuration on the Spark context, as described here. Both ways work locally, but could there be an issue when submitting the job to run on Dataproc?
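For reference, the two credential setups I tried look roughly like this (a minimal sketch; the bucket name and keys are placeholders, and spark is assumed to be the SparkSession):

// Variant 1: embed the AWS keys directly in the tempdir URI
String tempDir = "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/spark-redshift-tmp/";

// Variant 2: set the keys on the Hadoop configuration of the SparkContext,
// so the s3n filesystem used for the tempdir picks them up
spark.sparkContext().hadoopConfiguration()
    .set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
spark.sparkContext().hadoopConfiguration()
    .set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");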

Christian
  • One thing that would be useful information for any potential answerers would be if you can get a "jstack" of the stalling process (possibly your driver program running on the Dataproc master node). It might be running as root or as your own username; you may have to "sudo jstack <pid>" or "sudo -u <user> jstack <pid>" for it to work. – Dennis Huo Nov 01 '16 at 21:10
  • Which version of `spark-redshift` are you using? +1 on grabbing a `jstack`; I suspect that Spark is blocking while waiting on Redshift to ingest the written data. – Josh Rosen Nov 15 '16 at 01:04

0 Answers