I have a DataFrame (Dataset) that I want to save to Redshift:
// Write via the spark-redshift connector: rows are staged as files in the
// S3 tempdir and then loaded into Redshift with a COPY command.
df.write()
    .format("com.databricks.spark.redshift")
    .option("url", url)         // Redshift JDBC URL
    .option("dbtable", dbTable) // target table in Redshift
    .option("tempdir", tempDir) // S3 path used as the staging area
    .mode("append")
    .save();
Setup:
- Spark (spark-core, spark-sql): 2.0.1/Scala: 2.11
- JDBC driver to connect to Redshift (postgresql): 9.4.1208.jre7
- AWS SDKs (aws-java-sdk-core, aws-java-sdk-s3): 1.11.48
Just before the write I create the target table in Redshift, which works just fine (using the Postgres JDBC driver). However, after the table creation the job basically stalls, and I can't extract any more helpful information from the logs. What could be the reason for that?
I have tried setting the AWS credentials as part of the tempdir URI as well as setting them in the Hadoop configuration on the Spark context, as described here. Both approaches work locally, but could there be an issue when submitting the job to run on Dataproc?
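For reference, this is roughly what the two credential approaches look like in my code (a sketch: the key placeholders, bucket name, temp path, app name, and the s3n scheme stand in for my actual values):

import org.apache.spark.sql.SparkSession;

String awsAccessKeyId = "<ACCESS_KEY_ID>";          // placeholder
String awsSecretAccessKey = "<SECRET_ACCESS_KEY>";  // placeholder

// Approach 1: credentials embedded directly in the tempdir URI
String tempDir = "s3n://" + awsAccessKeyId + ":" + awsSecretAccessKey
        + "@my-bucket/redshift-temp/";

// Approach 2: credentials set on the Hadoop configuration of the SparkContext
SparkSession spark = SparkSession.builder().appName("redshift-write").getOrCreate();
spark.sparkContext().hadoopConfiguration()
        .set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
spark.sparkContext().hadoopConfiguration()
        .set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey);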