
I'm trying to find the source of a bug on Spark 2.0.0. I have a map that holds table names as keys and the DataFrames as values. I loop through it and, at the end of each iteration, use spark-avro (3.0.0-preview2) to write everything to S3 directories. It runs perfectly locally (with a local path instead of an S3 path, of course), but when I run it on Amazon EMR it runs for a while and then fails saying the folder already exists and terminates (which would mean the same key is being used in that for loop more than once, right?). Is this possibly a threading issue?

for ((k, v) <- tableMap) {
  val currTable: DataFrame = tableMap(k)
  val decryptedCurrTable = currTable.withColumn("data", decryptUDF(currTable("data")))
  val decryptedCurrTableData = sparkSession.sqlContext.read.json(decryptedCurrTable.select("data").rdd.map(row => row.toString()))
  decryptedCurrTable.write.avro(s"s3://..../$k/table")
  decryptedCurrTableData.write.avro(s"s3://..../$k/tableData")
}

1 Answer


I think it was a concurrency issue; I changed my code to:

decryptedCurrTable.write.mode("append").avro(s"s3://..../$k/table")
decryptedCurrTableData.write.mode("append").avro(s"s3://..../$k/tableData")  

And everything worked fine.
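
For context, here is a minimal sketch of how the full loop might look with that fix applied. It assumes the same tableMap, decryptUDF, and sparkSession from the question, and the Databricks spark-avro import that provides the .avro writer method; the S3 path is left elided as in the question.

import org.apache.spark.sql.DataFrame
import com.databricks.spark.avro._

// tableMap, decryptUDF and sparkSession are assumed to be defined as in the question.
for ((k, currTable) <- tableMap) {
  val decryptedCurrTable = currTable.withColumn("data", decryptUDF(currTable("data")))
  val decryptedCurrTableData = sparkSession.sqlContext.read.json(
    decryptedCurrTable.select("data").rdd.map(_.toString()))

  // "append" writes into an existing destination directory instead of
  // failing with a "folder already exists" error.
  decryptedCurrTable.write.mode("append").avro(s"s3://..../$k/table")
  decryptedCurrTableData.write.mode("append").avro(s"s3://..../$k/tableData")
}

Note that append mode only stops the writer from failing on an existing directory; if the job is retried, data can accumulate in the same destination.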

  • Hi, I was working on a similar use case, but based on what I see in your own answer, your key $k keeps changing, meaning that you're not writing to the same destination, so "append" or "overwrite" mode should not affect the concurrency? – c74ckds Dec 30 '19 at 15:12