I'm new to PySpark and AWS Glue, and I'm running into an issue when I try to write out a file with Glue. When I write some output to S3 using Glue's write_dynamic_frame_from_options, the job fails with this exception:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 199.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 199.0 (TID 7991, 10.135.30.121, executor 9): java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
Header length: 7, schema size: 6
CSV file: s3://************************************cache.csv
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:180)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:176)
at scala.Option.foreach(Option.scala:257)
at .....
It seems to be saying that my dataframe's schema has 6 fields, but the CSV has 7. I don't understand which CSV it's talking about, because I'm actually trying to create a new CSV from the dataframe... Any insight into this specific issue, or into how the write_dynamic_frame_from_options method works in general, would be very helpful!
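In case it helps narrow things down, here's the kind of sanity check I've been adding inside the function to compare the header of the existing cache CSV against the dataframe's columns (the path below is a placeholder, not the real one):

# Debug-only check inside update_geocache (placeholder path, not the real key):
# read the existing cache file with its header and compare column counts.
spark = glueContext.spark_session
cache_check_df = spark.read.option("header", "true").csv("s3://MY_BUCKET/geocache/cache.csv")
logger.info("Cache header columns ({}): {}".format(len(cache_check_df.columns), cache_check_df.columns))
logger.info("Original df columns ({}): {}".format(len(originalDf.columns), originalDf.columns))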
Here is the source code for the function in my job that is causing this issue.
def update_geocache(glueContext, originalDf, newDf):
    logger.info("Got the two df's to union")
    logger.info("Schema of the original df")
    originalDf.printSchema()
    logger.info("Schema of the new df")
    newDf.printSchema()
    # add the two DataFrames together
    unioned_df = originalDf.unionByName(newDf).distinct()
    logger.info("Schema of the union")
    unioned_df.printSchema()
    # root
    # |-- location_key: string (nullable = true)
    # |-- addr1: string (nullable = true)
    # |-- addr2: string (nullable = true)
    # |-- zip: string (nullable = true)
    # |-- lat: string (nullable = true)
    # |-- lon: string (nullable = true)

    # Create just 1 partition, because there is so little data
    unioned_df = unioned_df.repartition(1)
    logger.info("Unioned the geocache and the new addresses")
    # Convert back to a DynamicFrame
    dynamic_frame = DynamicFrame.fromDF(
        unioned_df, glueContext, "dynamic_frame")
    logger.info("Converted the unioned tables to a Dynamic Frame")
    # Write data back to S3
    # THIS IS THE LINE THAT THROWS THE EXCEPTION
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": "s3://" + S3_BUCKET + "/" + TEMP_FILE_LOCATION
        },
        format="csv"
    )
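For context, the function gets called roughly like this (simplified; the variable names and the exact read options are placeholders, not the real ones from my script):

# Simplified sketch of the calling code (names and options are placeholders):
# originalDf is the existing geocache read from S3; newDf is built earlier
# from the geocoding step and is omitted here.
original_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://" + S3_BUCKET + "/" + CACHE_FILE_LOCATION]},
    format="csv",
    format_options={"withHeader": True}
)
originalDf = original_dyf.toDF()
update_geocache(glueContext, originalDf, newDf)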