
I'm new to PySpark and AWS Glue, and I'm having an issue when I try to write out a file with Glue. When I try to write some output to S3 using Glue's write_dynamic_frame_from_options, it throws an exception saying

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 199.0 failed 4 times, most recent failure:
 Lost task 0.3 in stage 199.0 (TID 7991, 10.135.30.121, executor 9): java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 7, schema size: 6
CSV file: s3://************************************cache.csv
    at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:180)
    at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:176)
    at scala.Option.foreach(Option.scala:257)
    at .....

It seems like it's saying that my DataFrame's schema has 6 fields, but the CSV has 7. I don't understand which CSV it's talking about, because I am actually trying to create a new CSV from the DataFrame... Any insight into this specific issue, or into how the write_dynamic_frame_from_options method works in general, would be very helpful!
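To make the mismatch concrete, here is a tiny sketch of what I understand the message to be comparing, using the six columns from the geocache schema shown in the code below (I don't actually know what the seventh header column would be, so the extra name here is purely hypothetical):

# Illustration only, not the real job code: the error compares the number of
# columns in the CSV file's header row to the number of fields in the schema.
header_row = "location_key,addr1,addr2,zip,lat,lon,extra_col"  # hypothetical 7-column header
schema_fields = ["location_key", "addr1", "addr2", "zip", "lat", "lon"]  # the 6 fields I see
print(len(header_row.split(",")), len(schema_fields))  # prints: 7 6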

Here is the source code for the function in my job that is causing this issue.



def update_geocache(glueContext, originalDf, newDf):
    logger.info("Got the two df's to union")
    logger.info("Schema of the original df")
    originalDf.printSchema()
    logger.info("Schema of the new df")
    newDf.printSchema()
    # add the two Dataframes together
    unioned_df = originalDf.unionByName(newDf).distinct()
    logger.info("Schema of the union")
    unioned_df.printSchema()
    # root
    #  |-- location_key: string (nullable = true)
    #  |-- addr1: string (nullable = true)
    #  |-- addr2: string (nullable = true)
    #  |-- zip: string (nullable = true)
    #  |-- lat: string (nullable = true)
    #  |-- lon: string (nullable = true)

    # Create just 1 partition, because there is so little data
    unioned_df = unioned_df.repartition(1)
    logger.info("Unioned the geocache and the new addresses")
    # Convert back to dynamic frame
    dynamic_frame = DynamicFrame.fromDF(
        unioned_df, glueContext, "dynamic_frame")
    logger.info("Converted the unioned tables to a Dynamic Frame")
    # Write data back to S3
    # THIS IS THE LINE THAT THROWS THE EXCEPTION
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": "s3://" + S3_BUCKET + "/" + TEMP_FILE_LOCATION
        },
        format="csv"
    )
  • It looks like your header might have an additional comma, or an extra column. Can you post the header and a record in your question? Also, while reading, try disabling the header: dyF = glueContext.create_dynamic_frame.from_options('s3', {'paths': ['s3://path']}, 'csv', {'withHeader': False}) – Prabhakar Reddy Sep 17 '20 at 02:45
  • Thank you @PrabhakarReddy! I'll try withHeader: False to see what happens... But I don't understand your first comment. You asked that I post the header; what header are you referring to? Shouldn't it just write my DynamicFrame into a CSV? In the code above you can see the schema of the df – SGolds Sep 17 '20 at 13:44
  • I was talking about the source – Prabhakar Reddy Sep 17 '20 at 13:56
  • I just ran it again with "withHeaders": False, and I'm still getting the same exception – SGolds Sep 17 '20 at 14:07
  • You can try passing the same flag while writing. Also, did you enable the Glue catalogue on this job? If yes, try disabling that too – Prabhakar Reddy Sep 17 '20 at 14:11
  • Thanks, that same flag doesn't seem to have any effect on the write. I think there is a "writeHeaders" flag that I will try out. And no, I am not using a data catalogue – SGolds Sep 17 '20 at 15:35
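
For reference, here is a formatted sketch of the read/write option combination discussed in the comments above. The withHeader and writeHeader format options come from the thread; the bucket names, paths, and surrounding variables are placeholders rather than the actual job code.

# Read the existing cache without treating the first row as a header,
# as suggested in the comments (placeholder path).
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/geocache/"]},
    format="csv",
    format_options={"withHeader": False}
)

# Write the result back out, controlling the header row on the write side with
# the CSV writeHeader format option (placeholder path; dynamic_frame is the
# frame produced in update_geocache above).
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/temp/"},
    format="csv",
    format_options={"writeHeader": False}
)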

0 Answers