I have an AWS Glue job that was slightly modified; only the read was changed. The job runs fine, but the data types of my columns have changed: where I previously had bigint, I now just have int. This is causing an EMR job that depends on these files to error out due to the schema mismatch. I'm not sure what would cause this, since the mapping did not change, so any insight would be great. Here is the old and new code:
///OLD
val inputsourceDF = spark.read.format("json").load(inputFilePath)
val inputsource = DynamicFrame(inputsourceDF, glueContext)
///NEW
val inputsource = glueContext.getSourceWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map("paths" -> Set(inputFilePath))),
    format = "json",
    transformationContext = "inputsource"
  ).getDynamicFrame()
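A quick way to confirm that the narrowing happens at the read itself (and not later at the write) would be to print the inferred schema right after the new read; just a sketch using the standard printSchema call:
// Quick check (sketch): print the schema immediately after the read, before
// any transforms or the write, to see whether id is already inferred as int
// by the DynamicFrame JSON reader.
inputsource.printSchema()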
///WRITE which did not change
val inputsink = glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(s"""{"path": "$inputOutputFilePath"}"""),
    transformationContext = "inputdatasink",
    format = "parquet"
  ).writeDynamicFrame(inputdropnullfields.coalesce(inputPartitionCount))
And these are the tables created by crawling the output files after the Glue job runs:
CREATE EXTERNAL TABLE `input_new`(`id` int)
CREATE EXTERNAL TABLE `input_old`(`id` bigint)
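As a possible stopgap (just a sketch, untested; the variable name inputsourceCast is my own), I could explicitly cast the column back to long right after the new read with applyMapping and feed that into the rest of the pipeline, but I'd still like to understand why the DynamicFrame JSON reader infers a narrower type than spark.read did:
// Possible workaround sketch (untested): cast id back to long right after the
// read so the Parquet output matches the old bigint schema.
// Tuple order is (source column, source type, target column, target type).
val inputsourceCast = inputsource.applyMapping(
  mappings = Seq(("id", "int", "id", "long")),
  transformationContext = "inputsourceCast"
)
// The downstream DropNullFields / write would then consume inputsourceCast instead of inputsource.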
We made this change so that we could use job bookmarks; any help would be appreciated.