I have a Lambda job that infrequently writes a Parquet file to an S3 bucket and its Glue table using AWS Wrangler (awswrangler).
This Glue table appears to be increasing the table version number every time there is new data, even though the schema is unchanged.
I do not think the problem is with the Lambda job or Wrangler, since the Parquet files are written as expected. I have also tested that code separately, and it works as expected.
Something is going on with the Glue Data Catalog table that makes it increment the version even though the schema has not changed.
I have compared the underlying Parquet files to see whether the schema, data types, etc. change between updates, and they do not. I have also compared the Glue table versions via the console and the AWS CLI (aws glue get-table-versions) and found no differences there either; only UpdateTime and VersionId change.
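For anyone wanting to repeat the version comparison, here is a minimal sketch of how it could be scripted rather than done by eye. It assumes the output of aws glue get-table-versions has already been loaded as Python dicts. The helper names (flatten, diff_versions, IGNORED_KEYS) and the sample data are made up for illustration and are not part of any AWS library.

```python
# Fields that are expected to differ between any two versions.
IGNORED_KEYS = {"UpdateTime", "VersionId"}


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a {"dotted.path": value} mapping."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in IGNORED_KEYS:
                continue  # skip fields that always change
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


def diff_versions(old, new):
    """Return {path: (old_value, new_value)} for every differing field."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


# Made-up sample entries shaped like items of "TableVersions".
v1 = {
    "VersionId": "1",
    "Table": {
        "Name": "log",
        "UpdateTime": "2021-01-01T00:00:00",
        "Parameters": {"example_key": "a"},
        "StorageDescriptor": {"Columns": [{"Name": "time", "Type": "string"}]},
    },
}
v2 = {
    "VersionId": "2",
    "Table": {
        "Name": "log",
        "UpdateTime": "2021-01-02T00:00:00",
        "Parameters": {"example_key": "b"},
        "StorageDescriptor": {"Columns": [{"Name": "time", "Type": "string"}]},
    },
}

print(diff_versions(v1, v2))
```

An empty result means the two versions differ only in the ignored fields, which is the situation described above. Flattening to dotted paths also surfaces differences in nested sections such as Parameters or StorageDescriptor that are easy to miss in the console.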
I have tried to recreate my setup elsewhere with the same code and could not reproduce the issue there. I have also tried deleting and recreating the Glue table in the same place, but the issue recurs.
Question: What could be causing my Glue table version numbers to increase when there are no schema changes?
Note: The code in question looks like this. It is part of a bigger function (it really just generates logs of what the main Lambda function is doing). It works fine on its own and does not use variables etc. from the rest of the code. I don't see how this could be the issue, but I am including it here anyway.
#other functions do some things when triggered by a new file in another s3 bucket
#this function is just logging which files were processed. It's the Glue table from these log files which is having issues with the version number increasing every time a new log file is added.
import awswrangler as wr  # the importable module name is awswrangler (no hyphen)

def log(resource, filename):
    log_df = build_log(resource, filename)  # for building the log df, just columns of date, time, file used etc
    wr.s3.to_parquet(
        df=log_df,
        path=log_path(),  # s3 bucket where parquet logs are being put
        dataset=True,
        catalog_versioning=False,
        database="MYDB",
        partition_cols=['date'],
        table='log',
        mode='append'
    )