
I have a Lambda job that infrequently writes a Parquet file to an S3 bucket/Glue table using AWS Wrangler.
The Glue table's version number appears to increase every time new data arrives, even though the schema is unchanged.

I do not think the problem is with the Lambda job or Wrangler, since the Parquet files are written as expected. I have also tested that code separately and it works as expected.
Something in the Glue Data Catalog table is causing the version to increase even though the schema has not changed.

I have compared the underlying Parquet files to see whether the schema, data types, or anything similar changed between updates, and found no differences. I have also compared the Glue table versions via the console and the AWS CLI (aws glue get-table-versions) and found no differences there either (only UpdateTime and VersionId change).
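The version comparison described above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. `diff_dicts` and `diff_table_versions` are hypothetical helper names, not part of boto3 or AWS Wrangler. It fetches all versions through the Glue `get_table_versions` API and lists the fields that differ between consecutive versions, skipping `UpdateTime` and `VersionId`. It also assumes that `VersionId` values are numeric strings.

```python
from typing import Any

# Keys expected to change on every version (assumption, based on the observation above).
IGNORED_KEYS = {"UpdateTime", "VersionId"}


def diff_dicts(old: dict[str, Any], new: dict[str, Any], prefix: str = "") -> list[str]:
    """Return dotted paths whose values differ between two nested dicts."""
    changed: list[str] = []
    for key in sorted(set(old) | set(new)):
        if key in IGNORED_KEYS:
            continue
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse so that e.g. Parameters or StorageDescriptor are compared field by field.
            changed.extend(diff_dicts(a, b, prefix=f"{path}."))
        elif a != b:
            changed.append(path)
    return changed


def diff_table_versions(database: str, table: str) -> dict[tuple[str, str], list[str]]:
    """Map each pair of consecutive version IDs to the list of fields that changed."""
    import boto3  # imported lazily so diff_dicts can be used without AWS access

    glue = boto3.client("glue")
    versions = []
    paginator = glue.get_paginator("get_table_versions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        versions.extend(page["TableVersions"])

    # Assumes VersionId is a numeric string; sort oldest first.
    versions.sort(key=lambda v: int(v["VersionId"]))
    return {
        (prev["VersionId"], cur["VersionId"]): diff_dicts(prev["Table"], cur["Table"])
        for prev, cur in zip(versions, versions[1:])
    }
```

Calling `diff_table_versions("MYDB", "log")` would return an empty list for every pair of versions if the only changes really are `UpdateTime` and `VersionId`. Any other path that shows up, for example under `Parameters`, is a candidate for what triggers the new version.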

I have tried to recreate my setup elsewhere with the same code and could not reproduce the issue. I have also tried deleting and recreating the Glue table in the same place, but the issue reoccurs.

Question: What could be causing my Glue table version numbers to increase when there are no schema changes?

Note: The code in question looks like this. It's part of a bigger function (this is really just generating logs of what the main lambda function is doing). It works fine on its own and doesn't use variables etc from the rest of the code. I don't see how this could be the issue but including it here anyway.

# Other functions do some things when triggered by a new file in another S3 bucket.

# This function just logs which files were processed. The Glue table built from these log files is the one whose version number increases every time a new log file is added.
import awswrangler as wr  # the module name is awswrangler; "aws-wrangler" is not a valid Python identifier
def log(resource, filename):
    log_df = build_log(resource, filename) # for building the log df, just columns of date, time, file used etc
    wr.s3.to_parquet(
        df=log_df,
        path=log_path(), #s3 bucket where parquet logs are being put
        dataset=True,
        catalog_versioning=False,
        database="MYDB",
        partition_cols=['date'],
        table='log',
        mode='append'
    )
John Rotenstein
km-if-so
  • We saw similar behavior, and it seemed to be related to tracking the number of files and total size. We are planning to test different SerDe types to see if that resolves the issue. We also found that this seems to happen more often on non-partitioned tables, although some partitioned tables using `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` showed the same behavior. – DetroitMike May 24 '23 at 04:02

1 Answer


I think this is due to partitioning. You are partitioning by date, so I guess a new partition is added for every day (or other time unit). Those new partitions would be the reason the table version is being incremented.

Robert Kossendey
  • A good theory, but it also happens within a single day, i.e. when no new date partition has been created. For example, if a log is created at 10pm we get table version 1, then if another log is created at 10.02pm we get table version 2, even though it is in the same partition. I've also tested the behaviour with a stripped-down version and it behaves correctly (the table version stays the same even when new partitions are added). – km-if-so Jul 27 '21 at 11:06
  • Hmm, I was sure that that would be it :D. Glue is so incredibly unreliable... – Robert Kossendey Jul 28 '21 at 07:12