
I'm creating a DataFrame like so:

    concatdatafile = pd.concat(datafile, axis=0, ignore_index=True, sort=False)
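For context, here is a minimal, self-contained sketch of the load-and-concatenate step (the data and column values below are hypothetical; the real code reads CSV files, per the comments):

```python
import io
import pandas as pd

# Hypothetical in-memory "files" standing in for the real CSV inputs.
csv_chunks = [
    io.StringIO("FS Seal Time (sec),FS Cool Time (sec)\n1.5,2.0\n"),
    io.StringIO("FS Seal Time (sec),FS Cool Time (sec)\n3.5,4.0\n"),
]

# Read each chunk, then stack them into a single frame, resetting the
# index and keeping the existing column order.
datafile = [pd.read_csv(chunk) for chunk in csv_chunks]
concatdatafile = pd.concat(datafile, axis=0, ignore_index=True, sort=False)
print(concatdatafile.shape)  # (2, 2)
```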

then checking some of the field data types before publishing:

    logger.info("  *** concatdatafile['FS Seal Time (sec)'].dtypes={}".format(concatdatafile['FS Seal Time (sec)'].dtypes))
    logger.info("  *** concatdatafile['FS Cool Time (sec)'].dtypes={}".format(concatdatafile['FS Cool Time (sec)'].dtypes))

The next statement I have is a write:

    response_wr = wr.s3.to_parquet(df=concatdatafile,
                                   path=s3_outputpath + 'full_data/',
                                   dataset=True,
                                   partition_cols=["MachineId", "year_num", "month_num", "day_num"],
                                   database='myDB',
                                   table='myDBTable',
                                   mode='append')
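One way to pin the catalog types rather than let them be inferred is awswrangler's `dtype` argument to `wr.s3.to_parquet`, which maps column names to Athena/Glue types. The sketch below is an assumption-laden illustration, not the original code: treating these two columns as `string` is inferred from "later data will be strings".

```python
# Hypothetical type pinning for the two columns named in the post.
# Keys are DataFrame column names; values are Athena/Glue type names.
forced_types = {
    "FS Seal Time (sec)": "string",
    "FS Cool Time (sec)": "string",
}

# The write itself needs AWS credentials and the real frame, so it is
# shown commented out; `dtype=forced_types` is the relevant addition.
# response_wr = wr.s3.to_parquet(df=concatdatafile,
#                                path=s3_outputpath + 'full_data/',
#                                dataset=True,
#                                dtype=forced_types,
#                                partition_cols=["MachineId", "year_num", "month_num", "day_num"],
#                                database='myDB',
#                                table='myDBTable',
#                                mode='append')
print(forced_types["FS Seal Time (sec)"])  # string
```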

When I run this code in Glue, I get: [log screenshot 1]

(Note: I cleared out the glue definition before running, so it would have fresh metadata)

But in the Glue table, I'm seeing the fields change types like so: [log screenshot 2]

Question: Why is it not respecting the data types I'm publishing? It sees that the data looks like doubles (for now), but that's irrelevant: later data will be strings, so I want it not to override the types I'm sending.

  • It appears you are using pandas, was there a specific need to use that? How are you loading your datafile frame? – jonlegend Jun 07 '21 at 19:05
  • I'm using pandas pd.read_csv to load the data in. I'm also using pandas .astype to nudge columns to certain formats. – Alex P Jun 07 '21 at 20:08
  • pandas would force you to do additional conversions between pandas dataframes and pyspark dataframes, e.g. with Apache Arrow. You would likely be better off performance wise to stay just with PySpark instead. There are capabilities directly in glue dynamic frames to coerce data https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-medicaid.html. Glue can also read the CSV in either with a glue crawler (from_catalog) or directly from S3 (from_options) https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html. – jonlegend Jun 08 '21 at 17:02
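The `.astype` nudge mentioned in the comments can be sketched like this (hypothetical data, not from the post): casting a numeric-looking column to pandas' string dtype before writing, so the published type stays stable even when the current values look like doubles.

```python
import pandas as pd

# Hypothetical frame with a column that currently holds numeric data.
df = pd.DataFrame({"FS Seal Time (sec)": [1.5, 2.0]})

# Coerce to pandas' dedicated string dtype before publishing, so the
# column type no longer depends on what the values happen to look like.
df["FS Seal Time (sec)"] = df["FS Seal Time (sec)"].astype("string")
print(df["FS Seal Time (sec)"].dtype)  # string
```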

0 Answers