
I have an external table in Redshift Spectrum whose DDL defines a datetime column like the one below:

collector_tstamp TIMESTAMP WITHOUT TIME ZONE

Objective: I am trying to write a certain set of data to Parquet and then add the partition in Spectrum to check whether the values are parsed correctly.

After creating the Parquet file, the following is the metadata of 'collector_tstamp' from the corresponding file:

{"metadata": null, "field_name": "collector_tstamp", "name": "collector_tstamp", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}

Before this, I am converting the pandas DataFrame column into a datetime data type with the following:

df['collector_tstamp'] = pd.to_datetime(df['collector_tstamp'])

Issue: When I query the data in Redshift Spectrum, I see the values below, so the parsing is clearly wrong, but I am not sure which data type I should convert to so that it parses correctly:

collector_tstamp
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01
36332-04-23 15:29:01

I also tried something like this, but I still get the values above:

df['collector_tstamp'] = pd.to_datetime(df['collector_tstamp'], infer_datetime_format=True)

I also tried this, but I still see the same values when I query:

df['collector_tstamp'] = df['collector_tstamp'].astype('datetime64[s]')

Sample data:

collector_tstamp
2019-01-04 04:02:36
2019-01-04 04:03:41
2019-01-04 04:03:45
2019-01-04 04:04:11

1 Answer


Nanosecond timestamps, as written by pyarrow by default, are quite new and are probably not understood correctly by the current Redshift version. Have a look at the documentation at https://arrow.apache.org/docs/python/parquet.html and either try writing the files using flavor='spark' or read about the other settings in the section "Storing timestamps".

As you probably cannot pass all of these parameters through pandas.DataFrame.to_parquet, you need to use code like the following to write a Parquet file:

import pyarrow as pa
import pyarrow.parquet as pq

df = ...  # your pandas DataFrame
table = pa.Table.from_pandas(df)           # convert the DataFrame to an Arrow table
pq.write_table(table, 'filename.parquet')  # write the table out as a Parquet file
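
Building on that, here is a sketch of the timestamp-related write_table options described in the "Storing timestamps" section; coerce_timestamps and allow_truncated_timestamps are existing pyarrow parameters, and 'filename.parquet' is just a placeholder:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)  # df is the DataFrame from the question

# Coerce nanosecond timestamps down to millisecond precision so that Redshift
# Spectrum can read them; allow_truncated_timestamps=True suppresses the error
# pyarrow raises when sub-millisecond precision is dropped.
pq.write_table(table, 'filename.parquet',
               coerce_timestamps='ms',
               allow_truncated_timestamps=True)

# Alternatively, flavor='spark' applies Spark-compatible settings, including a
# timestamp representation that older readers understand.
# pq.write_table(table, 'filename.parquet', flavor='spark')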
  • Thanks. I was already using pyarrow, but coerce_timestamps='ms' and allow_truncated_timestamps=True helped me parse it correctly – Gagan Aug 31 '19 at 22:37