
I have a parquet file that I am reading from S3 using fastparquet/pandas. The file has a column containing the date 2022-10-06 00:00:00, but it comes back as 1970-01-20 06:30:14.400. Please see the code and errors below. I am not sure why this is happening; 2022-09-01 00:00:00 seems to be fine. If I choose "pyarrow" as the engine, it fails with an exception instead.

pyarrow error:
    pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000

Please advise.

fastparquet error:

OverflowError: value too large
Exception ignored in: 'fastparquet.cencoding.time_shift'
OverflowError: value too large
OverflowError: value too large

code:

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket="blah", Key="blah1")
df = pd.read_parquet(io.BytesIO(obj['Body'].read()), engine="fastparquet")
Bill
  • `2022-10-06 00:00:00` is `1665014400000000` microseconds since the epoch; that same integer read as nanoseconds is `1970-01-20 06:30:14.400` (see the sketch after these comments). It looks to me like your data is mixing us and ns timestamps. – 0x26res Apr 03 '23 at 08:04
  • Does this answer your question? [pyarrow timestamp datatype error on parquet file](https://stackoverflow.com/questions/75897897/pyarrow-timestamp-datatype-error-on-parquet-file) – 0x26res Apr 03 '23 at 08:04
  • No, that does not answer it @0x26res. I am not sure why the above date causes an overflow; it lies between the years 1677 and 2262 per https://stackoverflow.com/questions/55323548/what-determines-pandas-minimum-and-maximum-timestamp. Any ideas why this is happening? If I open the parquet file in an online parquet viewer, I see 2022-10-06 00:00:00. – Bill Apr 04 '23 at 02:30
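
To make the comment's arithmetic concrete, here is a minimal check; the microsecond value is recomputed here rather than taken from the file:

    import pandas as pd

    micros = 1_665_014_400_000_000            # 2022-10-06 00:00:00 in microseconds since the epoch
    print(pd.to_datetime(micros, unit="us"))  # 2022-10-06 00:00:00
    print(pd.to_datetime(micros, unit="ns"))  # 1970-01-20 06:30:14.400 -- same integer misread as nanoseconds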

1 Answer


When pyarrow and fastparquet agree that the data isn't valid, I expect that must be the case. As a comment suggests, it sounds like there is confusion over the column's time units. You didn't say where the data came from, but at a wild guess, this may be because of the change in the parquet standard (roughly v1 -> v2), in which the former "converted" types were extended by new "logical" types. Newer parquet files tend to have BOTH styles of type declaration, so there is a chance they are inconsistent.
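
One way to see whether the two declarations disagree is to inspect the raw parquet schema, for example with pyarrow; the file path below is a placeholder for wherever you saved the S3 object:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")  # placeholder path; a BytesIO of the S3 body also works
    print(pf.schema)                     # low-level parquet schema: physical type plus converted/logical annotations
    print(pf.schema_arrow)               # the Arrow view of the same schema, e.g. timestamp[us] vs timestamp[ns]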

In the fastparquet main branch (unreleased), there has been some work to consolidate the different ways of declaring time types. Maybe for your data it now does the right thing.
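
If you want to try that, a minimal sketch is to install straight from the main branch (assuming the project's GitHub repository) and re-run the same read:

    # pip install git+https://github.com/dask/fastparquet    <- unreleased main branch
    import io
    import pandas as pd

    with open("data.parquet", "rb") as f:  # stand-in for the S3 object body
        df = pd.read_parquet(io.BytesIO(f.read()), engine="fastparquet")
    print(df.dtypes)                       # check whether the timestamp column now reads correctly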

mdurant