
I am receiving an out of bounds timestamp error when attempting to convert a pandas DataFrame to a pyarrow Table and write it to a parquet dataset. From some research, it seems to be a result of pandas using nanosecond precision while pyarrow reads the column at microsecond precision, so a value that fits in microseconds can overflow when cast to nanoseconds, I believe.
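For context, pandas stores datetimes as int64 nanoseconds since the epoch, which caps its representable range well short of year 9999. A quick illustration:

import pandas as pd

# The nanosecond representation limits pandas datetimes to
# roughly 1677-09-21 through 2262-04-11:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807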

import os

import cx_Oracle
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Credentials are read from the environment.
connection = cx_Oracle.connect(os.getenv('USER'), os.getenv('__OPW'), os.getenv('DB_SERVICE'))

# Stream the table in chunks of 1,000 rows and write each chunk
# to a partitioned parquet dataset.
gen = pd.read_sql('SELECT * FROM myschema.mytable where rownum < 10001', con=connection, chunksize=1_000)
for df in gen:
    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(table, root_path='/tmp/dataset', partition_cols=['my_part_col'])

ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000

When I comment out the last line:

# pq.write_to_dataset(table, root_path='/tmp/dataset', partition_cols=['my_part_col'])

...and re-run, the error is no longer produced, so it seems to be occurring in the conversion from the pyarrow Table to parquet.

Is there a known workaround for this?

Thanks.

Update:

Here's the full traceback...

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/parquet.py", line 1754, in write_to_dataset
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1114, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
  File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000

1 Answer


253402214400000000 microseconds from the epoch falls on 9999-12-31, right at the doorstep of the year 10,000.
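This is easy to verify with the standard library:

import datetime

# 253402214400000000 microseconds is exactly 2,932,896 days
epoch = datetime.datetime(1970, 1, 1)
print(epoch + datetime.timedelta(microseconds=253402214400000000))
# -> 9999-12-31 00:00:00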

Few libraries support this range for timestamps. You have a few options:

  • Truncate all the values that are out of range before converting to arrow/parquet
  • Convert the problematic column to an int64 or uint64 (instead of using a timestamp)
  • Use dates instead of timestamps. If you're looking that far into the future, you probably don't care what time of day it is, and dates have a much wider range. (See the sketch after this list for the last two options.)
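Here is a rough sketch of the last two options at the Arrow level, assuming table is the pa.Table from the question and that the offending column is named my_ts_col (a hypothetical name, adjust to your schema):

import pyarrow as pa

idx = table.schema.get_field_index('my_ts_col')

# Keep the raw epoch microseconds as int64 instead of a timestamp.
table = table.set_column(
    idx, 'my_ts_col', table.column(idx).cast(pa.int64()))

# Or keep only the calendar date; date32 easily covers year 9999.
# safe=False permits dropping the time-of-day component.
# table = table.set_column(
#     idx, 'my_ts_col', table.column(idx).cast(pa.date32(), safe=False))

Either cast should also sidestep the internal to_pandas() round-trip visible in the traceback, since int64 and date32 columns convert back to pandas without the nanosecond cast.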

EDIT:

If this is the way your DB represents an invalid/missing date, you should replace all these dates with pd.NaT before converting to arrow, for instance:
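A minimal sketch, assuming 9999-12-31 is the sentinel and the column is again the hypothetical my_ts_col:

import datetime
import pandas as pd

SENTINEL = datetime.datetime(9999, 12, 31)

# Datetimes beyond the datetime64[ns] range typically stay in an
# object-dtype column; swap the sentinel for NaT, after which the
# column coerces cleanly to datetime64[ns].
df['my_ts_col'] = df['my_ts_col'].replace(SENTINEL, pd.NaT)
df['my_ts_col'] = pd.to_datetime(df['my_ts_col'])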
