
The following code fails with

pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: -221047-10-07 10:43:35

from pathlib import Path

import dask.dataframe as dd
import numpy as np
import pandas as pd
import tempfile


def run():
    temp_folder = tempfile.TemporaryDirectory()
    rng = np.random.default_rng(42)
    filenames = []
    for i in range(2):
        filename = Path(temp_folder.name, f"file_{i}.gzip")
        filenames.append(filename)
        df = pd.DataFrame(
            data=rng.normal(size=(365, 1500)),
            index=pd.date_range(
                start="2021-01-01",
                end="2022-01-01",
                closed="left",
                freq="D",
            ),
        )
        df.columns = df.columns.astype(str)
        df.to_parquet(filename, compression="gzip")
    df = dd.read_parquet(filenames)
    result = df.mean().mean().compute()
    temp_folder.cleanup()
    return result


if __name__ == "__main__":
    run()

Why does this (sample) code fail?

What I'm trying to do: the loop simulates creating data that is larger than memory, in batches. In the next step I'd like to read that data back from the files and work with it in dask.

Observations:

If I only read one file

for i in range(1):

the code works.

If I don't use the DatetimeIndex

df = pd.DataFrame(
    data=rng.normal(size=(365, 1500)),
)

the code works.

If I use pandas only

df = pd.read_parquet(filenames)
result = df.mean().mean()

the code works (which is odd, since read_parquet in pandas only expects a single path, not a collection).

If I use the distributed client with concat as suggested here, I get a similar error (pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 68024-12-20 01:46:56), so I omitted the client in my sample.
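
For reference, this is roughly the distributed variant I tried (a minimal sketch, not the exact code from the linked suggestion):

from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # local cluster
# read each file into its own dask dataframe and concatenate them
parts = [dd.read_parquet(f) for f in filenames]
df = dd.concat(parts)
result = df.mean().mean().compute()
client.close()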

Arigion
  • Please provide the parquet engine you are using (pyarrow or fastparquet) and the version – mdurant Aug 10 '21 at 14:17
  • @mdurant I see fastparquet 0.7.1 and pyarrow 3.0.0 in the dependencies (via other packages). Maybe pandas and dask are using incompatible versions? I'll try in a fresh env. Thanks. – Arigion Aug 10 '21 at 15:41
  • Please explicitly provide `engine=` to your parquet commands, as I think pandas and dask have different defaults. – mdurant Aug 11 '21 at 13:13
  • @mdurant Thanks. Providing the engine helped. I've answered my own question for the time being. If you'd like to provide an answer yourself, I'm happy to accept it as the solution and delete mine. – Arigion Aug 15 '21 at 09:09

1 Answer


Thanks to the helpful comment from @mdurant, providing the engine helped:

engine='fastparquet'  # or 'pyarrow'
df.to_parquet(filename, compression="gzip", engine=engine)
df = dd.read_parquet(filenames, engine=engine)

Apparently engine='auto' selects different engines in dask vs. pandas when more than one parquet engine is installed.
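
One way to check which engines are installed (and therefore what 'auto' may pick up) is a small sketch like this:

# check which parquet engines are importable and their versions;
# engine='auto' picks whichever it finds according to its own preference order
for name in ("pyarrow", "fastparquet"):
    try:
        module = __import__(name)
        print(name, module.__version__)
    except ImportError:
        print(name, "not installed")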

Side note: I've tried different combinations of the engine; this combination triggers the error from the question:

df.to_parquet(filename, compression="gzip", engine='pyarrow')
df = dd.read_parquet(filenames, engine='fastparquet')
Arigion