The following code fails with

```
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: -221047-10-07 10:43:35
```
```python
import tempfile
from pathlib import Path

import dask.dataframe as dd
import numpy as np
import pandas as pd


def run():
    temp_folder = tempfile.TemporaryDirectory()
    rng = np.random.default_rng(42)
    filenames = []
    # Write two parquet files, each holding one year of daily data.
    for i in range(2):
        filename = Path(temp_folder.name, f"file_{i}.gzip")
        filenames.append(filename)
        df = pd.DataFrame(
            data=rng.normal(size=(365, 1500)),
            index=pd.date_range(
                start="2021-01-01",
                end="2022-01-01",
                closed="left",  # `inclusive="left"` in pandas >= 1.4
                freq="D",
            ),
        )
        # Parquet requires string column names.
        df.columns = df.columns.astype(str)
        df.to_parquet(filename, compression="gzip")
    # Read both files back as one dask dataframe and reduce.
    df = dd.read_parquet(filenames)
    result = df.mean().mean().compute()
    temp_folder.cleanup()
    return result


if __name__ == "__main__":
    run()
```
Why does this (sample) code fail?
What I'm trying to do: The loop mimics creating, in batches, data that is larger than memory. In the next step I'd like to read that data back from the files and work with it in dask.
Observations:
If I read only one file,

```python
for i in range(1):
```

the code works.
If I don't use the DatetimeIndex,

```python
df = pd.DataFrame(
    data=rng.normal(size=(365, 1500)),
)
```

the code works.
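Since the DatetimeIndex seems to be implicated, one way to check whether the on-disk index values themselves are sane is to inspect the parquet row-group statistics directly. A diagnostic sketch (it assumes pandas stored the unnamed index under its default column name `__index_level_0__`):

```python
import pyarrow.parquet as pq

# Inspect the min/max statistics of the index column in the first file.
rg = pq.ParquetFile(filenames[0]).metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.path_in_schema == "__index_level_0__":
        print(col.statistics)
```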
If I use pandas only,

```python
df = pd.read_parquet(filenames)
result = df.mean().mean()
```

the code works (which is odd, since `read_parquet` in pandas is documented to take a single path, not a collection).
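For comparison, the per-file variant that matches the documented single-path API would look like this (a sketch; I'd expect it to behave the same as the collection call above):

```python
import pandas as pd

# Read each file separately and concatenate, as the single-path API suggests.
df = pd.concat([pd.read_parquet(f) for f in filenames])
result = df.mean().mean()
```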
If I use the distributed client with `concat` as suggested here, I get a similar error: `pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 68024-12-20 01:46:56`. Therefore I omitted the client from my sample.
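For reference, the client + `concat` variant looked roughly like this (a simplified sketch; `Client()` with default local-cluster settings):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster with default settings
# Read each file as its own dask dataframe and concatenate them.
df = dd.concat([dd.read_parquet(f) for f in filenames])
result = df.mean().mean().compute()
client.close()
```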