0

I’m trying to read a CSV into Pandas, and then write it to Parquet. The challenge is that the CSV has a date column with a value of 3000-12-31, and apparently Pandas has no way to store that value as an actual date. Because of that, PyArrow fails to read the date value.

An example file and code to reproduce is

test.csv

t
3000-12-31
import pandas as pd
import pyarrow as pa
df = pd.read_csv("test.csv", parse_dates=["t"])
schema = pa.schema([pa.field("t", pa.date64())])
table = pa.Table.from_pandas(df, schema=schema)

This gives (a somewhat unhelpful error)

TypeError: an integer is required (got type str)

What's the right way to do this?

Kris Harper
  • 5,672
  • 8
  • 51
  • 96
  • You can't have that date represented as a datetime64; it's too large. The biggest date you can have is `pd.Timestamp.max` or `Timestamp('2262-04-11 23:47:16.854775807')`. So even though you specify it to parse the date, it tries and fails so it simply keeps the column as the string. Your only real option is to store it as a `Period`, `df['t'] = df['t].apply(pd.Period)`, but I don't know if pyarrow supports that dtype – ALollz Nov 02 '20 at 16:58
  • Yeah, I did try that, and indeed PyArrow doesn't seem to accept `Period` as a date type. So I need some workaround I guess. – Kris Harper Nov 02 '20 at 17:22

1 Answers1

1

Pandas datetime columns (which use the datetime64[ns] data type) indeed cannot store such dates.

One possible workaround to convert the strings to datetime.datetime objects in an object dtype column. And then pyarrow should be able to accept them to create a date column. This conversion could eg be done with dateutil:

>>> import dateutil
>>> df['t'] = df['t'].apply(dateutil.parser.parse)
>>> df
                     t
0  3000-12-31 00:00:00

>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]

or if you use a fixed format, using datetime.date.strptime is probably more reliable:

>>> import datetime
>>> df['t'] = df['t'].apply(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"))
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
joris
  • 133,120
  • 36
  • 247
  • 202
  • This works really well, thanks. I think the only issue now is that `parse` (or maybe `parse` in combination with `apply`) is pretty slow. I think that's something I'll just have to live with. – Kris Harper Nov 02 '20 at 22:23
  • Yes, it is expected that this is slower than `pd.to_datetime`. An alternative might be to use numpy instead of pandas (`np.array(["2020-01-01"]).astype("datetime64[D]")` parses the strings as well). But then the conversion to a pyarrow table is a bit more complex/manual, as you can't use `from_pandas`. – joris Nov 03 '20 at 07:41