I'm starting to experiment with pyarrow, and I'm hitting a strange error when writing a CSV file. Say I have this CSV input as dates.csv:
dates
2022-10-04T15:52:25.000Z
2022-03-29T08:08:13.000Z
2023-01-05T19:24:13.000Z
2020-12-04T18:56:30.000Z
Now, if I just try to load and write back to CSV, here's what I get:
In [1]: from pyarrow import csv
In [2]: t = csv.read_csv("dates.csv")
In [3]: t
Out[3]:
pyarrow.Table
dates: timestamp[ns, tz=UTC]
----
dates: [[2022-10-04 15:52:25.000000000,2022-03-29 08:08:13.000000000,2023-01-05 19:24:13.000000000,2020-12-04 18:56:30.000000000]]
In [4]: csv.write_csv(t, "out.csv")
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[4], line 1
----> 1 csv.write_csv(t, "out.csv")
File c:\Users\user\miniconda3\envs\py311\Lib\site-packages\pyarrow\_csv.pyx:1483, in pyarrow._csv.write_csv()
File c:\Users\user\miniconda3\envs\py311\Lib\site-packages\pyarrow\error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: Cannot locate timezone 'UTC': Timezone database not found at "C:\Users\user\Downloads\tzdata"
From the message, it looks like the location of the timezone database is hard coded to the Downloads folder of the user profile (on Windows). Not ideal, but workable if I can find out exactly what I need to place in that folder. Any hints?
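As far as I can tell, recent pyarrow releases include a helper, pyarrow.util.download_tzdata_on_windows(), that downloads the IANA timezone database into exactly that Downloads\tzdata folder. Below is a minimal sketch of how I would try it. The wrapper name ensure_tzdata is my own (hypothetical), and I am assuming the helper is present in the installed pyarrow version and that the machine has network access.

```python
import sys


def ensure_tzdata() -> bool:
    """Try to fetch the timezone database on Windows (hypothetical wrapper).

    Returns True if a download was attempted, False on other platforms,
    where Arrow relies on the timezone database provided by the system.
    """
    if sys.platform != "win32":
        return False
    # Imported lazily so the module is only touched when it is needed.
    # Assumption: this helper exists in the installed pyarrow version.
    from pyarrow import util

    util.download_tzdata_on_windows()  # needs network access
    return True


if __name__ == "__main__":
    print("download attempted:", ensure_tzdata())
```

If this works as I expect, csv.write_csv(t, "out.csv") should then find the database without any further configuration, but I have not verified that.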
Alternatively, I guess I could remove the timezone from the timestamp column, but I couldn't find how that is done in pyarrow.
In the end, I hope the backend will be updated so the location is no longer hard coded and the database is installed along with pyarrow.