I'm starting to experiment with pyarrow, and I'm hitting a strange error when writing a CSV file. Say I have this CSV input as dates.csv:

dates
2022-10-04T15:52:25.000Z
2022-03-29T08:08:13.000Z
2023-01-05T19:24:13.000Z
2020-12-04T18:56:30.000Z

Now, if I just try to load and write back to CSV, here's what I get:

In [1]: from pyarrow import csv

In [2]: t = csv.read_csv("dates.csv")

In [3]: t
Out[3]:
pyarrow.Table
dates: timestamp[ns, tz=UTC]
----
dates: [[2022-10-04 15:52:25.000000000,2022-03-29 08:08:13.000000000,2023-01-05 19:24:13.000000000,2020-12-04 18:56:30.000000000]]

In [4]: csv.write_csv(t, "out.csv")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[4], line 1
----> 1 csv.write_csv(t, "out.csv")

File c:\Users\user\miniconda3\envs\py311\Lib\site-packages\pyarrow\_csv.pyx:1483, in pyarrow._csv.write_csv()

File c:\Users\user\miniconda3\envs\py311\Lib\site-packages\pyarrow\error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Cannot locate timezone 'UTC': Timezone database not found at "C:\Users\user\Downloads\tzdata"

Judging by the error message, the location of the timezone database is hard coded to the profile's Downloads folder (on Windows). Not ideal, but workable if I can find out exactly what needs to be placed in that folder. Any hints?

Alternatively, I guess I could remove the timezone from the timestamp column, but I couldn't find how it's done in pyarrow.

In the end, I hope the backend will be updated so the location is no longer hard coded and the database is installed along with pyarrow.

mrgou
  • See https://stackoverflow.com/questions/74267313/how-to-use-tzdata-file-with-pyarrow-compute-assume-timezone/74292266#74292266 for an answer on how to install a database manually – joris Jul 06 '23 at 13:56
  • There's also an [issue on github](https://github.com/apache/arrow/issues/35600) related to this. I find this a pretty surprising place to look for a tz database file on Windows btw... it's just weird. pyarrow should use [tzdata](https://pypi.org/project/tzdata/) imho. – FObersteiner Jul 06 '23 at 15:28
  • You may be able to bypass this issue either by reading the timestamps as strings and parsing them manually, or by passing your own format that ignores the timezone: `t = csv.read_csv("dates.csv", convert_options=csv.ConvertOptions(timestamp_parsers=["%Y-%m-%dT%H:%M:%S.000Z"]))`. – 0x26res Jul 06 '23 at 17:37
