The original datetimes are stored in a list of dicts:

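import datetime
import pyarrow as pa
from dateutil.tz import tzfile

# tzfile needs a path to the zone file; 'PRC' alone is only how the tz object prints
tz = tzfile('/usr/share/zoneinfo/PRC')
data = [
  {
    "eob": datetime.datetime(2022, 8, 5, 9, 35, tzinfo=tz)
  },
  {
    "eob": datetime.datetime(2022, 8, 5, 9, 40, tzinfo=tz)
  }
]
table = pa.Table.from_pylist(data)
print(table)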

The result is:

pyarrow.Table
eob: timestamp[us, tz=PRC]
----
eob: [[2022-08-05 01:35:00.000000,2022-08-05 01:40:00.000000]]

The datetimes in the table were changed to UTC. How can I create the table without changing the datetimes?

colinshen

2 Answers

Arrow internally stores timestamps as UTC plus timezone information and prints them as such. However, if you format the timestamps with strftime, the result is in the timezone of the timestamp.

import pyarrow as pa
import pyarrow.compute as pc
import datetime
from dateutil.tz import tzfile

tz = tzfile('/usr/share/zoneinfo/PRC')
times = pa.array([datetime.datetime(2022, 8, 5, 9, 35, tzinfo=tz), datetime.datetime(2022, 8, 5, 9, 40, tzinfo=tz)])
table = pa.Table.from_arrays([times], names=["times"])

print(table)
print("\n")
print(pc.strftime(table["times"], "%Y-%m-%d %H:%M:%S"))

Will print:

pyarrow.Table
times: timestamp[us, tz=PRC]
----
times: [[2022-08-05 01:35:00.000000,2022-08-05 01:40:00.000000]]


[
  [
    "2022-08-05 09:35:00.000000",
    "2022-08-05 09:40:00.000000"
  ]
]
Rok
  • Thanks for your response. I need to save the data into parquet files and read it back later. The dates are in UTC, which I cannot use. What should I do? Change the timezone in the dataframe? – colinshen Aug 08 '22 at 13:21
  • As far as I know, storing to parquet and reading it back later should maintain the timezone. The timestamp is stored in UTC by design, and the timezone information is kept in case you need it back in your local timezone. Is that really a problem in your application? If it is, could you give an example of what the problem is? – Rok Aug 09 '22 at 15:08
  • I use it for a trading system. To reduce query time, I need to save the data locally after the market closes. For example, the original data runs from 09:30 to 11:30 (market closes and the data is saved), but in UTC that is 01:30 to 03:30. In the afternoon, the time starts from 13:00 PRC. When I read the history data and append new data to the dataframe, the time sequence is wrong and becomes 01:30-03:30, 13:00-end. It should be 09:30-11:30, 13:00-end. – colinshen Aug 12 '22 at 14:29
  • Are you using multiple timezones in your application? If not you can just remove timezone information. – Rok Aug 14 '22 at 01:31
  • As for the problem you're describing - it seems you're losing timezone information when you save and load. Could you provide a short and working code snippet of what you're doing? – Rok Aug 14 '22 at 01:32
  • I'm not using multiple timezone. See a simple code here: https://gist.github.com/Tatamethues/0e67702f9bbf0d75a6add35d2ac4d213 – colinshen Aug 15 '22 at 02:58
  • If you are operating with one timezone only you can just remove the timezone info and work in local time (see my reply in your gist). – Rok Aug 16 '22 at 11:47
  • Again please note: Arrow stores all times in UTC but those times represent the same instant in time as local times. Depending on your application you might want to convert back to local time but I don't think you actually need to if you're looking at market data. – Rok Aug 16 '22 at 11:49

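Arrow 12.0.0 adds the local_timestamp kernel (pyarrow.compute.local_timestamp), which converts a timezone-aware timestamp to a timezone-naive timestamp in local time. This lets you do the following:

import pyarrow as pa
import pyarrow.compute as pc
import datetime
from dateutil.tz import tzfile

tz = tzfile('/usr/share/zoneinfo/PRC')
times = pa.array([datetime.datetime(2022, 8, 5, 9, 35, tzinfo=tz), datetime.datetime(2022, 8, 5, 9, 40, tzinfo=tz)])
# Drop the timezone but keep the local wall-clock values
local_times = pc.local_timestamp(times)
table = pa.Table.from_arrays([local_times], names=["times"])

print(table)
print("\n")
print(pc.strftime(table["times"], "%Y-%m-%d %H:%M:%S"))

Which should print:

pyarrow.Table
times: timestamp[us]
----
times: [[2022-08-05 09:35:00.000000,2022-08-05 09:40:00.000000]]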


[
  [
    "2022-08-05 09:35:00.000000",
    "2022-08-05 09:40:00.000000"
  ]
]
Rok