
I am trying to load data from a CSV file into a Parquet file using pyarrow. I use ConvertOptions to set each column to its proper data type, and the timestamp_parsers option to specify how the timestamp data should be interpreted. My CSV is below:

time,data
01-11-19 10:11:56.132,xxx

Please see my code sample below.

import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet


convert_dict = {
    'time': pa.timestamp('us', None),
    'data': pa.string()
}

convert_options = csv.ConvertOptions(
    column_types=convert_dict
    , strings_can_be_null=True
    , quoted_strings_can_be_null=True
    , timestamp_parsers=['%d-%m-%y %H:%M:%S.%f']
)

table = csv.read_csv('test.csv', convert_options=convert_options)
print(table)
parquet.write_table(table, 'test.parquet')

Basically, pyarrow doesn't like some strptime values. Specifically in this case, it does not like "%f" which is for fractional seconds (https://www.geeksforgeeks.org/python-datetime-strptime-function/). Any help to get pyarrow to do what I need would be appreciated.

Just to be clear, I can get the code to run if I edit the data to remove the fractional seconds and then drop the "%f" from the timestamp_parsers option. However, I need to maintain the integrity of the data, so this is not an option. To me it seems like either a bug in pyarrow or I'm an idiot and missing something obvious. I'm open to both options; I just want to know which it is.

2 Answers

%f is not supported in pyarrow and most likely won't be as it's a Python specific flag. See discussion here: https://issues.apache.org/jira/browse/ARROW-15883 . PRs are of course always welcome!

As a workaround you could first read timestamps as strings, then process them by slicing off the fractional part and add that as pa.duration to processed timestamps:

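import pyarrow as pa
import pyarrow.compute as pc

# Sample timestamps with nine fractional digits (nanoseconds).
ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
# Parse the first 19 characters (everything before the ".") as whole seconds.
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
# Read the digits after the "." as an integer count of nanoseconds.
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
# Add the fractional part back onto the parsed timestamps.
pc.add(ts2, d)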
Rok

So I have found that, for timestamp data, it is easiest to have the data in the default parser format (ISO8601). For example, if you convert CSV data into Parquet using the pyarrow timestamp data type, just have the CSV data in this format:

No time zone

YYYY-MM-DDTHH:MI:SS.FF6

With time zone

YYYY-MM-DDTHH:MI:SS.FF6TZH:TZM