Pandas' feather format is slow when writing a column of None

Question

I'm testing feather-format as a way to store pandas DataFrames. Feather's write performance seems extremely poor for columns consisting entirely of None (info() reports 0 non-null object). The following code reproduces the issue:

    import pandas as pd

    df1 = pd.DataFrame(data={'x': 1000*[None]})
    %timeit df1.to_feather('.../x.feather')
    5.35 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df1.to_pickle('.../x.pkl')
    734 ms ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df1.to_parquet('.../x.parquet')
    200 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm using feather-format 0.4.0, pandas 0.23.4, and pyarrow 0.13.0.

How can I get these kinds of DataFrames to save without taking forever?

Thanks, just added the question, as I guess it's not obvious "How can I get these kinds of DataFrames to save without taking forever?" — Tal Fishman, Sep 10 '19 at 19:48
Can you update to a newer version of pandas and pyarrow? The versions you used are not the latest, and the problem may have been fixed since. — Uwe L. Korn, Sep 11 '19 at 07:12
The issue has nothing to do with pandas itself, and I'm on the latest conda approved version of pyarrow. — Tal Fishman, Sep 11 '19 at 12:54
The same can be seen on latest master of pyarrow. I opened https://issues.apache.org/jira/browse/ARROW-6529 to track this, as I agree that difference seems strange. — joris, Sep 11 '19 at 13:36

score 0 · Answer 1 · answered Sep 11 '19 at 03:02

You could try giving the column a specific dtype. That said, the numbers are surprising: feather should not be this much slower than pickle or parquet.
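As a rough sketch of what "a specific dtype" could look like for the frame in the question: the all-None column is stored as `object`, and casting it to `float64` turns every None into NaN. The choice of `float64` is an assumption for illustration (any dtype that suits the real data would do), and whether this actually avoids the slow write path has not been measured here.

```python
import pandas as pd

# Same frame as in the question: one column holding only None.
df1 = pd.DataFrame(data={'x': 1000 * [None]})

# With no usable values to infer from, pandas stores the column as object.
original_dtype = df1['x'].dtype

# Cast to an explicit dtype before writing. float64 is an assumed choice;
# the None entries become NaN in the converted column.
df2 = df1.astype({'x': 'float64'})
converted_dtype = df2['x'].dtype
```

After the cast, `df2.to_feather(...)` can be timed the same way as in the question to check whether the write time improves.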

Micah Kornfield
