I'm testing feather-format as a way to store pandas DataFrames. Feather's write performance seems to be extremely poor for columns consisting entirely of None (info() reports 0 non-null object). The following code reproduces the issue:

    import pandas as pd

    # A single object column holding 1000 None values
    df1 = pd.DataFrame(data={'x': 1000*[None]})
    %timeit df1.to_feather('.../x.feather')
    5.35 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df1.to_pickle('.../x.pkl')
    734 ms ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit df1.to_parquet('.../x.parquet')
    200 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm using feather-format 0.4.0, pandas 0.23.4, and pyarrow 0.13.0.

How can I get these kinds of DataFrames to save without taking forever?

Tal Fishman
  • What is your question? – harvpan Sep 10 '19 at 19:27
  • Thanks, just added the question, as I guess it's not obvious "How can I get these kinds of DataFrames to save without taking forever?" – Tal Fishman Sep 10 '19 at 19:48
  • Can you update to a newer version of pandas and pyarrow? The versions you used are not the latest ones and the problem could have been fixed nowadays. – Uwe L. Korn Sep 11 '19 at 07:12
  • The issue has nothing to do with pandas itself, and I'm on the latest conda approved version of pyarrow. – Tal Fishman Sep 11 '19 at 12:54
  • The same can be seen on latest master of pyarrow. I opened https://issues.apache.org/jira/browse/ARROW-6529 to track this, as I agree that difference seems strange. – joris Sep 11 '19 at 13:36

1 Answer

You could try giving the column a specific dtype. That said, it is a little surprising how poorly feather performs in these numbers.
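To illustrate the dtype suggestion, here is a minimal sketch that casts every all-None object column to an explicit dtype before writing. The helper name `cast_all_null_columns` and the choice of `float64` are assumptions made for illustration, not something from the question, and the thread does not confirm that this actually speeds up the feather write.

```python
import pandas as pd


def cast_all_null_columns(df, dtype="float64"):
    """Return a copy of df where all-null object columns get an explicit dtype.

    Hypothetical helper; float64 is an assumed target dtype (None becomes NaN).
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Only touch object columns that contain nothing but missing values
        if col.dtype == object and col.isna().all():
            out[name] = col.astype(dtype)
    return out


df1 = pd.DataFrame(data={'x': 1000 * [None]})
df2 = cast_all_null_columns(df1)
print(df2.dtypes)

# df2.to_feather('.../x.feather')  # then time this write as in the question
```

Note that the cast changes what is read back: the column comes back as float64 NaN rather than object None, so this only fits cases where that difference is acceptable.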

Micah Kornfield