Feather read failure from pandas with pd.cut data

Question

The following write/read round trip through Feather fails with this error:

ArrowInvalid: Ran out of field metadata, likely malformed

import pandas as pd

dg = pd.DataFrame({'A': pd.cut((1, 2, 3), bins=[0, 1, 2])})
file_name = 'myfile.feather'
dg.to_feather(file_name)
dg = pd.read_feather(file_name)  # raises ArrowInvalid

(I'm surprised that the to_feather step doesn't complain.)

I checked that the following works fine, which suggests this is not a general limitation of categorical columns; it must be related to the type returned by pd.cut.

dg = pd.DataFrame({"A": list("123321")}, dtype="category")
file_name = 'myfile.feather'
dg.to_feather(file_name)
dg = pd.read_feather(file_name)
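
One way to see how the two columns differ is to inspect their categories. A minimal sketch, using illustrative names (cut_col, plain_col) that are not part of the original code:

```python
import pandas as pd

# Column built the same way as the failing example.
cut_col = pd.Series(pd.cut([1, 2, 3], bins=[0, 1, 2]))
# Column built the same way as the working example.
plain_col = pd.Series(list("123321"), dtype="category")

# Both are categorical, but the categories themselves have different types.
print(type(cut_col.cat.categories))    # categories of the pd.cut column
print(type(plain_col.cat.categories))  # categories of the plain column
print(cut_col.cat.categories.dtype, plain_col.cat.categories.dtype)
```

The pd.cut column carries an IntervalIndex as its categories, whereas the second column has plain string categories, so the interval-valued categories are a plausible place to look for the failure.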

I really don't want to lose the categorical metadata produced by the cut when I persist the DataFrame through Feather. Any recommendations?

I am using pandas 1.5.3, pyarrow 11.0.0 and Python 3.8.13 on Linux.

score 0 · Answer 1 · answered Jul 25 '23 at 15:10

I hope this works for your case. You have to convert the column to strings and replace ( with [.

import pandas as pd

dg = pd.DataFrame({'A': pd.cut((1, 2, 3), bins=[0, 1, 2])})
print(dg['A'][0])
# convert the intervals to strings and replace ( with [
# regex=False treats '(' as a literal character; as a regex it is an unterminated group
dg['A'] = dg['A'].astype(str).str.replace('(', '[', regex=False)
print(dg['A'][0])
file_name = 'myfile.feather'
dg.to_feather(file_name)
dg = pd.read_feather(file_name)
print(dg)

Muhammad Ali


What exactly is that magic about? Is it basically converting the values to strings? I tried it and ended up with `Name: A, dtype: object`, so I have lost the categorical nature of the data in the process. – gg99 Jul 25 '23 at 15:56
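
If string labels are acceptable, the categorical dtype can be kept by renaming the categories instead of casting the whole column to str. A hedged sketch (the name labelled is illustrative; that the result round-trips through Feather is assumed from the string-category example in the question, not tested here):

```python
import pandas as pd

cut_col = pd.Series(pd.cut([1, 2, 3], bins=[0, 1, 2]))

# Replace each Interval category with its string form; the column stays
# categorical and keeps the ordering that pd.cut assigned to the bins.
labelled = cut_col.cat.rename_categories(lambda interval: str(interval))

print(labelled.dtype)
print(list(labelled.cat.categories))
```

Values that fall outside the bins (here the 3) stay missing rather than becoming the string 'nan', which is what astype(str) would produce.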

score 0 · Answer 2 · answered Jul 25 '23 at 16:01

I figured out two things:

1. It seems to be a tracked bug (unresolved as of this post): https://github.com/apache/arrow/issues/20196
2. A somewhat decent workaround (it preserves the ordering etc.) is to pass the labels= argument to pd.cut. So I came up with this wrapper that sets the labels automatically:

def my_cut(series, bins):
    # bins must be a sequence of edges; each bin is labelled 0..n-1 in order
    return pd.cut(series, bins=bins, labels=list(range(len(bins) - 1)))

Usage (fairly transparent):

dg = pd.DataFrame({'A': my_cut((1, 2, 3), bins=[0, 1, 2])})
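
The integer labels drop the interval information, but it can be rebuilt after reading the file back as long as the same bin edges are still available. A sketch under that assumption (coded and restored are illustrative names, and the Feather round trip itself is left out):

```python
import pandas as pd

bins = [0, 1, 2]

# Integer-labelled cut, equivalent to what the my_cut wrapper returns.
coded = pd.Series(pd.cut([1, 2, 3], bins=bins, labels=list(range(len(bins) - 1))))

# Rebuild the intervals from the bin edges; from_breaks is right-closed by
# default, which matches the default of pd.cut.
intervals = pd.IntervalIndex.from_breaks(bins)

# Swap the integer labels for the intervals, position by position.
restored = coded.cat.rename_categories(intervals)
print(restored)
```

This relies on the labels being assigned in bin order, which is how the wrapper generates them.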
