3

Trying to use pivot_table in dask while maintaining a sorted index. I have a simple pandas dataframe that looks something like this:

# make dataframe, fist in pandas and then in dask
df = pd.DataFrame({'A':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a','b', 'c'], 'dist': [0, .1, .2, .1, 0, .3, .4, .1,  0]})

df.sort_values(by='A', inplace=True)
dd = dask.dataframe.from_pandas(df, chunksize=3)  # just for demo's sake, you obviously don't ever want a chunksize of 3
print(dd.known_divisions)  # Here I get True, which means my data is sorted

# now pivot and see if the index remains sorted
dd = dd.categorize('B')
pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
print(pivot_dd.known_divisions) # Here I get False, which makes me sad

I would love to find a way to get pivot_dd to have a sorted index, but I don't see a sort_index method in dask and cannot set 'A' as an index w/out getting a key error (it already is the index!).

In this toy example, I could pivot the pandas table first and then sort. The real application I have in mind won't allow me to do that.

Thanks in advance for any help/suggestions.

benten
  • 1,995
  • 2
  • 23
  • 38

1 Answers1

2

This may not be what you were wishing for, and perhaps not even the best answer, but it does seem to work. The first wrinkle, is that pivot operations create a categorical index for the columns, which is annoying. You could do the following.

>>> pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
>>> pivot_dd.columns = list(pivot_dd.columns)
>>> pivot_dd = pivot_dd.reset_index().set_index('A', sorted=True)
>>> pivot_dd.known_divisions
True
mdurant
  • 27,272
  • 5
  • 45
  • 74