
I am trying to use the `describe()` and `unstack()` functions in dask to get summary statistics of my data.

However, I get an error, as shown below.

import dask.dataframe as dd
df = dd.read_csv('Measurement_table.csv',assume_missing=True)
df.describe().compute()  # this works, but when I try to use `unstack`, I get an error

What I am actually trying to do is make the pandas code below run faster with the help of dask:

# wrapped in parentheses so the chained calls parse as one expression
(df.groupby(['person_id','measurement_concept_id','visit_occurrence_id'])['value_as_number']
   .describe()
   .unstack()
   .swaplevel(0, 1, axis=1)
   .reindex(df['readings'].unique(), axis=1, level=0))
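For context, here is a minimal, self-contained pandas sketch of the `describe()` + `unstack()` step (the values and most data are made up; only the column names mirror the question):

```python
import pandas as pd

# Toy data standing in for Measurement_table.csv (made-up values).
df = pd.DataFrame({
    'person_id': [1, 1, 2, 2],
    'measurement_concept_id': [10, 10, 10, 20],
    'visit_occurrence_id': [100, 100, 200, 200],
    'value_as_number': [5.0, 7.0, 3.0, 9.0],
})

# groupby + describe yields one row of summary stats per group;
# unstack() then pivots the innermost index level (visit_occurrence_id)
# into a second column level.
stats = (df.groupby(['person_id', 'measurement_concept_id',
                     'visit_occurrence_id'])['value_as_number']
           .describe()
           .unstack())
print(stats)
```

In pandas this whole chain works; the question is only about reproducing it in dask.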

I tried adding `compute()` to the end of the chain, as shown below:

df1 = (df.groupby(['person_id','measurement_concept_id','visit_occurrence_id'])['value_as_number']
         .describe()
         .unstack()
         .swaplevel(0, 1, axis=1)
         .reindex(df['readings'].unique(), axis=1, level=0)
         .compute())

This raises an error, although the same code works fine in pandas.


Can anyone help me fix this issue?

The Great

1 Answer


In dask, `unstack` is not implemented, but `describe` can be used via `apply`:

df = (sd.groupby(['subject_id','readings'])['val']
        .apply(lambda x: x.describe())
        .reset_index()
        .rename(columns={'level_2':'func'})
        .compute()
        )
print(df)
    subject_id readings   func        val
0            1   READ_1  count   2.000000
1            1   READ_1   mean   6.000000
2            1   READ_1    std   1.414214
3            1   READ_1    min   5.000000
4            1   READ_1    25%   5.500000
..         ...      ...    ...        ...
51           4  READ_09    min  45.000000
52           4  READ_09    25%  45.000000
53           4  READ_09    50%  45.000000
54           4  READ_09    75%  45.000000
55           4  READ_09    max  45.000000

[112 rows x 4 columns]
jezrael
  • Can't we do this via melt in Dask? – The Great Oct 17 '19 at 06:11
  • @SSMK - Yes - https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.reshape.melt – jezrael Oct 17 '19 at 06:12
  • @SSMK - I also tried `pivot_table`, like `df = (sd.groupby(['subject_id','readings'])['val'] .apply(lambda x: x.describe()) .reset_index() .rename(columns={'level_2':'func'}) .assign(readings = lambda x: pd.Categorical(x['readings'] + '_' + x['func'])) .pivot_table(index='subject_id', columns='readings', values='val') .compute() )`, but got `NotImplementedError: Series getitem in only supported for other series objects with matching partition structure` – jezrael Oct 17 '19 at 07:14
  • 1
    I understand. Thank you for your time and help.Much appreciated – The Great Oct 17 '19 at 07:18
  • I see that `dask` is also a bit slow; it still takes time. I guess I should do it in the source table itself using SQL. – The Great Oct 17 '19 at 07:28
  • @SSMK - Yes, large data in pandas needs a lot of RAM, which is the main reason for the slowness. – jezrael Oct 17 '19 at 07:29
  • Okay, you mean my system config is not good enough to process this fast. Am I right? – The Great Oct 17 '19 at 07:30
  • @SSMK - It depends on your system, but working with large data needs 32 or 64 GB of RAM, so the code should run on a server rather than a notebook or PC (which usually have only 4, 8, or 16 GB). – jezrael Oct 17 '19 at 07:31
  • Though my required output is different, thank you jezrael for trying this. Since the dask package doesn't have the unstack operation, I am marking jezrael's answer as the solution for his efforts. If anyone manages to do unstack, I can update it later. – The Great Oct 18 '19 at 01:06