I have a cross-sectional dataset with many variables, including "wealth". I created a decile variable that splits wealth into 10 groups as follows:

df['decile'] = pd.qcut(df['wealth'], 10, labels=False)

My df looks like this (note that some variables contain NaNs):

ID   wealth   age   income   ... many many variables ...   decile
A     10000    30     4000                                      5
B        10    19      500                                      1
C   1000000    37     6000                                      9
D      2842    22        0                                      4
E    399932    44      NaN                                      8
F      2344    19        0                                      4
G      5000    18        0                                      4
H
I
..

I want to build a summary table for variables of my choosing that compares the bottom decile (decile=0) with the top decile (decile=9) and shows the mean, median and std for each.

Desired output:

          bottom decile           top decile         difference in means
        mean  median  std      mean  median  std         
wealth  ..     ..      ..       ..     ..     ..             .. *** (if statistically significant)
age     ..     ..      ..       ..     ..     ..             .. **
income  ..     ..      ..       ..     ..     ..
..
..
..

Is there an easy way to do this in Python, instead of calculating each statistic individually?

Olive

1 Answer

What about this:

df.groupby('decile').describe().T

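This returns the count, mean, std, min, quartiles (the 50% row is the median) and max of every numeric column, with one row per (variable, statistic) pair and one column per decile.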
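If you only need the two extreme deciles side by side with a test on the difference in means, something along these lines might work. This is a minimal sketch, not a definitive implementation: it assumes SciPy is installed and that a Welch t-test (unequal variances) is an acceptable test here. The helper names `decile_summary` and `stars`, the star thresholds (1%, 5%, 10%) and the demo data are all illustrative assumptions, not something from the question.

```python
import numpy as np
import pandas as pd
from scipy import stats


def stars(p):
    """Significance stars; the 1% / 5% / 10% thresholds are an assumed convention."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""


def decile_summary(df, variables, group_col="decile", low=0, high=9):
    """Mean/median/std for two groups plus a Welch t-test on the means (hypothetical helper)."""
    bottom = df.loc[df[group_col] == low, variables]
    top = df.loc[df[group_col] == high, variables]
    stat_names = ["mean", "median", "std"]

    rows = {}
    for var in variables:
        a = bottom[var].dropna()  # drop NaNs explicitly before testing
        b = top[var].dropna()
        _, p = stats.ttest_ind(b, a, equal_var=False)  # Welch t-test
        rows[var] = {"diff": b.mean() - a.mean(), "p-value": p, "sig": stars(p)}
    tests = pd.DataFrame.from_dict(rows, orient="index")

    # pandas aggregations skip NaNs by default, so no extra handling is needed here.
    return pd.concat(
        {
            "bottom decile": bottom.agg(stat_names).T,
            "top decile": top.agg(stat_names).T,
            "difference in means": tests,
        },
        axis=1,
    )


# Made-up demo data, only to show the shape of the result.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame(
    {
        "wealth": np.arange(1, n + 1) * 100.0,
        "age": rng.integers(18, 70, size=n),
        "income": rng.normal(3000, 1000, size=n),
    }
)
demo.loc[::7, "income"] = np.nan  # some missing values, as in the question
demo["decile"] = pd.qcut(demo["wealth"], 10, labels=False)

table = decile_summary(demo, ["wealth", "age", "income"])
print(table.round(2))
```

The result has one row per variable and two-level columns (group, statistic), so `table["top decile"]` or `table[("difference in means", "sig")]` selects the part you need. If a t-test is not appropriate for your data, swap in another test inside the loop.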

Andrea Ierardi