3
import pandas as pd
import numpy as np

a = list("ABC") * 4
value = np.random.randint(-5, 5, 12)
df = pd.DataFrame({"A": a, "value": value})
print(df)

    A  value
0   A     -5
1   B      1
2   C      0
3   A      0
4   B     -4
5   C     -1
6   A      2
7   B      4
8   C     -5
9   A      0
10  B      1
11  C     -1

My goal is to create a third column with quantile labels based on custom quantiles ranges. These ranges are calculated on GroupBy objects based the first column. My use case is to calculate the bottom 10% and the top 10% values (i.e. deciles) for each group and then label them in a new column accordingly: 'bottom_decile', 'mid_deciles', 'top_decile'

Desired outcome:

    A  value              C
0   A     -5  bottom_decile
1   B      1    mid_deciles
2   C      0     top_decile
3   A      0    mid_deciles
4   B     -4  bottom_decile
5   C     -1    mid_deciles
6   A      2     top_decile
7   B      4     top_decile
8   C     -5  bottom_decile
9   A      0    mid_deciles
10  B      1    mid_deciles
11  C     -1    mid_deciles

This is my attempt:

df['C'] = df.groupby(['A'])['value'].transform(lambda x: pd.qcut(x, [0, 0.1, 0.9, 1], labels=['bottom_decile', 'mid_deciles', 'top_decile']))

However, the result raises an error:

ValueError: Bin edges must be unique: array([-5. , -3.8,  2. ,  2. ]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Mark P
  • 101
  • 9

2 Answers2

1

You can drop duplicate edges:

df['C'] = df.groupby('A')['value'].transform(pd.qcut,
                                   q=[0, 0.1, 0.9, 1],
                                   labels=['bottom_decile', 'mid_deciles', 'top_decile'],
                                   duplicates='drop')

print(df.head())

Output:

   A  value              C
0  A     -5  bottom_decile
1  B      1    mid_deciles
2  C      0     top_decile
3  A      0    mid_deciles
4  B     -4  bottom_decile
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
-1

I'm a beginner but I'm trying to practice so that is my solution:

import pandas as pd
import numpy as np

a = list("ABC") * 4
value = np.random.randint(-5, 5, 12)
df = pd.DataFrame({"A": a, "value": value})
print(df)

 A  value
0   A      4
1   B     -2
2   C      1
3   A     -3
4   B     -3
5   C      1
6   A     -2
7   B      1
8   C      4
9   A     -2
10  B      0
11  C     -1

Then I wrote this function:

def func(x):
   return pd.DataFrame(pd.qcut(x, [0.0, 0.1, 0.9, 1], labels=['bottom_decile', 'mid_deciles', 'top_decile']))

Then I did that:

df['C'] = func(df.groupby(['A'])['value'].head(len(df.index)))

Then the final output:

df.head()
A   value   C
0   A   4   top_decile
1   B   -2  mid_deciles
2   C   1   mid_deciles
3   A   -3  bottom_decile
4   B   -3  bottom_decile

Or you can remove duplicates as ValueError said

I hope it works well!