0

I want to divide my data into quantiles (quintiles to be more specific). However, when using qcut in Pandas or ntile in R, I either get the error that "bin edges must be unique" (qcut) or equal values get assigned to different quintiles (ntile).

Looking at the following data: [1, 8, 2, 1, 8] I would like the ouput to be:

  • 1 assigned to quintile 1
  • 8 assigned to quintile 5
  • 2 assigned to quintile 3
  • 1 assigned to quintile 1 again
  • 8 assigned to quintile 5 again

In this specific case the 1s could also be assigned to quintile 2 instead of 1 (depending on definition). It is just important that equal values are assigned to the same quintile.

Do you have any ideas on how I can achieve this?

Thank you very much for your input!

Some data to use:

import pandas as pd
df = pd.DataFrame({'Date': ['2011-01-01', '2011-01-02', '2011-01-03','2011-01-01', '2011-01-02', 
'2011-01-03','2011-01-01', '2011-01-02', '2011-01-03', '2011-01-01','2011-01-02', '2011-01-03',
'2011-01-01', '2011-01-02', '2011-01-03'], 'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'E', 'E', 'E'], 
'Value':[1, 3, 2, 8, 6, 5, 2, 10, 4, 1, 5, 3, 8, 4, 9]})

Code I used in R:

df = df %>% group_by(Date) %>% mutate(Quint = ntile(Value, 5)) 

This assigned qual values to different bins.

Code I used in Python:

df['Quintiles'] = df.groupby(by=["Date"])['Value'].transform(lambda x: pd.qcut(x, 5, labels=["XS", "S", "M", "L", "XL"]))

This resulted in the error of bin edges having to be unique.

Quack
  • 13
  • 4

2 Answers2

2

Update

In this specific case the 1s could also be assigned to quintile 2 instead of 1 (depending on definition). It is just important that equal values are assigned to the same quintile.

cut_rank = lambda x: pd.cut(x.rank(method='min'), 5, labels=["XS", "S", "M", "L", "XL"])
df['Quintiles'] = df.groupby(by=["Date"])['Value'].transform(cut_rank)

Output:

>>> df[df['Date'] == '2011-01-01']
          Date Name  Value Quintiles
0   2011-01-01    A      1        XS
3   2011-01-01    B      8        XL
6   2011-01-01    C      2         L
9   2011-01-01    D      1        XS
12  2011-01-01    E      8        XL

>>> df
          Date Name  Value Quintiles
0   2011-01-01    A      1        XS
1   2011-01-02    A      3        XS
2   2011-01-03    A      2        XS
3   2011-01-01    B      8        XL
4   2011-01-02    B      6         L
5   2011-01-03    B      5         L
6   2011-01-01    C      2         L
7   2011-01-02    C     10        XL
8   2011-01-03    C      4         M
9   2011-01-01    D      1        XS
10  2011-01-02    D      5         M
11  2011-01-03    D      3         S
12  2011-01-01    E      8        XL
13  2011-01-02    E      4         S
14  2011-01-03    E      9        XL

Old answer

You have to use rank with method='first' on each group before qcut:

qcut_rank = lambda x: pd.qcut(x.rank(method='first'), 5, labels=["XS", "S", "M", "L", "XL"])
df['Quintiles'] = df.groupby(by=["Date"])['Value'].transform(qcut_rank)

Output:

>>> df
          Date Name  Value Quintiles
0   2011-01-01    A      1        XS
1   2011-01-02    A      3        XS
2   2011-01-03    A      2        XS
3   2011-01-01    B      8         L
4   2011-01-02    B      6         L
5   2011-01-03    B      5         L
6   2011-01-01    C      2         M
7   2011-01-02    C     10        XL
8   2011-01-03    C      4         M
9   2011-01-01    D      1         S
10  2011-01-02    D      5         M
11  2011-01-03    D      3         S
12  2011-01-01    E      8        XL
13  2011-01-02    E      4         S
14  2011-01-03    E      9        XL

All explanation are here

Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Based on your link in the answer, should this be marked as a duplicate? – ramzeek Feb 07 '22 at 01:18
  • @wikikikitiki. At first intention, I closed the answer and I reopened it after because the groupy_transform. – Corralien Feb 07 '22 at 06:29
  • Unfortunately, this does not solve my problem. Equal values are still assigned to different bins. (See 2011-01-01 for example: the first 8 goes to L the second 8 to XL, same for the first 1 going to XS and the second to S). Is there a way for the 8s to both go to XL, then L is skipped as 40% are already assigned, and then both 1s are assigned to S with XS being skipped again in this case? Only if the 5 numbers completely differ from each other I want them to be assigned to different bins. If they are equal they should all go to the same bin. – Quack Feb 07 '22 at 08:18
  • @Quack. I updated my answer. Can you check it please? – Corralien Feb 07 '22 at 08:35
  • @Corralien Thank you, this is better but is there a way to put 2 into M and the 1s into S? My intuition for the example would be: 40% of the values are in XL (2 out of 5) as the 8s are of equal value. The threshold for L is 80-60% so it will be skipped in this case as 100-60% are already in XL and the next value should go into M (threshold: 60-40%). For the last remaining values (the 1s) they should go into S as S represents 40-20% but they are of equal value. As all values would be assigned then, XS should not appear in the example case. – Quack Feb 07 '22 at 10:15
  • I now tried: ```df['Percentile Rank'] = df.groupby('Date').Value.rank(pct = True)``` which almost gets the correct percentiles. However, instead of assigning the 8s to 100%, it assigns them to 90% and the 1s to 30% instead of 40%. The 2 is correctly assigned to 60%. – Quack Feb 07 '22 at 10:38
  • Do you always have 5 values to rank and 5 labels? – Corralien Feb 07 '22 at 11:00
  • @Corralien No, my actual dataset has about 800 observations per period that need to be ranked. 5 labels would be ideal, but you could also combine XS and S to just Small and XL and L to just Large. I now tried ```df['Percentile Rank2'] = df.groupby('Date').Value.rank(method='max', pct = True)``` which seems to assign the groups correctly. Will this hold for large datasets? – Quack Feb 07 '22 at 11:30
0

A partial answer as it's looking at numpy's descriptions of percentile interpolation methods, and not applying directly to pandas, but in the manual above, you can see the different options for interpolation of the percentiles for a discontinuous dataset.

import numpy as np

x =  [1, 8, 2, 1, 8]

discontinuous_methods = ['linear',
                         'lower',
                         'higher',
                         'midpoint',
                         'nearest']

for method in discontinuous_methods:
    n = [f'{np.percentile(x, i, interpolation = method):.2f}' for i in [20, 40, 60, 80]]
    print(f'{method:8s} - {n[0]:3s}, {n[1]:3s}, {n[2]:3s}, {n[3]:3s}')

linear   - 1.00, 1.60, 4.40, 8.00
lower    - 1.00, 1.00, 2.00, 8.00
higher   - 1.00, 2.00, 8.00, 8.00
midpoint - 1.00, 1.50, 5.00, 8.00
nearest  - 1.00, 2.00, 2.00, 8.00
ramzeek
  • 2,226
  • 12
  • 23