1

I am attempting to create user groups based on a particluar DataFrame column value. I would like to create 10 user groups of the entire DataFrame's population, based on the total_usage metric. An example DataFrame df is shown below.

user_id   total_usage
1         10
2         10
3         20
4         20
5         30
6         30
7         40
8         40
9         50
10        50
11        60
12        60
13        70
14        70
15        80
16        80
17        90
18        90
19        100
20        100

The df is just a snippet of the entire DataFrame which is over 6000 records long, however I would like like to only have 10 user groups.

An example of my desired output is shown below.

user_id   total_usage  user_group
1         10           10th_group
2         10           10th_group
3         20           9th_group
4         20           9th_group
5         30           8th_group
6         30           8th_group
7         40           7th_group
8         40           7th_group
9         50           6th_group
10        50           6th_group
11        60           5th_group
12        60           5th_group
13        70           4th_group
14        70           4th_group
15        80           3th_group
16        80           3th_group
17        90           2nd_group
18        90           2nd_group
19        100          1st_group
20        100          1st_group

Any assistance that anyone could provide would be greatly appreciated.

moe_95
  • 397
  • 2
  • 17
  • 1
    what will be `user_group` if `total_usage` is 64? – rnso May 16 '19 at 11:33
  • @rsno I suppose that would depend on how what the other values in the population would be. My dataset has 6000+ records so I am attempting to order these into groups in a programmatic way. – moe_95 May 16 '19 at 11:39
  • @jezrael In this example yes there is. However my full DataFrame is over 6000+ records with hundreds of unique values. – moe_95 May 16 '19 at 11:40
  • @moe_95 - So if 200 unique values need `1st_group` to `200th_group` ? – jezrael May 16 '19 at 11:41
  • @jezrael No, I was hoping to have the entire population split into 10 groups. My full `df` has 6000 records, with the majority of those being unique (due to decimal place). – moe_95 May 16 '19 at 11:45

3 Answers3

2

Use qcut with changed order by negatives and Series.map for 1.st and 2.nd values:

s =  pd.qcut(-df['total_usage'], np.arange(0,1.1, 0.1), labels=False) + 1
d = {1:'st', 2:'nd'}
df['user_group'] = s.astype(str) + s.map(d).fillna('th') + '_group'
print (df)
    user_id  total_usage  user_group
0         1           10  10th_group
1         2           10  10th_group
2         3           20   9th_group
3         4           20   9th_group
4         5           30   8th_group
5         6           30   8th_group
6         7           40   7th_group
7         8           40   7th_group
8         9           50   6th_group
9        10           50   6th_group
10       11           60   5th_group
11       12           60   5th_group
12       13           70   4th_group
13       14           70   4th_group
14       15           80   3th_group
15       16           80   3th_group
16       17           90   2nd_group
17       18           90   2nd_group
18       19          100   1st_group
19       20          100   1st_group
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks so much for your help, I really appreciate it. When running on my entire DataFrame I am getting the following error `ValueError: Bin edges must be unique: array([-451., -19., -13., -10., -8., -7., -6., -5., -4.,-4., -1.]). You can drop duplicate edges by setting the 'duplicates' kwarg` – moe_95 May 16 '19 at 12:04
2

Looks like you are looking for qcut, but in reverse order

df['user_group'] = 10 - pd.qcut(df['total_usage'], np.arange(0,1.1, 0.1)).cat.codes

Output, it's not ordinal, but I hope it will do:

0     10
1     10
2      9
3      9
4      8
5      8
6      7
7      7
8      6
9      6
10     5
11     5
12     4
13     4
14     3
15     3
16     2
17     2
18     1
19     1
dtype: int8
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thanks so much for your help. When running on my entire DataFrame I am getting the following error `ValueError: Bin edges must be unique: array([-451., -19., -13., -10., -8., -7., -6., -5., -4.,-4., -1.]). You can drop duplicate edges by setting the 'duplicates' kwarg` – moe_95 May 16 '19 at 12:03
  • 1
    @moe_95 see [this question](https://stackoverflow.com/questions/20158597/how-to-qcut-with-non-unique-bin-edges) – Quang Hoang May 16 '19 at 12:05
  • Thanks very much, I'll take a look now – moe_95 May 16 '19 at 12:08
0

Try using pd.Series with np.repeat, np.arange, pd.DataFrame.groupby, pd.Series.astype, pd.Series.map and pd.Series.fillna:

x = df.groupby('total_usage')
s = pd.Series(np.repeat(np.arange(len(x.ngroups), [len(i) for i in x.groups.values()]) + 1)
df['user_group'] = (s.astype(str) + s.map({1: 'st', 2: 'nd'}).fillna('th') + '_Group').values[::-1]

And now:

print(df)

Is:

    user_id  total_usage  user_group
0         1           10  10th_Group
1         2           10  10th_Group
2         3           20   9th_Group
3         4           20   9th_Group
4         5           30   8th_Group
5         6           30   8th_Group
6         7           40   7th_Group
7         8           40   7th_Group
8         9           50   6th_Group
9        10           50   6th_Group
10       11           60   5th_Group
11       12           60   5th_Group
12       13           70   4th_Group
13       14           70   4th_Group
14       15           80   3th_Group
15       16           80   3th_Group
16       17           90   2nd_Group
17       18           90   2nd_Group
18       19          100   1st_Group
19       20          100   1st_Group
U13-Forward
  • 69,221
  • 14
  • 89
  • 114