I have many scores in a column of a DataFrame named example. I want to split these scores into deciles and assign the corresponding decile interval to each row. I tried the following:

import random
import pandas as pd
random.seed(420) #blazeit
example = pd.DataFrame({"Score":[random.randrange(350, 1000) for i in range(1000)]})
example["Decile"] = pd.qcut(example["Score"], 10, labels=False) + 1 # Deciles as integer from 1 to 10
example["Decile_interval"] = pd.qcut(example["Score"], 10) # Decile as interval

This gives me the deciles I'm looking for. However, I would like the interval endpoints in example["Decile_interval"] to be integers, not floats. I tried precision=0, but it just shows .0 at the end of each number.

How can I transform the floats in the intervals to integers?

EDIT: As pointed out by @ALollz, rounding the endpoints changes the decile distribution. However, I am doing this for presentation purposes, so I am not worried about that. Props to @JuanC for realizing this and posting one solution.

Arturo Sbr
  • Well if you round the endpoints to integers you'll no longer have deciles... So what's more important? – ALollz Sep 09 '19 at 15:16
  • @ALollz I'd rather have rounded intervals than exact deciles. An alternative would be to create a new column that simply printed the intervals as integers while keeping the true values in the original column. – Arturo Sbr Sep 09 '19 at 15:25
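
The alternative mentioned in the last comment (keep the exact intervals and add a separate, display-only column) could look roughly like the sketch below. The helper `as_int_label` and the column name `Decile_label` are made up for illustration and do not come from the question or the answers.

```python
import random
import pandas as pd

random.seed(420)
example = pd.DataFrame({"Score": [random.randrange(350, 1000) for _ in range(1000)]})
example["Decile_interval"] = pd.qcut(example["Score"], 10)  # exact decile intervals, left untouched

def as_int_label(interval):
    """Format an interval as text with endpoints rounded to integers (hypothetical helper)."""
    left = int(round(interval.left))
    right = int(round(interval.right))
    return f"({left}, {right}]"

# Display-only column of strings; the true intervals stay in "Decile_interval"
example["Decile_label"] = [as_int_label(iv) for iv in example["Decile_interval"]]

print(example.head())
```

Because the labels are plain strings, they are only suitable for presentation; any filtering or grouping that needs the real bin edges should keep using `Decile_interval`.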

2 Answers

This is my solution using a simple apply function:

example["Decile_interval"] = example["Decile_interval"].apply(lambda x: pd.Interval(left=int(round(x.left)), right=int(round(x.right))))
Massifox

There might be a better solution, but this works:

import numpy as np

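# Round each category's endpoints to the nearest integer
int_categories = [pd.Interval(int(np.round(i.left)), int(np.round(i.right))) for i in example["Decile_interval"].cat.categories]
# Direct assignment to .cat.categories fails in recent pandas (2.0+); rename_categories does the same job
example["Decile_interval"] = example["Decile_interval"].cat.rename_categories(int_categories)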

Output:

0      (350, 418]
1      (680, 740]
2      (606, 680]
3      (740, 798]
4      (418, 474]
5      (418, 474]
.           .
Juan C
  • The only issue is that `pd.qcut` is slightly smarter and knows to change the left most bin to be 349.999, that way `350` gets grouped and not excluded. – ALollz Sep 09 '19 at 15:28
  • It seems this change is mostly for presentation purposes so total accuracy of the intervals isn't very relevant to OP, but that's a good point nevertheless – Juan C Sep 09 '19 at 15:31
  • @ALollz That's right, this is more for presentation purposes. This solution works. – Arturo Sbr Sep 09 '19 at 15:42
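
To see the point raised in the comments about the leftmost bin, here is a small check using the question's data. It is only a sketch: the names `lowest`, `first` and `rounded` are introduced here for illustration. It compares the first bin produced by `pd.qcut` with the same bin after rounding its endpoints.

```python
import random
import pandas as pd

random.seed(420)
example = pd.DataFrame({"Score": [random.randrange(350, 1000) for _ in range(1000)]})
bins = pd.qcut(example["Score"], 10)

lowest = example["Score"].min()
first = bins.cat.categories[0]  # leftmost decile as produced by qcut
rounded = pd.Interval(int(round(first.left)), int(round(first.right)))

# qcut nudges the left edge just below the minimum so the minimum is included
print(first.left < lowest)   # True
print(lowest in first)       # True
# After rounding, the left edge equals the minimum, and the interval is open on the left
print(lowest in rounded)     # False
```

So the rounded intervals read nicely, but taken literally the first one no longer contains the smallest score. That is harmless for display and worth remembering if the rounded intervals are ever reused for binning.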