3

I have a dataset containing over 6,000 observations, each record having a score ranging from 0-100. Below is a sample:

+-----+-------+
| uID | score |
+-----+-------+
|   1 |    77 |
|   2 |    61 |
|   3 |    74 |
|   4 |    47 |
|   5 |    65 |
|   6 |    51 |
|   7 |    25 |
|   8 |    64 |
|   9 |    69 |
|  10 |    52 |
+-----+-------+

I want to bin them into equal deciles based upon their rank order relative to their peers within the score column with cutoffs being at every 10th percentile, as seen below:

+-----+-------+-----------+----------+
| uID | score | position% | scoreBin |
+-----+-------+-----------+----------+
|   7 |    25 | 0.1       |        1 |
|   4 |    47 | 0.2       |        2 |
|   6 |    51 | 0.3       |        3 |
|  10 |    52 | 0.4       |        4 |
|   2 |    61 | 0.5       |        5 |
|   8 |    64 | 0.6       |        6 |
|   5 |    65 | 0.7       |        7 |
|   9 |    69 | 0.8       |        8 |
|   3 |    74 | 0.9       |        9 |
|   1 |    77 | 1         |       10 |
+-----+-------+-----------+----------+

So far I've tried cut, cut2, tapply, etc. I think I'm on the right logic path, but I have no idea on how to apply them to my situation. Any help is greatly appreciated.

Jrausch2
  • 33
  • 1
  • 5

3 Answers3

9

I would use ntile() in dplyr.

library(dplyr)

score<-c(77,61,74,47,65,51,25,64,69,52)
ntile(score, 10)

##[1] 10  5  9  2  7  3  1  6  8  4

scoreBin<- ntile(score, 10)
Bryan Goggin
  • 2,449
  • 15
  • 17
  • Well, that was deceptively simple. Thanks a ton Bryan! Other answers are great too, so this will be a good reference post for me. – Jrausch2 Jun 06 '16 at 19:35
1

In base R we can use a combination of .bincode() and quantile():

df$new <- .bincode(df$score, 
               breaks = quantile(df$score, seq(0, 1, by = 0.1)),
               include.lowest = TRUE)
#   uID score new
#1    1    77  10
#2    2    61   5
#3    3    74   9
#4    4    47   2
#5    5    65   7
#6    6    51   3
#7    7    25   1
#8    8    64   6
#9    9    69   8
#10  10    52   4
mtoto
  • 23,919
  • 4
  • 58
  • 71
1

Here is a method that uses quantile together with cut to get the bins:

df$scoreBin <- as.integer(cut(df$score,
                      breaks=quantile(df$score, seq(0,1, .1), include.lowest=T)))

as.integer coerces the output of cut (which is a factor) into the underlying integer.

One way to get the position percent is to use rank:

df$position <- rank(df$score) / nrow(df)
lmo
  • 37,904
  • 9
  • 56
  • 69