How to bin ordered data by percentile for each id in R dataframe [r]

Question

I have dataframe that contains 70-80 rows of ordered response time (rt) data for each of 228 people each with a unique id# (everyone doesn't have the same amount of rows). I want to bin each person's RTs into 5 bins. I want the 1st bin to be their fastest 20 percent of RTs, 2nd bin to be their next fastest 20 percent RTs, etc., etc. Each bin should have the same amount of trials in it (unless the total # of trial is odd).

My current dataframe looks like this:

I'd like my new dataframe to look like this:

id   RT    Bin
7000  225    1
7000  250    1

After getting my data to look like this, I will aggregate by id and bin

The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in how to streamline this process. Thanks.

score 4 · Accepted Answer · answered Oct 06 '11 at 15:43

The answer @Chase gave split the range into 5 groups of equal length (difference of endpoints). What you seem to want is pentiles (5 groups with equal number in each group). For that, you need the cut2 function in Hmisc

library("plyr")
library("Hmisc")

dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))

tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))

tmp now has what you want

> tmp
    id       value hists
1    1  0.19016791     3
2    1  0.27795226     4
3    1  0.74350982     5
4    1  0.43459571     4
5    1 -2.72263322     1
....
95  10 -0.10111905     3
96  10 -0.28251991     2
97  10 -0.19308950     2
98  10  0.32827137     4
99  10 -0.01993215     4
100 10 -1.04100991     1

With the same number in each hists for each id

> table(tmp$id, tmp$hists)

     1 2 3 4 5
  1  2 2 2 2 2
  2  2 2 2 2 2
  3  2 2 2 2 2
  4  2 2 2 2 2
  5  2 2 2 2 2
  6  2 2 2 2 2
  7  2 2 2 2 2
  8  2 2 2 2 2
  9  2 2 2 2 2
  10 2 2 2 2 2

yes! I was right, thanks for fleshing this out. (see my comment below if that wasn't clear) — Chase, Oct 06 '11 at 15:44
Sorry, @Chase, I didn't see your comment before I started composing this. — Brian Diggs, Oct 06 '11 at 16:00

score 3 · Answer 2 · answered Oct 06 '11 at 02:40

3

Here's a reproducible example using package plyr and the cut function:

dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))

ddply(dat, "id", transform, hists = cut(value, breaks = 5))

    id       value             hists
1    1 -1.82080027     (-1.94,-1.41]
2    1  0.11035796     (-0.36,0.166]
3    1 -0.57487134    (-0.886,-0.36]
4    1 -0.99455189    (-1.41,-0.886]
....
96  10 -0.03376074    (-0.233,0.386]
97  10 -0.71879488   (-0.853,-0.233]
98  10 -0.17533570    (-0.233,0.386]
99  10 -1.07668282    (-1.47,-0.853]
100 10 -1.45170078    (-1.47,-0.853]

Pass in labels = FALSE to cut if you want simple integer values returned instead of the bins.

answered Oct 06 '11 at 02:40

Chase

67,710
18
144
161

Perhaps my question is not clear, for each id I'd like an equal number of trials broken into 5 bins. For example, a person with 80 rts would have 5 bins of 16 trials with bin 1 being the fastest 16 trials and bin 5 being the slowest. Do you know of a way to accomplish this? – Matt Oct 06 '11 at 13:59
@Matt Chase's answer does precisely what you describe. Have you even tried it? – joran Oct 06 '11 at 15:06
I did. I appreciate Chases's solution. But I was unclear in what I was initially asking for. I need each bin to be composed of the same amount of rts. In Chases solution, the hist creates 5 bins but each is composed of a different amount of rts. – Matt Oct 06 '11 at 15:18
1

@Matt - ahh, gotcha, sorry I misunderstood. I dont' have time at the moment, but I'm 99% sure that the `cut2` function in package `Hmisc` has an additional parameter that does precisely this for you, I believe it is the `g` argument. If not, I'll work something up later today. – Chase Oct 06 '11 at 15:35

John · Answer 3 · 2011-10-06T16:28:55.613

0

Here's an answer in plain old R.

#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], rep(20)) )

#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep( unlist( tapply( df$rt, df$id, quantile )), each = 4)

You'll note that quantile command used can be set to use any quantiles. The defaults are for quintiles but if you want deciles then use

quantile(x, seq(0, 1, 0.1))

in the function above.

The answer above is a bit fragile. It requires equal numbers of RTs/id and I didn't tell you how to get to the magic number 4. But, it also will run very fast on a large dataset. If you want a more robust solution in base R.

library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) cut2(df$rt[df$id==x], g = 5) ))

This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.

edited Oct 06 '11 at 16:28

answered Oct 06 '11 at 04:10

John

23,360
7
57
83

What are the values returned in this aggregate command? What I want is 5 values for each id with the first value being the mean of that id's fastest x amount of trials, the next bin being their next fastest x amount of trials. Each bin should be composed of same amount of trials (when the total trial count is even). – Matt Oct 06 '11 at 14:08
see revisions that should do what you want – John Oct 06 '11 at 15:17
Thanks, John. I've run your revised solution. What I don't understand is why the rts are not ascending. The mean of the values contained within the cut points should ascend as the cut points get larger, no? – Matt Oct 06 '11 at 15:36
Which solution, the fragile fast one or more robust (faster than ddply) one? – John Oct 06 '11 at 16:12

How to bin ordered data by percentile for each id in R dataframe [r]

3 Answers3