I'm using the cut function to split my data into equal-width groups over its min/max range. Here is an example of the code I am using:

# sample data frame - used to identify initial groups
testdf <- data.frame(a = c(1:100), b = rnorm(100))

# split into groups based on ranges 
k <- 20 # number of groups
# split into groups, keep code
testdf$groupCode <- cut(testdf$b, breaks = k, labels = FALSE)
# store factor information 
testdf$group <- cut(testdf$b, breaks = k)                     
head(testdf)

I want to use the factor groupings identified to split another data frame up, but I'm not sure how to use factors to deal with this. I think my code structure should be roughly as follows:

# this is the data I want to categorize based on previous groupings
datadf <- data.frame(a = c(1:100), b = rnorm(100))
datadf$groupCode <- function(x){return(groupCode)}

I see that the factor data is structured as follows, but I don't know how to use it properly:

testdf$group[0]
factor(0)
20 Levels: (-2.15,-1.91] (-1.91,-1.67] (-1.67,-1.44] (-1.44,-1.2]  ... (2.34,2.58]

Two functions that I have been experimenting with (but which do not work) are as follows:

# get group code 
nearestCode <- function( number, groups ){
  return( which( abs( groups-number )== min( abs(groups-number) ) ) )  
}
nearestCode(7, testdf$group[0])

I have also been experimenting with the which function:

which(7, testdf$group[0])

What is the best way of identifying groupings and applying them to another dataframe?

djq
  • Minor points: `return 1` is a typo maybe? And are you really using `lapply` to assign a single value to a column, or did you mean something else? – joran Aug 09 '11 at 15:36
  • @joran - good points; sorry `return 1` was a typo, but I was trying to sketch out the pseudocode; `lapply` was a misunderstanding of its use which I edited out. – djq Aug 09 '11 at 15:40
  • If you just want to split the range of the data into equal lengths then use `span <- diff(range(x)); breaks <- seq.int(min(x) - span/1000, max(x) + span/1000, length.out = n + 1)`. Then you have a numeric vector to save. That is sort of how `cut` does it and you can type `cut.default` to see the actual code. – IRTFM Aug 09 '11 at 16:31

2 Answers

I would have used:

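# compute the breaks once and keep them, so they can be reused on new data
grpbrks <- quantile(testdf$b, seq(0, 1, by = 0.05), na.rm = TRUE)
# include.lowest = TRUE keeps min(testdf$b) from being coded as NA
testdf$groupCode <- cut(testdf$b, breaks = grpbrks, include.lowest = TRUE)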

Then you can use:

 findInterval(newdat$newvar, grpbrks)   # to group new data

And you then won't need to screw around with recovering the breaks from the labels or the data.

Thinking about it, I guess you could also use:

 cut(newdat$newvar, grpbrks)  # more isomorphic to original categorization I suppose
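
A minimal sketch of how the two calls differ, assuming a hypothetical data frame newdat with a numeric column newvar (neither is defined in the question). findInterval returns integer codes, giving 0 for values below the first break and length(grpbrks) for values above the last one, while cut returns a factor and gives NA for values outside the breaks:

```r
set.seed(42)
testdf <- data.frame(a = 1:100, b = rnorm(100))
grpbrks <- quantile(testdf$b, seq(0, 1, by = 0.05), na.rm = TRUE)

# hypothetical new data: one value below the range, one inside bin 5,
# and one above the range (newdat and newvar are assumed names)
newdat <- data.frame(newvar = c(min(testdf$b) - 1,
                                mean(grpbrks[5:6]),
                                max(testdf$b) + 1))

codes  <- findInterval(newdat$newvar, grpbrks)  # integer codes
groups <- cut(newdat$newvar, grpbrks)           # factor, NA outside the breaks

codes
groups
```

Which one is preferable depends on whether out-of-range values should get their own codes (findInterval) or be flagged as missing (cut).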
IRTFM
  • Thanks @DWin, that is very compact. I am trying to avoid using quantiles though - is there any way I can define the breaks using a range so that I can use `findInterval` ? – djq Aug 09 '11 at 15:46
  • Just answering my own sub-question: `step = (max(testdf$b) - min(testdf$b))/k ; breaks = min(testdf$b) + 0:k * step` – djq Aug 09 '11 at 16:22
  • I guess we collided. I put something very similar in the comments to your question. You will probably be happier if you round the vector. – IRTFM Aug 09 '11 at 16:34
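
The range-based breaks discussed in these comments can be sketched as follows. This assumes k equal-width bins over the range of testdf$b, with the two outer breaks widened slightly so that the minimum and maximum fall inside a bin; newdat and newvar are hypothetical names. It approximates what cut does with a single number for breaks, rather than copying its code:

```r
set.seed(42)
testdf <- data.frame(a = 1:100, b = rnorm(100))
k <- 20  # number of groups

rng  <- range(testdf$b)
span <- diff(rng)

# k + 1 equally spaced break points from min to max
brks <- seq(rng[1], rng[2], length.out = k + 1)
# widen the outer breaks a little so min and max do not sit on an edge
brks[1]     <- rng[1] - span / 1000
brks[k + 1] <- rng[2] + span / 1000

# group the original data with the stored breaks
testdf$groupCode <- findInterval(testdf$b, brks)

# apply the same breaks to hypothetical new data
newdat <- data.frame(newvar = rnorm(100))
newdat$groupCode <- findInterval(newdat$newvar, brks)
```

Because brks is a plain numeric vector, it can be saved and reused without parsing factor labels. New values outside the original range get code 0 or k + 1.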

Screwing around with some regular expressions seems to be the only way of actually recovering the break values from an object returned by cut.

The following code does the necessary screwing:

cut_breaks <- function(x){
  first <- as.numeric(gsub(".{1}(.+),.*", "\\1", levels(x))[1])
  other <- as.numeric(gsub(".+,(.*).{1}", "\\1", levels(x)))
  c(first, other)
}

set.seed(1)
x <- rnorm(100)

cut1 <- cut(x, breaks=20)
cut_breaks(cut1)
 [1] -2.2200 -1.9900 -1.7600 -1.5300 -1.2900 -1.0600 -0.8320 -0.6000 -0.3690
[10] -0.1380  0.0935  0.3250  0.5560  0.7870  1.0200  1.2500  1.4800  1.7100
[19]  1.9400  2.1700  2.4100

levels(cut1)
 [1] "(-2.22,-1.99]"   "(-1.99,-1.76]"   "(-1.76,-1.53]"   "(-1.53,-1.29]"  
 [5] "(-1.29,-1.06]"   "(-1.06,-0.832]"  "(-0.832,-0.6]"   "(-0.6,-0.369]"  
 [9] "(-0.369,-0.138]" "(-0.138,0.0935]" "(0.0935,0.325]"  "(0.325,0.556]"  
[13] "(0.556,0.787]"   "(0.787,1.02]"    "(1.02,1.25]"     "(1.25,1.48]"    
[17] "(1.48,1.71]"     "(1.71,1.94]"     "(1.94,2.17]"     "(2.17,2.41]"    

You can then pass these break values to cut using the breaks= parameter to make your second cut.
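
As a rough sketch of that second cut, reusing cut_breaks, x and cut1 from above (the second vector y is a made-up example). The recovered breaks are only as precise as the labels, which cut rounds to three significant digits by default, so values very close to an edge can land in a neighbouring bin, and values outside the recovered range become NA:

```r
cut_breaks <- function(x){
  first <- as.numeric(gsub(".{1}(.+),.*", "\\1", levels(x))[1])
  other <- as.numeric(gsub(".+,(.*).{1}", "\\1", levels(x)))
  c(first, other)
}

set.seed(1)
x <- rnorm(100)
cut1 <- cut(x, breaks = 20)

# breaks recovered from the labels of the first cut
brks <- cut_breaks(cut1)

# y is hypothetical new data to be grouped with the same breaks
y <- rnorm(100)
cut2 <- cut(y, breaks = brks)

# counts per group; any y outside the recovered breaks shows up as NA
table(cut2, useNA = "ifany")
```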

Andrie