-1

I am trying to find a fast way to do the following:

  1. Determine the yearly quartile values of a database
  2. Compare (match) a specific variable in a database to its yearly quartile value
  3. Depending on the value, create a new variable with value of 0,1,2,3... (rankorder)

Here is a reproducible example

library(data.table)
dt <- data.table(rep(seq.int(2000,2010,1),30), runif(330,0,5))
colnames(dt) <- c("year","response") # Ignore warning

  quarts <- function(x) {
  quantile(x, probs = seq(0.25,0.75,0.25),na.rm=T, names=T)
}
setkey(dt, year)
a <- data.table(dt[,quarts(response), by = key(dt)])

Now data.table a contains the needed quartile values of dt$response for every year. What I need to do now is to compare the value of dt$response with the quartile values in a and create a new variable dt$quartresponse that takes

  • Value 0 if dt$response[i] is smaller than the 0.25 quartile value for that specific year
  • Value 1 if dt$response[i] is between the 0.25 and 0.5 quartile value for that specific year
  • Value 2 if dt$response[i] is between the 0.50 and 0.75 quartile value for that specific year
  • Value 3 otherwise

I'm sure some kind of loop would work but there must be a more R-like way of solving this.

Any suggestions are welcome!

Simon

SJDS
  • 1,239
  • 1
  • 16
  • 31
  • I honestly do not understand why people give minus points to questions in general and especially not when there is a reproducible example! Any explanation would be welcome... – SJDS Mar 17 '15 at 00:48
  • In the new devel version (which I forgot to mention), just `dcast` would be enough. You could install the devel version from https://github.com/Rdatatable/data.table/wiki/Installation – akrun Mar 17 '15 at 02:06

1 Answers1

2

You could use cut to create the rank groupings for each 'year' after joining the 'wide' format of 'a' i.e. 'a1' with 'dt'

library(data.table) #data.table_1.9.5
a1 <- dcast(a[, ind:=paste0('Quart',1:3)], year~ind, value.var='V1')
res <- setkey(a1, year)[dt][, quartresponse:=cut(response, 
          breaks=c(-Inf,Quart1[1L], Quart2[1L], Quart3[1L],Inf), 
                      labels=FALSE)-1, by=year][, 2:4 := NULL]

 head(res,5)
 #   year response quartresponse
 #1: 2000 4.959491             3
 #2: 2000 2.522881             2
 #3: 2000 4.465005             3
 #4: 2000 0.5421316            0
 #5: 2000 2.2328381            1

head(a1,3)
#  year   Quart1   Quart2   Quart3
#1: 2000 1.703482 2.325766 3.867453
#2: 2001 1.395815 1.972565 3.286358
#3: 2002 1.469664 2.151403 3.359189
akrun
  • 874,273
  • 37
  • 540
  • 662