I'm quite new to R and use it mainly for visualising statistics with the ggplot2 library. I have now run into a problem with data preparation.

I need to write a function that removes a given number of rows (2, 5 or 10) with the highest and lowest values in a specified column from a data frame and puts them into another data frame, and does this for each combination of two factors (in my case, for each day and server).

So far I have done the following (MWE using the esoph example dataset).

I have sorted the data frame by the desired column (ncontrols in this example):

esoph <- esoph[with(esoph, order(-ncontrols)), ]

I can display the first/last records for each factor level (in this example, for each age range):

by(data=esoph,INDICES=esoph$agegp,FUN=head,3)
by(data=esoph,INDICES=esoph$agegp,FUN=tail,3)

So basically, I can see the highest and lowest values, but I don't know how to extract them into another data frame and how to remove them from the main one.

Also, in the example above I see the top/bottom records for each value of one factor (age range), but in reality I need the highest and lowest records for each combination of two factors; in this example they could be agegp and alcgp.

I am not even sure whether the steps above are OK. Perhaps using plyr would work better? I'd appreciate any hints.

Gavin Simpson
Paweł Rumian
  • So you simply want to remove the first and last X rows of a data frame and create a second data frame that contains these rows? – Ernest A Nov 16 '12 at 10:57
  • Not simply first and last, but the highest and lowest values (of one column) for each combination of two factors. So for two days and two servers I need the top and bottom 5 for server1 and server2 on day1, and the top and bottom 5 for server1 and server2 on day2. – Paweł Rumian Nov 16 '12 at 13:53

2 Answers

Yes, you can use plyr as follows:

library(plyr)  # provides ddply()

esoph <- data.frame(agegp = sample(letters[1:2], 20, replace = TRUE),
                    alcgp = sample(LETTERS[1:2], 20, replace = TRUE),
                    ncontrols = runif(20))

# Extract the rows holding the minimum and maximum ncontrols in each agegp/alcgp group
ddply(esoph, c("agegp", "alcgp"),
      function(x){idx <- unique(c(which.min(x$ncontrols),
                                  which.max(x$ncontrols)))
                  x[idx, , drop = FALSE]})
#   agegp alcgp  ncontrols
# 1     a     A 0.03091483
# 2     a     A 0.88529790
# 3     a     B 0.51265447
# 4     a     B 0.86111649
# 5     b     A 0.28372232
# 6     b     A 0.61698401
# 7     b     B 0.05618841
# 8     b     B 0.89346943

# Keep everything except those minimum and maximum rows
ddply(esoph, c("agegp", "alcgp"),
      function(x){idx <- c(which.min(x$ncontrols),
                           which.max(x$ncontrols))
                  x[-idx, , drop = FALSE]})
#    agegp alcgp ncontrols
# 1      a     A 0.3745029
# 2      a     B 0.7621474
# 3      a     B 0.6319013
# 4      b     A 0.3055078
# 5      b     A 0.5146028
# 6      b     B 0.3735615
# 7      b     B 0.2528612
# 8      b     B 0.4415205
# 9      b     B 0.6868219
# 10     b     B 0.3750102
# 11     b     B 0.2279462
# 12     b     B 0.1891052

There are possibly many alternatives, e.g. using head and tail if your data is already sorted, but this should work.
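The head/tail alternative mentioned above could look roughly like the following, which also generalises from one row to n rows at each end. This is only a sketch, written in base R so it needs no extra package; `split_extremes()` and its arguments are hypothetical names made up for illustration, and the usage line assumes an `esoph` data frame with `agegp`, `alcgp` and `ncontrols` columns (the built-in dataset works).

```r
# Hypothetical helper: split_extremes() is not part of plyr or base R.
split_extremes <- function(df, value, groups, n = 2) {
  # Sort by the grouping columns, then by the value column within each group.
  ord <- do.call(order, c(unname(as.list(df[groups])), list(df[[value]])))
  df <- df[ord, , drop = FALSE]
  # Row positions per group combination; drop = TRUE discards empty combinations.
  pieces <- split(seq_len(nrow(df)), df[groups], drop = TRUE)
  # First n and last n positions in each group; unique() guards small groups
  # where the two ends overlap.
  idx <- unlist(lapply(pieces, function(i) unique(c(head(i, n), tail(i, n)))),
                use.names = FALSE)
  list(extremes = df[idx, , drop = FALSE],
       rest     = df[setdiff(seq_len(nrow(df)), idx), , drop = FALSE])
}

res <- split_extremes(esoph, value = "ncontrols",
                      groups = c("agegp", "alcgp"), n = 2)
```

`res$extremes` then holds up to 2 * n rows per agegp/alcgp combination and `res$rest` holds the remaining rows. Groups with fewer than 2 * n rows end up entirely in `res$extremes`.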

flodel
Using base R:

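# Keep rows whose ncontrols equals the max or min of their agegp/alcgp group (tied rows are all kept)
newesoph <- esoph[esoph$ncontrols == ave(esoph$ncontrols, list(esoph$agegp, esoph$alcgp), FUN = max)
                | esoph$ncontrols == ave(esoph$ncontrols, list(esoph$agegp, esoph$alcgp), FUN = min), ]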
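The same `ave()` idea might be extended to the n highest and n lowest rows per group by ranking within each group instead of comparing against the max and min. The sketch below is one possible way, not a tested general solution; `n`, `is_extreme`, `extremes` and `trimmed` are illustrative names, and it assumes `esoph` has `agegp`, `alcgp` and `ncontrols` columns without missing values.

```r
n <- 2  # rows to take from each end of every group (illustrative value)

# Rank ncontrols within each agegp/alcgp combination. ties.method = "first"
# breaks ties by row order, so at most n rows are flagged at each end.
r_low  <- ave(esoph$ncontrols, esoph$agegp, esoph$alcgp,
              FUN = function(v) rank(v, ties.method = "first"))
r_high <- ave(esoph$ncontrols, esoph$agegp, esoph$alcgp,
              FUN = function(v) rank(-v, ties.method = "first"))

is_extreme <- r_low <= n | r_high <= n
extremes <- esoph[is_extreme, ]   # the removed rows
trimmed  <- esoph[!is_extreme, ]  # the main data frame without them
```

Unlike the equality test against the group max and min, this flags a fixed number of rows per group even when several rows share the extreme value.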
ARobertson