extracting rows of group medians n R

Question

If I have a data frame like the following:

v2 <- c(4.5, 2.5, 3.5, 5.5, 7.5, 6.5, 2.5, 1.5, 3.5) 
v1 <- c(2.2, 3.2, 1.2, 4.2, 2.2, 3.2, 2.2, 1.2, 5.2) 
lvl <- c("a","a","a","b","b","b","c","c","c") 
d <- data.frame(v1,v2,lvl) 

> d
   v1  v2 lvl
1 2.2 4.5   a
2 3.2 2.5   a
3 1.2 3.5   a
4 4.2 5.5   b
5 2.2 7.5   b
6 3.2 6.5   b
7 2.2 2.5   c
8 1.2 1.5   c
9 5.2 3.5   c

Within each level of d$lvl, I want to extract the row with value of d$v1 being median (for the simplest case, each level of d$lvl has three rows). So I want to get:

   v1  v2 l 
1 2.2 4.5 a 
6 3.2 6.5 b 
7 2.2 2.5 c

What happens if the median isn't in v1 (e.g. if your group has even row numbers)? — sgibb, Sep 17 '13 at 08:01
you are right. this is just the simplest case... majority of my data set has three elements for each level, but there are a few with 2 elements and 1 with 1 element. — user2783615, Sep 17 '13 at 08:48

score 1 · Answer 1 · answered Sep 17 '13 at 06:50

1

There are a couple of ways you can do this:

Check out the plyr package, which is really useful for operating on subsets of data:

library(plyr)
ddply(d, .(lvl), summarize, v1 = median(v1), v2 = median(v2))

Or if you're comfortable with SQL queries, you can use the sqldf package:

library(sqldf)
sqldf("SELECT median(v1) as v1, median(v2) as v2, lvl FROM d GROUP BY lvl")

answered Sep 17 '13 at 06:50

Scott Ritchie

10,293
3
28
64

Thanks, just installed the package. But I am having a hard time to figure out: I am not trying to summarize stats in each group; but I want to extract rows that are the medians for v1 in each group -- extracting a subset of the data – user2783615 Sep 17 '13 at 07:11

score 1 · Accepted Answer · answered Sep 17 '13 at 08:56

For groups with odd number of rows this works. You need to think of how to handle groups with even number of rows. For example you may wish to round the median in one or the other direction, see ?round.

library(plyr)
d2 <- ddply(.data = d, .variables = .(lvl), function(x)
  x[which(x$v1 == median(x$v1)), ])

#    v1  v2 lvl
# 1 2.2 4.5   a
# 2 3.2 6.5   b
# 3 2.2 2.5   c

score 0 · Answer 3 · answered Sep 17 '13 at 09:01

First, calculate the median of v1 by lvl with the function ddply (rounded with 1 decimal)

(install.packages("plyr")
 df <- ddply(d, .(lvl), summarize, v1 = round(median(v1),1))

Second, merge the original df (d) with the calculated one (df), the merge compares where lvl and v1 are the same in the original data (d) and only takes those rows

 df1 <- merge(df, d, by = c("lvl","v1"))

View(df1)
  lvl  v1  v2
1   a 2.2 4.5
2   b 3.2 6.5
3   c 2.2 2.5

score 0 · Answer 4 · answered Sep 17 '13 at 09:55

I like to present a approach that is working with odd and even numbers of rows:

## example data
v2 <- c(4.5, 2.5, 3.5, 5.5, 7.5, 6.5, 2.5, 1.5, 3.5, 1, 1, 1, 1) 
v1 <- c(2.2, 3.2, 1.2, 4.2, 2.2, 3.2, 2.2, 1.2, 5.2, 1.5, 2.5, 3.5, 4.5) 
lvl <- c("a","a","a","b","b","b","c","c","c", "d", "d", "d", "d")
d <- data.frame(v1,v2,lvl)

## define own median index function
medIdx <- function(x) {
  n <- length(x)
  ## even: p == n/2
  ## odd:  p == (n+1)/2
  p <- ceiling(n/2)
  return(which(x == sort(x, partial=p)[p])[1])
}

## run blockwise (blocks defined by d$lvl) and bind results
do.call(rbind, by(d, INDICES=d$lvl, FUN=function(x){ return(x[medIdx(x$v1), ]) }))

#   v1  v2 lvl
#a 2.2 4.5   a
#b 3.2 6.5   b
#c 2.2 2.5   c
#d 2.5 1.0   d

This is nice, Thanks sgibb! – user2783615 Sep 18 '13 at 01:51 — user2783615, Sep 18 '13 at 01:51

extracting rows of group medians n R

4 Answers4

Linked