2

Say I have the following data frame:

df <- data.frame(store       = LETTERS[1:8],
                 sales       = c(  9, 128,  54,  66,  23, 132,  89,  70),
                 successRate = c(.80, .25, .54, .92, .85, .35, .54, .46))

I want to rank the stores according to successRate, with ties going to the store with more sales, so first I do this (just to make visualization easier):

df <- df[order(-df$successRate, -df$sales), ]

In order to actually create a ranking variable, I do the following:

df$rank <- ave(df$successRate, FUN = function(x) rank(-x, ties.method='first'))

So df looks like this:

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85    2
1     A     9        0.80    3
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8

The problem is I don't want small stores to be part of the ranking. Specifically, I want stores with less than 50 sales not to be ranked. So this is how I define df$rank instead:

df$rank <- ifelse(df$sales < 50, NA,
                  ave(df$successRate, FUN = function(x) rank(-x, ties.method='first')))

The problem is that even though this correctly removes stores E and A, it doesn't reassign the rankings they were occupying. df looks like this now:

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8

I've experimented with conditions inside and outside ave(), but I can'r get R to do what I want! How can I get it to rank the stores like this?

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    2
3     C    54        0.54    3
8     H    70        0.46    4
6     F   132        0.35    5
2     B   128        0.25    6
Waldir Leoncio
  • 10,853
  • 19
  • 77
  • 107

1 Answers1

4

Super easy to do with data.table:

library(data.table)
dt = data.table(df)

# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]

dt[sales >= 50, rank := .I]
dt
#   store sales successRate rank
#1:     D    66        0.92    1
#2:     E    23        0.85   NA
#3:     A     9        0.80   NA
#4:     G    89        0.54    2
#5:     C    54        0.54    3
#6:     H    70        0.46    4
#7:     F   132        0.35    5
#8:     B   128        0.25    6

If you must do it in data.frame, then after your preferred order, run:

df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Awesome! I wonder if there's solution that doesn't require using `data.table`. – Waldir Leoncio Aug 09 '13 at 18:39
  • @wleoncio there is, see edit; I don't use `data.frame` anymore though and the above is just one in a multitude of examples of why :) – eddi Aug 09 '13 at 18:42
  • Nice edit, but the `data.frame` solution is not ordering by successRate, but by sales. ;) – Waldir Leoncio Aug 09 '13 at 18:43
  • 1
    @wleoncia those two lines assume data is already ordered as per OP – eddi Aug 09 '13 at 18:44
  • Oh, right, I got things mixed up here. I confess I've never used `data.table` and I'm a little resistant to use it in old scripts, but I've been reading so many wonderful things about it I'll study the package and try to apply it on my future projects. Thanks for the help! – Waldir Leoncio Aug 09 '13 at 18:46