3

I have data that looks like this:

    score        temp
1 a.score  0.05502011
2 b.score  0.02484594
3 c.score -0.07183767
4 d.score -0.06932274
5 e.score -0.15512460

I want to sort the names by their values, from most negative to most positive, and take the top 4. I tried:

> topfour.values <- apply(temp.df, 2, function(xx)head(sort(xx), 4, na.rm = TRUE, decreasing = FALSE))
> topfour.names  <- apply(temp.df, 2, function(xx)head(names(sort(xx)), 4, na.rm = TRUE))
> topfour        <- rbind(topfour.names, topfour.values)

and I get

> topfour.values
                        temp[, 1]                           
    d.score              "-0.06932274"            
    c.score              "-0.0718376680"          
    e.score              "-0.1551246"             
    b.score              " 0.02484594"   

What order is this? What did I do wrong and how do I get it sorted properly?

I've also tried method = "quick" and method = "shell" as options, but the order still doesn't make sense.

Paulo E. Cardoso
Hack-R
  • http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns-in-r/6871968#6871968 – Ari B. Friedman Jun 02 '14 at 20:44
  • Thanks, Ari. I'm just sorting by one column though, do I really have to resort to plyr? Does sort not actually put things in any particular order? – Hack-R Jun 02 '14 at 20:48
  • `sort` sorts vectors. Because you want to sort a vector by another, you want `order`. And you don't have to resort to `plyr`: there are a bajillion other ways to do it at that link. – Ari B. Friedman Jun 02 '14 at 20:54
  • OK, thanks. So on matrices and data frames sort just acts kind of randomly or something? Thanks for the link/info. – Hack-R Jun 02 '14 at 20:57
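
To make the `sort` versus `order` distinction from the comments concrete, here is a minimal sketch. The data frame below is a hypothetical reconstruction of the table in the question (with `temp` already numeric), not the asker's actual object.

```r
# Hypothetical data mirroring the table in the question
temp.df <- data.frame(
  score = c("a.score", "b.score", "c.score", "d.score", "e.score"),
  temp  = c(0.05502011, 0.02484594, -0.07183767, -0.06932274, -0.15512460),
  stringsAsFactors = FALSE
)

# sort() returns only the sorted values; the link to `score` is lost
sorted.values <- sort(temp.df$temp)

# order() returns the row positions that would sort the vector
idx <- order(temp.df$temp)

# Use those positions to reorder whole rows, then keep the first four
topfour <- head(temp.df[idx, ], 4)
print(topfour)
```

Because `order` returns positions rather than values, the same index can reorder every column of the data frame at once.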

2 Answers

3

I suspect your data is arriving in R as the wrong type. It would be useful to know how you are getting your data into R. In the example above you are sorting a character vector, not a numeric one.

head(with(df, df[order(temp), ]), 4)
    score        temp
5 e.score -0.15512460
3 c.score -0.07183767
4 d.score -0.06932274
2 b.score  0.02484594
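
One way to check the type suspicion above is to inspect the column classes and convert where needed. This is only a sketch: the data frame `df` below is made up, and it assumes the numeric column arrived as text, which may not be what happened in your case.

```r
# Hypothetical data frame whose numeric column arrived as text
df <- data.frame(
  score = c("a.score", "b.score", "c.score", "d.score", "e.score"),
  temp  = c("0.05502011", "0.02484594", "-0.07183767",
            "-0.06932274", "-0.15512460"),
  stringsAsFactors = FALSE
)

# Inspect the class of every column before touching anything
before <- sapply(df, class)
print(before)

# Convert to numeric; going through as.character() is also safe for factors,
# where as.numeric() alone would return the internal level codes
df$temp <- as.numeric(as.character(df$temp))

after <- sapply(df, class)
print(after)
```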

Following Greg Snow's suggestion, and given that you only want the vector of top values, note that the partial argument cannot be used here because it is not implemented for sort.list. A simple speed test comparing order and sort.list suggests that the difference may be negligible, even for a vector of 1e7 elements.

df1 <- data.frame(temp = rnorm(1e+7),
                  score = sample(letters, 1e+7, replace = TRUE))

library(microbenchmark)
microbenchmark(
  head(with(df1, df1[order(temp), 1]), 4),
  head(with(df1, df1[sort.list(temp), 1]), 4),
  head(df1[order(df1$temp), 1], 4),
  head(df1[sort.list(df1$temp), 1], 4),
  times = 1L
  )

Unit: seconds
                                        expr      min       lq   median       uq      max neval
     head(with(df1, df1[order(temp), 1]), 4) 13.42581 13.42581 13.42581 13.42581 13.42581     1
 head(with(df1, df1[sort.list(temp), 1]), 4) 13.80256 13.80256 13.80256 13.80256 13.80256     1
            head(df1[order(df1$temp), 1], 4) 13.88580 13.88580 13.88580 13.88580 13.88580     1
        head(df1[sort.list(df1$temp), 1], 4) 13.13579 13.13579 13.13579 13.13579 13.13579     1
Paulo E. Cardoso
  • Thank you. That's perfect. I will mark this as the answer if no one posts a better one (it's too soon to mark as the answer). By the way, could you clarify the difference between df and temp? In my example the name of the dataframe was temp.df and it showed a column name of temp[,1] because temp.df was the first column of an object I had named temp. – Hack-R Jun 02 '14 at 20:50
  • Temp was a matrix in my first attempt then I made it a data frame and got the same results – Hack-R Jun 02 '14 at 20:56
  • The data comes from SQL Server via sqlQuery. > class(temp.df) [1] "data.frame" – Hack-R Jun 02 '14 at 21:04
  • @NerdLife You're being very sloppy with names. In your comment above, you say your data was `temp.df`, and your post shows a column named `temp`. Also in your comment, you say `temp[, 1]` is a column, which is true, it's the first column of some possibly bigger object named `temp`, but it can't be inside `temp.df`... Then you say `temp` was a matrix. Ari is referring to the one `temp` that we've seen in your post, which is a column in the data.frame (`temp.df`), and the question is about the class of that column. – Gregor Thomas Jun 02 '14 at 21:04
2

There are several problems, some of which have been discussed in the comments, but one big one that I have not seen mentioned yet is that apply works on matrices, so it converts your data frame to a matrix before doing anything else. Since your data has both a factor and a numeric variable, the numbers are converted to character strings, and the sorting is done on the character string representation, not the numerical value. Using tools that work directly with data frames (and lists) prevents this, as does using order and avoiding apply altogether.
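
A small sketch of the coercion described above, using a made-up data frame `df` shaped like the one in the question (the object and variable names here are illustrative assumptions):

```r
# Hypothetical data frame with one factor column and one numeric column
df <- data.frame(
  score = factor(c("a.score", "b.score", "c.score", "d.score", "e.score")),
  temp  = c(0.05502011, 0.02484594, -0.07183767, -0.06932274, -0.15512460)
)

# apply() first turns the data frame into a matrix; with mixed column
# types the result is a character matrix
m <- as.matrix(df)
print(is.character(m))

# So every column that apply() hands to your function is character
col.classes <- apply(df, 2, class)
print(col.classes)

# sapply() walks the data frame as a list and keeps each column's own type
direct.classes <- sapply(df, class)
print(direct.classes)

# Sorting text is not the same as sorting numbers
as.text   <- sort(c("10", "9", "2"))
as.number <- sort(c(10, 9, 2))
print(as.text)
print(as.number)
```

When the matrix is built, numeric columns are formatted to a common width, which is probably where the leading space in " 0.02484594" in the question's output comes from.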

Also, if you only want the $n$ largest or smallest values then you may be able to speed things up a little by using sort.list instead of order and specifying the partial argument.

Greg Snow
  • Any speed-up from sort.list will only be noticeable for very, very large vectors, right? – Paulo E. Cardoso Jun 02 '14 at 21:27
  • @PauloCardoso, the speedup will be much more visible on large vectors than small vectors. – Greg Snow Jun 02 '14 at 21:33
  • If opting for sort.list in this particular case, what would the partial argument be? – Paulo E. Cardoso Jun 02 '14 at 21:41
  • @PauloCardoso, to get the 1st 4 values (lowest values) then use `partial=1:4`. But I just tried it and the partial argument is not implemented for `sort.list` yet, so until that is implemented there really is no speed advantage. The partial argument works with `sort`, but that only sorts a single vector, not a data frame or the like. – Greg Snow Jun 02 '14 at 21:57
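
For reference, a minimal sketch of the partial sort mentioned in the last comment, applied to a plain numeric vector. The vector `temp` is just the five values from the question; as noted above, this returns values only, so it cannot reorder the rows of a data frame.

```r
# The five values from the question as a plain numeric vector
temp <- c(0.05502011, 0.02484594, -0.07183767, -0.06932274, -0.15512460)

# Partial sort: positions 1 to 4 of the result hold the four smallest
# values in their correct order; the remaining elements are not
# guaranteed to be sorted among themselves
partly <- sort(temp, partial = 1:4)

# Keep just the four smallest values
smallest.four <- partly[1:4]
print(smallest.four)
```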