2

Objective: Finding the lowest n values of each row from a matrix or data frame. For this example we want to find the 3 lowest values of each row. We want to return a matrix with

rowname | colname_min | value_min | colname_min2 | value_min2 | colname_min3 | value_min3

Point of departure: I modified the answer from this question: R getting the minimum value for each row in a matrix, and returning the row and column name

Here is my modified code:

    df<-data.frame(matrix(data=round(x=rnorm(100,10,1),digits=3),nrow=10),
               row.names=c("A","B","C","D","E","F","G","H","I","J"))
    colnames(df)<-c("AD","BD","CD","DD","ED","FD","GD","HD","ID","JD")

   result <- t(sapply(seq(nrow(df)), function(i) {
  j <- apply(df, 1, function(x){order(x, decreasing=F)[1:3]})
  c(rownames(df)[i], colnames(df)[j[1,i]], as.numeric(df[i,j[1,i]]),
    colnames(df)[j[2,i]], as.numeric(df[i,j[2,i]]),
    colnames(df)[j[3,i]], as.numeric(df[i,j[3,i]]))
}))

This is working, and it is working fine for the small example data.frame. However, the data.frame I am working with has 200,000 rows and 300 columns. On my machine the code now runs for ~1 hour and is still working. Any ideas how to optimize the code? I was thinking dplyr, but couldn't find a solution. Help is greatly appreciated.

Community
  • 1
  • 1
Daniel Schultz
  • 320
  • 2
  • 13

1 Answers1

3

You can use this base R solution, which orders each row and picks n.min lowest values and their indices:

Sample Data

N <- 5
n.min <- 2

set.seed(1)
m <- matrix(runif(N^2),N)
rownames(m) <- letters[1:N]
colnames(m) <- LETTERS[1:N]

#           A          B         C         D         E
# a 0.2655087 0.89838968 0.2059746 0.4976992 0.9347052
# b 0.3721239 0.94467527 0.1765568 0.7176185 0.2121425
# c 0.5728534 0.66079779 0.6870228 0.9919061 0.6516738
# d 0.9082078 0.62911404 0.3841037 0.3800352 0.1255551
# e 0.2016819 0.06178627 0.7698414 0.7774452 0.2672207

Code

f <- function(rw) {
  O <- order(rw)[1:n.min]
  rbind(O,rw[O])
}

result <- t(apply(m,1,f))

Output for the sample data

#   [,1]       [,2] [,3]      [,4]
# a    3 0.20597457    1 0.2655087
# b    3 0.17655675    5 0.2121425
# c    1 0.57285336    5 0.6516738
# d    5 0.12555510    4 0.3800352
# e    2 0.06178627    1 0.2016819

Update

If you'd like to replace column numbers by the column names, you could additionaly do:

d <- as.data.frame(result)
d[,c(T,F)] <- colnames(m)[unlist(d[,c(T,F)])]

Note that it this way, you avoid coercion of numbers to strings and the subsequent backward conversion to the numeric format, which might be costly for large data sets.

Marat Talipov
  • 13,064
  • 5
  • 34
  • 53