
First, I’d like to apologize: I’m learning R on my own, so I couldn’t simplify my problem and decided to just write a short version of my real variables here. I’m trying to implement a variant of the maximum-likelihood classifier in R. I have some variables for each class stored in vectors and lists (each position refers to one class), and I want to apply a function to the rows of a matrix that contains the data I want to classify. The problem is that I need the results of that function separated by class. So far, I’m doing this:

cc<-numeric(2)

mm<-vector("list",2)

ii<-vector("list",2)

temp1<-matrix(nrow=16,ncol=6)
temp1<-as.data.frame(temp1)
temp1[]<-c(256,235,194,235,215,173,215,215,194,215,215,215,194,173,152,215,
           430,388,388,388,388,430,430,430,388,346,346,388,388,388,346,388,
           283,317,283,283,248,283,283,283,214,214,248,283,214,283,214,248,
           3701,3450,3576,3826,3534,3450,3868,4035,3450,3493,3450,3701,3534,3242,3032,3116,
           1646,1589,1589,1646,1646,1589,1646,1732,1560,1475,1589,1589,1675,1532,1503,1418,
           474,556,556,515,556,556,597,637,556,515,515,515,515,515,434,434)


temp2<- matrix(nrow=11,ncol=6)
temp2<-as.data.frame(temp2)
temp2[]<-c(422,463,462,483,546,525,483,566,546,483,546,
           770,812,770,812,854,854,812,939,939,854,981,
           1038,1175,1004,1141,1209,1209,1038,1311,1311,1175,1311,
           2359,2359,2275,2359,2359,2359,2359,2401,2359,2401,2401,
           2445,2531,2417,2588,2759,2617,2388,2674,2730,2645,2731,
           1413,1413,1373,1495,1618,1535,1413,1535,1659,1535,1618)


cc[1]<-det(cov(temp1))
cc[2]<-det(cov(temp2))

mm[[1]]<-as.numeric(sapply(temp1,"mean"))
mm[[2]]<-as.numeric(sapply(temp2,"mean"))



ii[[1]]<-solve(cov(temp1))
ii[[2]]<-solve(cov(temp2))




data<-matrix(nrow=10,ncol=6)
data<-as.data.frame(data)
data[]<-c(181,203,224,203,203,224,181,181,161,161,
          338,338,338,338,296,296,338,381,338,296,
          208,242,208,208,208,208,208,242,208,173,
          3164,2954,2660,2787,2744,2787,2534,3457,2870,2912,
          1476,1505,1391,1332,1304,1391,1132,1591,1448,1304,
          474,474,474,515,392,432,432,556,515,474)



for (k in 1:2){
  Pxi<-apply(data,1,function(x)
    1/(2*pi^(6/2)*cc[k]^(1/2))*
      exp(-1/2*t(as.numeric(x-mm[[k]]))%*%ii[[k]]%*%(as.numeric(x-mm[[k]]))))

  if (k==1) {rule<-Pxi} else {rule<-cbind(rule,Pxi)}
}


This is the result I get:

rule
              rule           Pxi
 [1,] 4.316396e-13  0.000000e+00
 [2,] 6.835553e-15 7.970888e-284
 [3,] 8.674921e-21 2.687251e-145
 [4,] 5.923777e-19 8.020048e-189
 [5,] 5.627127e-16 8.064007e-184
 [6,] 2.495667e-17 5.738550e-209
 [7,] 6.311390e-22  8.913098e-97
 [8,] 1.413893e-12  0.000000e+00
 [9,] 5.521715e-15 1.619401e-221
[10,] 5.212091e-17 5.810407e-254

Well, as you can imagine, data is actually much bigger than in this example, and this last loop takes a very long time when the number of classes is large. Any suggestions on how to make it faster?

2 Answers


Using cbind() in a loop is very expensive. Instead, assign each iteration's intermediate result to a list and call do.call(cbind, rule) once after the loop.
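A minimal sketch of that pattern (f() here is just a hypothetical stand-in for the per-class density computation, not code from the question):

```r
f <- function(k) k * (1:5)           # hypothetical per-class result
rule <- vector("list", 2)            # pre-allocate one slot per class
for (k in 1:2) rule[[k]] <- f(k)     # no cbind() inside the loop
rule <- do.call(cbind, rule)         # bind all columns once at the end
dim(rule)                            # 5 x 2
```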

Regarding why the apply() statement is slow: many separate operations are carried out for each row of data as the loop works through it. It is better to do the work all at once with matrix operations (or a vectorized function).

This answer uses the mahalanobis() function to simplify what's inside the exp() call. It turns out that mahalanobis() uses exactly the same approach as @chinsoon12's answer.

1 / (2*pi^(6/2)*det(cov(temp1))^(1/2))*exp(-1 / 2 * mahalanobis(data, colMeans(temp1), cov(temp1)))

mahalanobis
#function (x, center, cov, inverted = FALSE, ...) 
#{
#    x <- if (is.vector(x)) 
#        matrix(x, ncol = length(x))
#    else as.matrix(x)
#    if (!isFALSE(center)) 
#        x <- sweep(x, 2L, center)
#    if (!inverted) 
#        cov <- solve(cov, ...)
#    setNames(rowSums(x %*% cov * x), rownames(x))
#}
#<bytecode: 0x000000000c217d80>
#<environment: namespace:stats>
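To convince yourself that mahalanobis() really computes the same quadratic form as the manual t(x - mu) %*% solve(S) %*% (x - mu), here is a quick check on random data (a self-contained sketch, not part of the original answer):

```r
set.seed(1)
X  <- matrix(rnorm(60), ncol = 6)   # 10 observations, 6 variables
S  <- cov(X)
mu <- colMeans(X)

d_mahal  <- mahalanobis(X, mu, S)   # built-in
m        <- sweep(X, 2L, mu)        # center each column
d_manual <- rowSums(m %*% solve(S) * m)

all.equal(d_mahal, d_manual)        # TRUE
```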

I would approach this by first making a list() of your temp data.frames and then using lapply() to loop through them:

tmps <- list(temp1, temp2)

do.call(cbind,
        lapply(tmps,
               function(tmp) {
                 n = length(tmp)
                 cov_tmp <- cov(tmp)
                 1 / (2*pi^(n/2)*det(cov_tmp)^(1/2))*exp(-1 / 2 * mahalanobis(data, colMeans(tmp), cov_tmp))
                 }
               )
)

              [,1]          [,2]
 [1,] 4.316396e-13  0.000000e+00
 [2,] 6.835553e-15 7.970888e-284
 [3,] 8.674921e-21 2.687251e-145
 [4,] 5.923777e-19 8.020048e-189
 [5,] 5.627127e-16 8.064007e-184
 [6,] 2.495667e-17 5.738550e-209
 [7,] 6.311390e-22  8.913098e-97
 [8,] 1.413893e-12  0.000000e+00
 [9,] 5.521715e-15 1.619401e-221
[10,] 5.212091e-17 5.810407e-254

Reference: http://sar.kangwon.ac.kr/etc/rs_note/rsnote/cp11/cp11-7.htm

Cole
  • I've corrected the code. Sorry about that. Your solution using _lapply_ was exactly what I was looking for. Thank you. However, processing time did not really improve. – Mariane Reis Nov 04 '19 at 20:02
  • thanks for sharing the info on `mahalanobis` distance. the max likelihood classifier makes sense now – chinsoon12 Nov 06 '19 at 01:34

It should be faster if you work in matrices. Here is a suggestion to replace the for loop:

data <- as.matrix(data)
const <- 2*pi^(6/2)
do.call(cbind, lapply(1L:2L, function(k) {
    m <- sweep(data, 2L, mm[[k]])
    #1/(const*cc[k]^(1/2))* exp(-1/2 * diag(m %*% ii[[k]] %*% t(m)))
    1/(const*cc[k]^(1/2))* exp(-1/2 * rowSums((m %*% ii[[k]]) * m))
}))

The use of rowSums (instead of the original diag(m %*% ii[[k]] %*% t(m))) comes from the question "compute only diagonals of matrix multiplication in R".
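The identity behind that trick is diag(M %*% A %*% t(M)) == rowSums((M %*% A) * M): each diagonal entry is the row-wise dot product, so the full n-by-n product is never formed. A quick self-contained check on random matrices:

```r
set.seed(42)
M <- matrix(rnorm(24), nrow = 4)       # 4 x 6
A <- crossprod(matrix(rnorm(36), 6))   # 6 x 6 symmetric
slow <- diag(M %*% A %*% t(M))         # builds the full 4 x 4 product first
fast <- rowSums((M %*% A) * M)         # computes only the diagonal entries
all.equal(slow, fast)                  # TRUE
```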

output:

              [,1]          [,2]
 [1,] 4.316396e-13  0.000000e+00
 [2,] 6.835553e-15 7.970888e-284
 [3,] 8.674921e-21 2.687251e-145
 [4,] 5.923777e-19 8.020048e-189
 [5,] 5.627127e-16 8.064007e-184
 [6,] 2.495667e-17 5.738550e-209
 [7,] 6.311390e-22  8.913098e-97
 [8,] 1.413893e-12  0.000000e+00
 [9,] 5.521715e-15 1.619401e-221
[10,] 5.212091e-17 5.810407e-254
chinsoon12