Is it possible to rewrite the following function using a data.table approach?

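myfunction <- function(i) {

  # features of the i-th test row (columns 1 to 21)
  a <- test.dt[i, 1:21, with = FALSE]

  # weight of every feature where the training row matches the test row
  final <- t((t(b) == a) * value)
  final[is.na(final)] <- 0
  sum.value <- rowSums(final)

  # rank the training rows by score and drop rows without any match
  final1 <- cbind(train.dt, sum.value)
  final1 <- final1[order(-sum.value), ]
  final1 <- final1[final1$sum.value > 0, ]

  # top 5 distinct labels (column 22); head() avoids NA rows when fewer than 5 exist
  suggestion <- unique(final1[, 22, with = FALSE])
  suggestion <- head(suggestion, 5)

  return(suggestion)
}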

This is a custom kNN function I wrote for character columns. It returns the top 5 suggestions/predictions. However, it is slow when run on large test data, and I have not managed to speed it up myself so far.

The variables used are as follows:

train.dt -- the training data, includes 22 columns (21 features, 1 label column)
test.dt -- the test data, same structure as training data
value -- a vector that contains the weights/importance values of the 21 features
sum.value -- for each training row, the sum of the weights (from value) of the features that match the test row
b -- has the same data as the training data, but excluding the label column
a -- has the same data as the test data, but excluding the label column
suggestion -- the output

I also want to call this function with lapply (or another suitable member of the apply family). The i argument is the row number in the test data, meaning I want to apply the function to each row of the test data. I have not been able to get this working yet.
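A minimal sketch of the per-row lapply call described above, assuming a toy data set with 3 feature columns instead of 21. The names suggest_one, b.mat, test.mat, labels and n.feat are hypothetical and not part of the original code. The features are converted to character matrices once, outside the function, which is one possible way to avoid subsetting the data.table on every call; it is not necessarily the fastest.

```r
library(data.table)

# Toy data (hypothetical): 3 feature columns instead of 21, plus a label column.
train.dt <- data.table(
  f1 = c("a", "b", "a", "c"),
  f2 = c("x", "x", "y", NA),
  f3 = c("m", "n", "m", "m"),
  label = c("L1", "L2", "L1", "L3")
)
test.dt <- data.table(
  f1 = c("a", "c"),
  f2 = c("x", "y"),
  f3 = c("m", "n"),
  label = c(NA_character_, NA_character_)
)
value <- c(0.5, 0.3, 0.2)  # one weight per feature column
n.feat <- length(value)

# Build the character matrices once instead of once per test row.
b.mat    <- as.matrix(train.dt[, 1:n.feat, with = FALSE])
test.mat <- as.matrix(test.dt[, 1:n.feat, with = FALSE])
labels   <- train.dt[[n.feat + 1]]

# Hypothetical helper: up to k distinct labels for test row i.
suggest_one <- function(i, k = 5) {
  # TRUE where a training feature equals the corresponding test feature
  hit <- t(t(b.mat) == test.mat[i, ])
  hit[is.na(hit)] <- FALSE
  # weighted sum of the matching features, one score per training row
  score <- as.vector((hit + 0) %*% value)
  # training rows with a positive score, best first
  keep <- order(-score)[seq_len(sum(score > 0))]
  head(unique(labels[keep]), k)
}

# Apply the function to every row number of the test data.
suggestions <- lapply(seq_len(nrow(test.dt)), suggest_one)
```

Each call is independent of the others, so the lapply could in principle be swapped for a parallel equivalent such as parallel::parLapply.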

I hope this is clear. Thank you in advance!

Lëmön
  • If the calculations take a long time, why don't you split the data row-wise and use parallel capabilities? – Roman Luštrik Aug 10 '17 at 06:59
  • How can I split the data row-wise? I tried the parallel options, but they didn't work on my end (maybe my doParallel implementation was wrong). – Lëmön Aug 10 '17 at 07:03
  • `split(x, 1:nrow(x))` and then use parSapply from `parallel`. You need to initialize the cluster and possibly load all the dependencies. – Roman Luštrik Aug 10 '17 at 07:06
  • I'll try the approach you mentioned and will comment on the progress. Thanks, by the way. – Lëmön Aug 10 '17 at 07:15
  • This doesn't really utilize data.table. E.g., `rowSums` is a red flag. You should never use it with large data if you don't already have a matrix. You should use `Reduce` within the data.table. There are more issues. However, if you need more help you need to provide a minimal reproducible example. – Roland Aug 10 '17 at 07:26
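One possible reading of the `Reduce` suggestion in the last comment, as a sketch under assumptions: the score is accumulated column by column inside the data.table, so no intermediate matrix and no `rowSums` call are needed. The toy data (3 features instead of 21) and the names feat, x, scored and top are hypothetical.

```r
library(data.table)

# Toy training data (hypothetical): 3 feature columns plus a label column.
train.dt <- data.table(
  f1 = c("a", "b", "a", "c"),
  f2 = c("x", "x", "y", NA),
  f3 = c("m", "n", "m", "m"),
  label = c("L1", "L2", "L1", "L3")
)
value <- c(0.5, 0.3, 0.2)        # one weight per feature column
feat  <- names(train.dt)[1:3]    # feature column names
x     <- c("a", "x", "m")        # features of a single test row

scored <- copy(train.dt)         # keep the original table unchanged

# For each feature column: weight where it matches the test value, else 0.
# Reduce(`+`, ...) then adds these per-column vectors into one score column.
scored[, sum.value := Reduce(`+`, Map(function(col, xi, w) {
  m <- col == xi
  m[is.na(m)] <- FALSE
  m * w
}, .SD, x, value)), .SDcols = feat]

# Up to 5 distinct labels, best score first, rows without any match dropped.
top <- head(unique(scored[sum.value > 0][order(-sum.value), label]), 5)
```

The same expression could be wrapped in a function of the test row number and called once per row. Whether it beats a matrix-based version would need to be measured on the real data.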
