I want to use knngow from the dprep package. In addition to returning the predicted label for the test data, I also want the row index of the nearest neighbor in the train data. Is there a function in this package for this? My data is as follows.

df1<-data.frame(c("a","b","c"),c(1,2,3),c("T","F","T"))
df2<-data.frame(c("a","d","f"),c(4,1,3),c("F","F","T"))
mylist1<-list()
mylist1[[1]]<-df1
mylist1[[2]]<-df2
tst1<-data.frame(c("f"),c(2))
library(dprep)
for(i in 1:length(mylist1)){
    knn_model<-knngow(mylist1[[i]],tst1,1)}

In addition to the label, I want the position of the nearest neighbor, for example to show that the nearest neighbor is row 3 of mylist1[[2]].

maria

1 Answer

updated based on your comments

As far as I can see, the dprep package has no function that returns the indices of the nearest neighbors in the train data (hopefully I'm not missing something). What you can do instead is first compute a distance matrix using the Gower distance (FD package) and then pass this matrix to a k-nearest-neighbors function (the KernelKnn package accepts a distance matrix as input). If you decide to use the KernelKnn package, first install the latest version using devtools::install_github('mlampros/KernelKnn').

# train-data    [ "col3" is the response variable, strings are converted to factors ]
df1 <- data.frame(col1 = c("a","d","f"), col2 = c(1,3,2), col3 = c("T","F","T"), stringsAsFactors = TRUE)

# test-data
tst1 <- data.frame(col1 = c("f"), col2 = c(2), stringsAsFactors = TRUE)

# rbind train and test data (remove the response variable from df1)
df_all <- rbind(df1[, -3], tst1)

# calculate distance matrix
dist_gower <- as.matrix(FD::gowdis(df_all))

# use the dist_gower distance matrix as input to the 'distMat.knn.index.dist' function
# additionally, use the 'TEST_indices' parameter to specify which row of 'df_all' is the test-data observation
idxs <- KernelKnn::distMat.knn.index.dist(dist_gower, TEST_indices = c(4), k = 2, threads = 1, minimize = TRUE)

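idxs$test_knn_idx returns the k nearest neighbors of the test-data observation in the train data. Because the test row was appended after the train rows, these indices match the row numbers of df1.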

print(idxs)

$test_knn_idx
     [,1] [,2]
[1,]    3    1

$test_knn_dist
     [,1] [,2]
[1,]    0 0.75
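
As a sanity check, the 0.75 above can be reproduced by hand. This is a minimal sketch assuming the usual Gower definition (numeric columns contribute the absolute difference divided by the column range, factor columns contribute 0 for a match and 1 for a mismatch, and the contributions are averaged); the variable names are only for illustration.

```r
# test row ("f", 2) against train row 1 ("a", 1) and train row 3 ("f", 2)
col2_range <- 3 - 1                        # range of col2 over all rows of df_all
d_col1 <- as.numeric("f" != "a")           # factor mismatch -> 1
d_col2 <- abs(2 - 1) / col2_range          # scaled numeric difference -> 0.5
d_row1 <- mean(c(d_col1, d_col2))          # distance to train row 1
d_row3 <- mean(c(as.numeric("f" != "f"), abs(2 - 2) / col2_range))  # distance to train row 3
print(c(d_row3, d_row1))
```

Under the same assumption, train row 2 ("d", 3) is also at distance 0.75, so the second neighbor is a tie between rows 1 and 2.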

If you also want the class probabilities, first convert the labels to numeric and then use the distMat.KernelKnn function:

# factor level codes: "F" -> 1, "T" -> 2
y_numeric <- as.numeric(df1$col3)

labels <- KernelKnn::distMat.KernelKnn(dist_gower, TEST_indices = c(4), y = y_numeric, k = 2, regression = FALSE, threads = 1, Levels = sort(unique(y_numeric)), minimize = TRUE)

print(labels)

     class_1 class_2
[1,]       0       1

# class_2 corresponds to "T" from col3 (df1 data.frame)

Alternatively, you could take a look at the source of dprep::knngow, especially the second part of the function, which is the part you are interested in:

> print(dprep::knngow)

....
    else {
        for (i in 1:ntest) {

            tempo = order(StatMatch::gower.dist(test[i, -p], train[, -p]))[1:k]

            classes[i] = moda(train[tempo, p])[1]
        }
    }
.....
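
The same idea (order the Gower distances from a test row to every train row and keep the first k positions) can be sketched so that the positions are returned instead of discarded. The code below is a base-R sketch, not the dprep implementation: `gower_to_train` and `knn_gower_index` are hypothetical names, the distance is a hand-rolled simplification of what `StatMatch::gower.dist(test, train)` computes (no missing-value handling), the class label is assumed to be the last column of the train data, and the test data is assumed to hold only the predictors. Train and test stay in separate data frames, so the returned indices are row numbers of the train data.

```r
# Hand-rolled Gower distance between every test row and every train row.
# Numeric columns: |x - y| / range, with the range taken over train and test together.
# All other columns: 0 if the values are equal, 1 otherwise.
gower_to_train <- function(test, train) {
  stopifnot(ncol(test) == ncol(train))
  d <- matrix(0, nrow(test), nrow(train))
  for (j in seq_len(ncol(train))) {
    x <- test[[j]]
    y <- train[[j]]
    if (is.numeric(y)) {
      rng <- diff(range(c(x, y)))
      if (rng == 0) rng <- 1               # constant column contributes 0
      d <- d + abs(outer(x, y, "-")) / rng
    } else {
      d <- d + outer(as.character(x), as.character(y), "!=")
    }
  }
  d / ncol(train)
}

# Hypothetical helper: row indices, distances and labels of the k nearest train rows.
# Assumes the class label is the last column of `train`.
knn_gower_index <- function(train, test, k = 1) {
  p <- ncol(train)
  d <- gower_to_train(test, train[, -p, drop = FALSE])
  idx <- matrix(NA_integer_, nrow(test), k)
  for (i in seq_len(nrow(test))) {
    idx[i, ] <- order(d[i, ])[seq_len(k)]
  }
  list(index = idx,
       dist  = matrix(d[cbind(c(row(idx)), c(idx))], nrow(test), k),
       label = matrix(as.character(train[[p]])[idx], nrow(test), k))
}

# Data from the question (column names added for readability)
df1  <- data.frame(c1 = c("a", "b", "c"), c2 = c(1, 2, 3), c3 = c("T", "F", "T"))
df2  <- data.frame(c1 = c("a", "d", "f"), c2 = c(4, 1, 3), c3 = c("F", "F", "T"))
tst1 <- data.frame(c1 = "f", c2 = 2)

res <- lapply(list(df1, df2), knn_gower_index, test = tst1, k = 1)
print(res[[2]]$index)   # nearest-neighbor row within df2
print(res[[2]]$label)   # its label
```

For the question's data this reports row 3 of the second data frame as the nearest neighbor. The sketch returns the neighbors' labels as they are; a majority vote over `label`, as knngow does with `moda`, would be an additional step.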
lampros
  • Thank you very much for your advice. But the gowdis function calculates the distances between the samples within a single data frame. Consequently, when we pass its matrix to distMat.knn.index.dist, the index returned for each instance is that of the nearest neighbor within the same data frame. My test samples are in one data frame and the train data is in another, so I want the index of the nearest neighbor in the train data for a test instance. Do you have any suggestions for this? Thank you for your help – maria Nov 07 '17 at 20:36