
I want to check all permutations and combinations of columns while selecting models in R. I have 8 columns in my data set, and the piece of code below lets me check some of the models, but not all of them. Models built on columns such as 1 and 6, or 1, 2, and 5, will never be covered by this loop, because i:j only produces contiguous ranges. Is there a better way to accomplish this?

library(class)  # knn() comes from the class package

k <- 1:27                # candidate values of k (defined elsewhere in my script)
accuracy <- rep(0, length(k))
best_model <- rep(0, 3)  # store the best accuracy, start column, and end column
for(i in 1:8){
  for(j in 1:8){
    for(x in k){
      # i:j only ever selects a contiguous block of columns
      diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j],
                                 cl = diabetes_train_labels, k = x)
      accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction) / 183
      if( best_model[1] < accuracy[x] ){
        best_model[1] <- accuracy[x]
        best_model[2] <- i
        best_model[3] <- j
      }
    }
  }
}


Vipin Verma

3 Answers


Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns, so instead of i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.

Using the sets package, you can form the power set (the set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.

  library(sets)
  my.set <- 1:8  # you want column indices from 1 to 8
  my.power.set <- set_power(my.set)  # the set of all subsets of those indices
  # I don't know how to index into sets by position, so I give the elements
  # character names and index by name instead
  my.names <- as.character(1:length(my.power.set))
  names(my.power.set) <- my.names
  my.indices <- vector("list", length(my.power.set) - 1)
  # element 1 is the empty set, so start at 2; note [[ ]] rather than [ ],
  # which is needed to assign a whole vector into one list slot
  for(i in 2:length(my.power.set)) {my.indices[[i-1]] <- unlist(as.list(my.power.set[[my.names[i]]]))}

I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).

But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
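
For what it's worth, the list of subsets described above can also be built in base R without the sets package; here is a minimal sketch using combn, where combn(8, m, simplify = FALSE) returns all m-element subsets of 1:8 as a list of integer vectors:

  # enumerate every non-empty subset of the 8 column indices
  my.indices <- unlist(
    lapply(1:8, function(m) combn(8, m, simplify = FALSE)),
    recursive = FALSE
  )
  length(my.indices)  # 255, i.e. 2^8 - 1 non-empty subsets

The loop in the question would then run over seq_along(my.indices) and subset with diabetes_training[, my.indices[[i]]].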

J.P.
  • This looks good to me, but I have used the alternative posted in the other answer for now. Thank you for the help – Vipin Verma Apr 16 '17 at 06:26

I solved it; below is the code with explanatory comments:

library(class)        # knn()
library(binaryLogic)  # as.binary()

# find the best model for this data
number_of_columns_to_model <- ncol(diabetes_training) - 1
k <- 1:27             # candidate values of k
accuracy <- rep(0, length(k))
best_model <- c()
best_model_accuracy <- 0
# i = 0 would be the empty model, so loop over 1:(2^n - 1)
# (note the parentheses: 2:2^n-1 parses as (2:2^n)-1 in R)
for(i in 1:(2^number_of_columns_to_model - 1)){
  # convert the value of i to binary, e.g. i=5 gives combination = 0 0 0 0 0 1 0 1
  combination <- as.binary(i, n = number_of_columns_to_model)
  model <- c()
  for(j in 1:length(combination)){  # use j here so the outer loop index isn't clobbered
    # choose which columns to consider depending on the combination
    if(combination[j])
      model <- c(model, j)
  }
  for(x in k){
    # for the columns selected by model, find the accuracy for each k
    diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE],  # data.table syntax
                               test = diabetes_test[, model, with = FALSE],
                               cl = diabetes_train_labels, k = x)
    accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction) / length(diabetes_test_labels)
    if(best_model_accuracy < accuracy[x]){
      best_model_accuracy <- accuracy[x]
      best_model <- model
      print(model)
    }
  }
}
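
As a quick sanity check of the mask-to-columns step (assuming the binaryLogic package is loaded), you can inspect a single value of i:

b <- as.binary(5, n = 8)
b                     # prints 0 0 0 0 0 1 0 1
which(as.logical(b))  # 6 8 -- the two column positions this mask selects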
Vipin Verma
  • I'm sure you feel satisfied at solving the programming problem, but you still have a serious problem of a statistical sort. This approach is very sensitive to the sampling; in general, the results are misleading as to what the "best model" is if you use conventional significance levels. This approach fails to take into account the multiple comparisons and wildly over-estimates the goodness-of-fit measures. – IRTFM Apr 16 '17 at 08:06
  • I agree with what you said. Could you please suggest what should be done to mitigate the problem? – Vipin Verma Apr 16 '17 at 21:52
  • Read up on the "multiple comparisons problem" and consider using criteria for judging model comparisons that appropriately take account of the much higher degrees of freedom you have "expended" during your "all models" comparison. It should not be just the number of variables times (levels minus 1), but should be set much higher. Also look at penalized methods. A more appropriate place to ask is CrossValidated.com. – IRTFM Apr 16 '17 at 22:12
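
To make the penalized-methods pointer concrete, here is a minimal sketch using the glmnet package, assuming the first 8 columns of the data objects from the answer above are the predictors; a lasso-penalized logistic regression shrinks uninformative coefficients to zero instead of exhaustively comparing subsets:

library(glmnet)

# predictors as a numeric matrix and the binary labels (names from the answer above)
x <- as.matrix(diabetes_training[, 1:8, with = FALSE])
y <- diabetes_train_labels

# lasso (alpha = 1) logistic regression with a cross-validated penalty strength
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.1se")  # variables with non-zero coefficients survive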

I trained with Pima.tr and tested with Pima.te. KNN accuracy was 78% with pre-processed predictors and 80% without pre-processing (the difference is due to the large influence of some variables).
The 80% performance is on par with a logistic regression model, and logistic regression does not require you to preprocess the variables. Random forest and logistic regression both provide hints about which variables to drop, so you don't need to try every possible combination. Another way is to look at a matrix scatter plot:

[matrix scatter plot of the predictors, distinguishing type 0 and type 1]

You get a sense that there is a difference between type 0 and type 1 when it comes to npreg, glu, bmi, and age. You also notice that ped and age are highly skewed, and that there may be an anomalous data point between skin and the other variables (you may need to remove that observation before going further). The skin vs. type box plot shows an extreme outlier for the Yes type (try removing it). You also notice that most of the boxes for the Yes type are higher than those for the No type, which suggests these variables may add predictive power to the model (you can confirm this with a Wilcoxon rank-sum test). The high correlation between skin and bmi means you can use one or the other, or an interaction of both. Another approach to reducing the number of predictors is to use PCA.
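
For reference, here is a minimal sketch of that workflow, assuming Pima.tr and Pima.te from the MASS package and knn from the class package (k = 15 gave me the best results):

library(MASS)   # Pima.tr and Pima.te
library(class)  # knn

predictors <- setdiff(names(Pima.tr), "type")
# scale the test set with the training means/SDs so no test information leaks in
train_x <- scale(Pima.tr[, predictors])
test_x  <- scale(Pima.te[, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x, cl = Pima.tr$type, k = 15)
mean(pred == Pima.te$type)  # proportion of correct test predictions

Scaling the test set with the training statistics, rather than re-scaling it on its own, keeps the evaluation honest.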

user34018
  • I tried KNN with scaling, but it yields 100% accuracy every time, which I highly suspect is too good to be true. – Vipin Verma Apr 24 '17 at 04:41
  • Can you share your data? – user34018 Apr 25 '17 at 11:04
  • If you share the data, I can see whether the 100% is suspicious. It is possible that someone with diabetes has very different readings (measures) than someone without, which would explain 100% accuracy. For instance, glucose levels are very different for someone with diabetes than for someone without. – user34018 Apr 25 '17 at 11:13
  • I scaled and centered, and I didn't get 100%. If you use Pima.tr for training and Pima.te for testing, you should have gotten an accuracy of at most 0.7289. Your confusion matrix should look close to this for the best model. – user34018 Apr 25 '17 at 20:20
  • One more thing: you also have to find the best k. In my example, k = 15 gave the best results, which was still only as good as the logistic regression results. – user34018 Apr 26 '17 at 02:14