
I have a dataset with 36400 columns/features/predictors (types of proteins) and 500 observations; the last column is the response column "class", which indicates 2 types of cells, A and B. We're supposed to perform feature selection to reduce the number of predictors that help differentiate the 2 cell types. The first step was to remove all columns whose max value is less than 2. I did the below to achieve this and reduced the number of predictors to 26000:

newdf <- protein2 %>%
  # keep columns whose max value is greater than or equal to 2
  select_if(~ max(., na.rm = TRUE) >= 2)
ncol(newdf)
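(Side note: `select_if()` is superseded in recent dplyr releases; the same filter can be written with `select(where())`. An equivalent sketch, assuming dplyr >= 1.0.0:)

```r
library(dplyr)

# same rule as above: keep columns whose max value is >= 2
newdf <- protein2 %>%
  select(where(~ max(., na.rm = TRUE) >= 2))
ncol(newdf)
```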

To reduce further, we're expected to remove weak predictors by performing an ANOVA test on each predictor and dropping those with p-value >= 0.01. I think I did it right using the code below:

scores <- as.data.frame(apply(newdf[, -ncol(newdf)], 2, anovaScores, newdf$class))
scores
new_scores <- scores[scores < 0.01]

I'm not sure why, but I can't confirm my results using ncol() or colnames() or similar. length(new_scores) gives 2084, which is in the range of reduced predictors my professor is expecting. But I need someone to confirm whether this was the right way to go about it. And if so, why am I not able to split my data into training and testing datasets? When trying that, I get the error

Error in new_scores$class : $ operator is invalid for atomic vectors.

This is how I'm splitting training and testing dataset:

intrain <- createDataPartition(y = new_scores$class, p = 0.8, list = FALSE) # split data
assign("training", new_scores[intrain, ])
assign("testing", new_scores[-intrain, ])

The problem is in the createDataPartition line, but I'm not sure whether something I did in the prior steps is incorrect or I'm missing something.
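(The `$` error happens because `scores[scores < 0.01]` returns the p-values themselves as an atomic vector, not a subset of the data. One way to keep the original data frame's columns instead is to filter by name; a sketch using the question's own objects, assuming `scores` holds one p-value per predictor:)

```r
# named vector of p-values, one per predictor column
p_vals <- apply(newdf[, -ncol(newdf)], 2, anovaScores, newdf$class)

# subset the original data frame, keeping the response column "class"
reduced <- newdf[, c(names(p_vals)[p_vals < 0.01], "class")]
reduced$class  # reduced is still a data.frame, so $ works here
```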

Not sure how to provide reproducible data, but below is a snippet of the data, with the last column being the response variable class and the rest all predictors:

     X  Y    Z     A  B  C     class
     3  4.5  3     4  8  10.1  A
     9  6    2.5   6  4  4     B
     4  3.8  4     9  6  8.2   B
     6  7.1  6     7  4  8     A
     4  5.6  9     5  3  7.5   A
Heena
  • Per `r` tag (hover to see): Use `dput()` for data and specify all non-base packages with `library()` calls. – Parfait Sep 21 '19 at 22:37

1 Answer


You can do it like this; I used an example dataset:

library(caret)
library(mlbench)
data(Sonar)
newdf = Sonar

It makes sense to split the data into train and test first (see the comments below by @missuse for details and other possible alternatives):

intrain <- createDataPartition(y = newdf$Class, p = 0.8, list = FALSE) # split data
training = newdf[intrain, ]
test = newdf[-intrain, ]

We calculate the scores; apply() returns a named vector of p-values here, not a data frame:

scores <- apply(training[, -ncol(training)], 2, anovaScores, training$Class)
table(scores<0.01)
FALSE  TRUE 
   33    27

We expect to get back the 27 predictor columns with p < 0.01. We build a vector of the columns to retain (including the dependent variable) and subset both data frames with it:

keep = c(which(scores<0.01),ncol(training))

training = training[,keep]
test = test[,keep]

> dim(training)
[1] 167  28
> dim(test)
[1] 41  28

And you can run caret from here.
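(For example, a minimal next step might look like the following; the method, `"glm"`, and the resampling settings are illustrative choices, not part of the answer above:)

```r
# fit a simple logistic regression with 5-fold CV on the filtered data
fit <- train(Class ~ ., data = training,
             method = "glm",
             trControl = trainControl(method = "cv", number = 5))

# evaluate on the held-out set
confusionMatrix(predict(fit, newdata = test), test$Class)
```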

StupidWolf
  • This will produce data leakage, since you are using the test set data to get the scores. The best way is to include the filter in model building and perform it for each train-test split. More info: https://topepo.github.io/caret/feature-selection-using-univariate-filters.html. A more intuitive way of doing so is using [mlr3pipelines](https://mlr3pipelines.mlr-org.com/) along with [mlr3filters](https://mlr3filters.mlr-org.com/). – missuse Oct 08 '20 at 10:56
  • you are correct. I was just solving it programming wise, i.e getting the function to work. Sounds great, mlr3pipelines maybe you wanna write an answer? I can delete mine – StupidWolf Oct 08 '20 at 11:14
  • Another approach for feature filters compatible with caret, but with intuitive usage, is the [FSinR](https://cran.r-project.org/web/packages/FSinR/vignettes/FSinR.html) package. I might opt to write an answer but atm I have no time. In truth I'd prefer if you edited yours so I could +1 it. Another useful blog post: https://towardsdatascience.com/feature-selection-by-filtering-what-could-go-wrong-spoiler-a-lot-5d3bab16317 – missuse Oct 08 '20 at 11:55
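(To make the leakage comment concrete: caret's `sbf()` re-applies a univariate filter inside each resample, which is the approach the linked page describes. A rough sketch using the built-in `rfSBF` function set on the Sonar data from the answer; the resampling settings are illustrative:)

```r
# filter re-applied inside each resample to avoid leakage;
# rfSBF = random forest fit after caret's default univariate filter
filterCtrl <- sbfControl(functions = rfSBF, method = "cv", number = 5)
rfWithFilter <- sbf(x = newdf[, -ncol(newdf)], y = newdf$Class,
                    sbfControl = filterCtrl)
rfWithFilter
```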