I have a dataset with 36400 columns/features/predictors (types of proteins) and 500 observations. The last column is the response column "class", which indicates 2 types of cells, A and B. We're supposed to perform feature selection to reduce the predictors to those that help differentiate the 2 cell types. The first step was to remove all columns whose max value was less than 2. I did the below to achieve this, which reduced the number of predictors to 26000:
library(dplyr)
# Keep only columns whose max value is greater than or equal to 2
newdf <- protein2 %>%
  select_if(~ max(., na.rm = TRUE) >= 2)
ncol(newdf)
To reduce further, we're expected to remove predictors with low variance by performing an ANOVA test on each predictor and removing predictors with a p-value >= 0.01. I think I did it right using the below code:
scores <- as.data.frame(apply(newdf[,-ncol(newdf)],2, anovaScores, newdf$class))
scores
new_scores <- scores[scores<0.01]
I'm not sure why, but I can't confirm my results using ncol(), colnames(), or anything similar. Using length(new_scores) gives 2084, which is in the range of reduced predictors my professor is expecting. But I need someone to confirm whether this was the right way to go about it. And if so, why am I not able to split my data into training and testing datasets?
When I try that, I get this error:
Error in new_scores$class : $ operator is invalid for atomic vectors.
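To narrow this down, here is a minimal base R sketch of what I suspect is happening. The toy `scores` data frame, its `p_value` column name, and the `keep` vector are made up for illustration and are not my real data; the assumption is that my real `scores` is also a one-column data frame with one row per predictor.

```r
# Toy stand-in for `scores`: one column of p-values, one row per predictor
# (assumption: the real object has the same shape).
scores <- data.frame(p_value = c(0.001, 0.20, 0.005),
                     row.names = c("X", "Y", "Z"))

# Same indexing as in my code: a logical matrix inside single brackets.
new_scores <- scores[scores < 0.01]

is.atomic(new_scores)  # the result is a plain numeric vector, not a data frame
names(new_scores)      # NULL, so the predictor names are gone

# `$` on that vector fails the same way as in my real code.
err <- tryCatch(new_scores$class, error = function(e) conditionMessage(e))
print(err)

# Possible alternative (hypothetical): keep the names of the predictors that pass.
keep <- rownames(scores)[scores$p_value < 0.01]
print(keep)
```

If that is right, new_scores only holds the p-values that passed the cutoff, with no protein measurements and no class column in it.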
This is how I'm splitting the data into training and testing datasets:
intrain <- createDataPartition(y = new_scores$class ,p = 0.8,list = FALSE) #split data
assign("training", new_scores[intrain,] )
assign("testing", new_scores[-intrain,] )
The problem is in the createDataPartition line, but I'm not sure whether something I did in the prior steps is incorrect or I'm missing something.
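For comparison, this is roughly the shape I think the split needs: a data frame that still has the selected predictor columns plus class. The sketch below uses only base R with a simple stratified sample standing in for createDataPartition, on made-up data; `toy`, `keep`, and `filtered` are hypothetical names and I have not verified this against my real dataset.

```r
set.seed(1)

# Made-up data: two predictors and a two-level class column.
toy <- data.frame(X = rnorm(20),
                  Y = rnorm(20),
                  class = rep(c("A", "B"), each = 10))

# Hypothetical vector of predictor names that passed the p-value filter.
keep <- c("X")

# Subset the data frame itself, so it stays a data frame and keeps `class`.
filtered <- toy[, c(keep, "class"), drop = FALSE]

# Stratified 80/20 split in base R (stand-in for createDataPartition):
# sample 80% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(filtered)), filtered$class)
intrain <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

training <- filtered[intrain, , drop = FALSE]
testing  <- filtered[-intrain, , drop = FALSE]

dim(training)
dim(testing)
```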
I'm not sure how to provide reproducible data, but below is a snippet of the data, with the last column being the response variable (class) and the rest all predictors:
X Y Z A B C class
3 4.5 3 4 8 10.1 A
9 6 2.5 6 4 4 B
4 3.8 4 9 6 8.2 B
6 7.1 6 7 4 8 A
4 5.6 9 5 3 7.5 A
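In case it helps, the snippet above can be rebuilt as a data frame like this. The name `snippet` is just a placeholder, and the real data has far more rows and columns.

```r
# The five rows shown above, typed out as a data frame.
snippet <- data.frame(
  X = c(3, 9, 4, 6, 4),
  Y = c(4.5, 6, 3.8, 7.1, 5.6),
  Z = c(3, 2.5, 4, 6, 9),
  A = c(4, 6, 9, 7, 5),
  B = c(8, 4, 6, 4, 3),
  C = c(10.1, 4, 8.2, 8, 7.5),
  class = c("A", "B", "B", "A", "A")
)

str(snippet)
```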