
I have a data.frame with approximately 20,000 columns. From this data.frame I want to remove the columns for which the following vector has a value of 1.

u.snp <- apply(an[25:19505], 2, mean)

I am sure there must be a straightforward way to accomplish this, but I can't see it right now. Any hints would be greatly appreciated. Thanks.

Update: Thanks for your help. Now I tried the following:

cm <- colMeans(an.mdr[25:19505])
tail(sort(cm), n=40)

Using the tail function, I see that 22 of the 19481 columns examined in an.mdr have a mean of 1. Next, I removed these columns using the suggested code.

an.mdr.s <- an.mdr
an.mdr.s[colMeans(an.mdr.s[25:19505])==1] <- NULL

As anticipated, an.mdr.s has 22 fewer columns than an.mdr. But when I calculate the column means for all but the first 24 columns, I again find 22 columns with a column mean of 1 in an.mdr.s.

cmm <- colMeans(an.mdr.s[25:19483])
tail(sort(cmm), n=40)

Honestly, I cannot see what is going on here right now.

user102546
  • You want to remove all columns whose mean is 1, right? – YOLO Jul 22 '18 at 17:43
  • yes, exactly... – user102546 Jul 22 '18 at 17:45
  • If you feel an answer solved the problem, please mark it as 'accepted' by clicking the green check mark. This helps keep the focus on older SO questions that still don't have answers. – Vlad C. Jul 22 '18 at 18:42
  • @user102546 Not sure why you mentioned `an[25:19505]` in your question. If you want to remove every column with a mean of `1`, it would be better to modify your question a bit so that it matches the answer. Thanks. – MKR Jul 22 '18 at 20:17

2 Answers


That should be quite easily accomplished with the following command:

df[colMeans(df)==1] <- NULL
Vlad C.
  • Thanks for the help. I have encountered another problem after removing the columns as suggested (see my edited post above). Can you spot the error I have made? Thanks. – user102546 Jul 23 '18 at 18:36
  • I see that you used `an.mdr.s[colMeans(an.mdr.s[25:19505])==1] <- NULL` rather than `an.mdr.s[colMeans(an.mdr.s)==1] <- NULL`. Is your goal to preserve the first 24 columns? Also, do you have 19505 columns in your dataset? – Vlad C. Jul 23 '18 at 18:55
  • If the above is true and you would indeed like to preserve the columns up to column #24 and column #19506 and the ones after it, and remove the columns in between with mean 1, you can try `sel.col <- an.mdr.s %>% colMeans %>% equals(1) %>% inset(c(1:24, 19506:ncol(an.mdr.s)), T)`. This would create a vector of length `ncol(an.mdr.s)` containing the columns with mean 1 as `TRUE` and the other ones as `FALSE`; it then forces the ones up to 24 and after 19506 to `TRUE`. Using that vector, you can then do `an.mdr.s[sel.col] <- NULL`. – Vlad C. Jul 23 '18 at 19:18
  • I want to preserve the first 24 columns regardless of their values, as they contain metadata and are not used as independent variables in the logistic regression models. The dataset has 19505 columns in total. – user102546 Jul 23 '18 at 20:03
  • OK. Does `sel.col <- an.mdr.s %>% colMeans %>% equals(1) %>% inset(c(1:24), F)` followed by `an.mdr.s[sel.col] <- NULL` accomplish your goal? Note that you need `library(magrittr)` for this. – Vlad C. Jul 23 '18 at 20:48
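
The problem discussed in the comments above comes down to index alignment: a logical vector computed from `an.mdr.s[25:19505]` describes columns 25 onward, but when it is used to index the full data frame, its first element is matched against column 1. A minimal base R sketch of one way to keep the two aligned is shown below. The toy data frame, its column names, and `n_meta` are hypothetical stand-ins for the real data (where `n_meta` would be 24), and the sketch assumes all non-metadata columns are numeric.

```r
# Toy data: 2 metadata columns followed by 4 numeric columns (hypothetical).
df <- data.frame(
  id    = c("a", "b", "c"),
  group = c("x", "y", "x"),
  v1    = c(1, 1, 1),   # mean 1 -> should be removed
  v2    = c(0, 1, 2),   # mean 1 -> should be removed
  v3    = c(0, 0, 1),   # mean 1/3 -> should be kept
  v4    = c(1, 1, 1),   # mean 1 -> should be removed
  stringsAsFactors = FALSE
)
n_meta <- 2  # number of leading metadata columns to keep untouched

# Means are computed on the numeric block only, so element k of this
# vector describes column n_meta + k of df, not column k.
is_one <- colMeans(df[(n_meta + 1):ncol(df)]) == 1

# Build a keep-mask with one entry per column of df: always keep the
# metadata columns, and keep a numeric column only if its mean is not 1.
keep <- c(rep(TRUE, n_meta), !is_one)
stopifnot(length(keep) == ncol(df))

df_clean <- df[keep]
names(df_clean)
```

Padding the mask to the full width of the data frame means the index and the data frame always have the same length, so no offset arithmetic is needed when dropping the columns.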

You can do this in two simple steps (df is your data frame):

# step 1 - calculate mean for all columns and filter with mean = 1
remove_columns <- sapply(df, mean)
remove_columns <- names(remove_columns[remove_columns == 1])

# alternative using Filter (for reference)
## remove_columns <- names(Filter(function(x) x == 1, sapply(df, mean)))

# step 2 - remove them
# drop = FALSE keeps the result a data frame even if only one column remains
df_new <- df[, setdiff(names(df), remove_columns), drop = FALSE]
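
Because the steps above select columns by name rather than by position, they can be restricted to the non-metadata columns without any offset bookkeeping. The following is a sketch under assumptions, not a definitive version: the toy data frame and `n_meta` (the number of leading columns to protect, 24 in the question) are hypothetical.

```r
# Toy data: 1 metadata column followed by 3 numeric columns (hypothetical).
df <- data.frame(
  id = c("a", "b"),
  v1 = c(1, 1),   # mean 1   -> removed
  v2 = c(0, 1),   # mean 0.5 -> kept
  v3 = c(2, 0),   # mean 1   -> removed
  stringsAsFactors = FALSE
)
n_meta <- 1  # number of leading metadata columns to protect

# Only columns after the metadata block are candidates for removal.
candidates <- names(df)[seq_along(df) > n_meta]

# Compute the means of the candidates and collect the names to drop.
col_means <- sapply(df[candidates], mean)
remove_columns <- candidates[col_means == 1]

# Selecting by name leaves the metadata columns in place.
df_new <- df[, setdiff(names(df), remove_columns), drop = FALSE]
names(df_new)
```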
YOLO