I'm doing initial data cleanup on a dataframe with 34,000 columns, and to do that I have to remove every column whose max value is less than 2.

I don't know how to remove the columns whose max value is less than 2, but to at least get the max values I tried the function below, at first without converting the data with is.numeric:

protein <- is.numeric(protein)
#a: 
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colMax(protein)

I got the "max not meaningful for factors" error, which is why I used the is.numeric function to convert all the data to numeric form. Despite doing that, I am still not getting the desired result: when I run the function I get 0 rather than a list of max values, one per column.

Why am I getting 0 from my max function? How do I set up a function that generates the max value of each column and removes any column whose max value is less than 2? Would I need two separate functions?

thelatemail
Heena

2 Answers

You were nearly there.

Since you don't provide reproducible sample data, let's first create some minimal sample data:

df <- as.data.frame(matrix(rep(1:10, each = 10), ncol = 10))
df
#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1   1  2  3  4  5  6  7  8  9  10
#2   1  2  3  4  5  6  7  8  9  10
#3   1  2  3  4  5  6  7  8  9  10
#4   1  2  3  4  5  6  7  8  9  10
#5   1  2  3  4  5  6  7  8  9  10
#6   1  2  3  4  5  6  7  8  9  10
#7   1  2  3  4  5  6  7  8  9  10
#8   1  2  3  4  5  6  7  8  9  10
#9   1  2  3  4  5  6  7  8  9  10
#10  1  2  3  4  5  6  7  8  9  10

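We now would like to keep only those columns where the max value is > 2; we can do this using sapply. (To match "remove columns whose max value is less than 2" exactly, use >= 2 instead, which would also keep V2.)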

df[sapply(df, function(x) max(x, na.rm = TRUE) > 2)]
#   V3 V4 V5 V6 V7 V8 V9 V10
#1   3  4  5  6  7  8  9  10
#2   3  4  5  6  7  8  9  10
#3   3  4  5  6  7  8  9  10
#4   3  4  5  6  7  8  9  10
#5   3  4  5  6  7  8  9  10
#6   3  4  5  6  7  8  9  10
#7   3  4  5  6  7  8  9  10
#8   3  4  5  6  7  8  9  10
#9   3  4  5  6  7  8  9  10
#10  3  4  5  6  7  8  9  10

Explanation: sapply loops over the columns of the data.frame df and returns a logical vector (with as many entries as there are columns in df).
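To make the intermediate step visible, here is a minimal sketch that stores the logical vector before subsetting. The names `toy`, `keep` and `result` and the sample values are made up for illustration; they are not from the question.

```r
# Hypothetical toy data: columns a and c never exceed 2, column b does
toy <- data.frame(a = c(0, 1, 1), b = c(1, 5, 2), c = c(2, 2, NA))

# One TRUE/FALSE per column: is the column maximum greater than 2?
keep <- sapply(toy, function(x) max(x, na.rm = TRUE) > 2)

# Subsetting with a logical vector keeps the TRUE columns and
# returns a data.frame even when only one column survives
result <- toy[keep]

print(keep)
print(result)
```

Because `toy[keep]` uses list-style subsetting (no comma), the result stays a data.frame, which matters when the filter leaves a single column.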


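Or we can compare the whole data.frame at once and use apply

df[apply(df > 2, 2, any, na.rm = TRUE)]

giving the same result. The difference to the first method is that df > 2 returns a logical matrix on which we operate column-wise with apply(..., MARGIN = 2, ...). Note that pmax is not needed here: called with a single data.frame it returns that data.frame unchanged. Also note that we need any rather than all, because all would test whether every value in a column exceeds 2 (i.e. min > 2), which only coincides with max > 2 when columns are constant, as in this sample data.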

Maurits Evers
  • Thanks, my datatype issue still exists though. Running the above code after converting my data with is.numeric gives me logical(0) as the result. Am I doing something wrong with my data conversion? – Heena Sep 15 '19 at 22:10
  • @Heena - `as.numeric` doesn't work on the whole dataset all at once. Something like `protein[] <- lapply(protein, as.numeric)` will be needed to overwrite the original data. – thelatemail Sep 15 '19 at 22:14
  • It's difficult to help without representative sample data, so I need to speculate here. Take a look at my example. `data` needs to be a `data.frame`, not a column. Perhaps your conversion issue stems from the fact that you're trying to convert a `factor` to a `numeric` vector, in which case try `df$your_column <- as.numeric(as.character(df$your_column))`. – Maurits Evers Sep 15 '19 at 22:15
  • @Heena I've made an edit to my post to include some further explanations. Please take a look. – Maurits Evers Sep 15 '19 at 22:38
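A minimal sketch of what the comments above describe: why the original code returned 0, and how factor columns can be converted. The `protein` data here is a made-up stand-in for the real dataset, and `flag`, `zero`, `is_fac` and `col_max` are illustrative names.

```r
# Hypothetical stand-in for the real data: one factor column, one numeric
protein <- data.frame(p1 = factor(c("0.5", "3", "1")), p2 = c(1, 1.5, 0))

# is.numeric() only asks a question about the whole object; it converts nothing.
# A data.frame is not numeric, so the answer is a single FALSE.
flag <- is.numeric(protein)

# Taking the max of that single FALSE gives 0, which is the 0 seen in the question
zero <- sapply(flag, max, na.rm = TRUE)

# Convert only the factor columns, going through as.character() so that
# the labels are converted rather than the internal integer codes
is_fac <- sapply(protein, is.factor)
protein[is_fac] <- lapply(protein[is_fac],
                          function(x) as.numeric(as.character(x)))

# Now the per-column maxima are meaningful
col_max <- sapply(protein, max, na.rm = TRUE)
print(col_max)
```

Calling `as.numeric()` directly on a factor would return the level codes (1, 2, 3, ...) rather than the values, which is why the detour through `as.character()` is used.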

Here is another way, using dplyr, to select the columns whose max value is greater than or equal to 2. This assumes that we want to test all the columns and that all of them are of class factor. Using @Maurits' data:

library(dplyr)

df %>%
  # Convert every column from factor to numeric
  mutate_all(~as.numeric(as.character(.))) %>%
  # Select the columns whose max value is greater than or equal to 2
  select_if(~max(., na.rm = TRUE) >= 2)


#   V2 V3 V4 V5 V6 V7 V8 V9 V10
#1   2  3  4  5  6  7  8  9  10
#2   2  3  4  5  6  7  8  9  10
#3   2  3  4  5  6  7  8  9  10
#4   2  3  4  5  6  7  8  9  10
#5   2  3  4  5  6  7  8  9  10
#6   2  3  4  5  6  7  8  9  10
#7   2  3  4  5  6  7  8  9  10
#8   2  3  4  5  6  7  8  9  10
#9   2  3  4  5  6  7  8  9  10
#10  2  3  4  5  6  7  8  9  10

Instead of max, we can also use any

df %>%
  mutate_all(~as.numeric(as.character(.))) %>% 
  select_if(~any(. >= 2, na.rm = TRUE))

You say that you have 34000 columns. Do you want to check the greater-than-or-equal-to-2 condition for all the columns? Are all the columns factors? The above code checks all the columns and selects the ones that satisfy the condition. If you want to do this on selected columns only, you would need to subset the data to those columns first and then apply the code.
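As a rough base R sketch of the "selected columns only" case: the data below, the `id` column, and the names `dat`, `check`, `low` and `dat_kept` are all hypothetical, and the rule "test everything except `id`" is an assumption made for the example.

```r
# Hypothetical data: an ID column that must never be dropped, plus measurements
dat <- data.frame(id = c("s1", "s2", "s3"),
                  x  = c(0, 1, 1),
                  y  = c(1, 4, 2),
                  stringsAsFactors = FALSE)

# Columns to test (assumption: everything except `id`)
check <- setdiff(names(dat), "id")

# Names of the tested columns whose maximum is below 2
low <- check[sapply(dat[check], function(x) max(x, na.rm = TRUE) < 2)]

# Drop only the failing columns; untested columns are kept as they are
dat_kept <- dat[setdiff(names(dat), low)]
print(dat_kept)
```

Working with column names rather than positions keeps the untested columns in place and avoids index bookkeeping when columns are removed.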


In base R, we can also use colSums after converting the data from factor to numeric

df[] <- lapply(df, function(x) as.numeric(as.character(x)))
df[colSums(df >= 2, na.rm = TRUE) > 0]
Ronak Shah
  • No, I'm missing a 6 in my number of columns; it's 364000 columns. Some are factors, some aren't. Using lapply works, but with so much data it's taking too long to process. Would the above methods be any faster? – Heena Sep 16 '19 at 00:53
  • @Heena I haven't tested it but `colSums` approach should be faster. – Ronak Shah Sep 16 '19 at 01:08
  • Thank you, I realized I was missing the parameter `stringsAsFactors = FALSE`, and that was causing the huge delay in processing. Once that was fixed it all worked out. – Heena Sep 17 '19 at 22:56
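For readers landing here, a minimal sketch of the fix mentioned in the last comment, assuming the data is read with `read.csv`. The thread does not say which import function was actually used, and the CSV content and the names `csv_text`, `num_cols`, `keep` and `filtered` are made up.

```r
# Hypothetical CSV content standing in for the real file
csv_text <- "p1,p2,label\n0.5,3,a\n1,4,b\n1.5,0,c"

# stringsAsFactors = FALSE keeps text columns as character instead of factor.
# It has been the default since R 4.0.0; earlier versions defaulted to TRUE.
protein <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Test only the columns that were parsed as numbers
num_cols <- sapply(protein, is.numeric)
keep <- sapply(protein[num_cols], function(x) max(x, na.rm = TRUE) >= 2)

# Keep the numeric columns whose maximum is at least 2
filtered <- protein[names(keep)[keep]]
print(filtered)
```

With no factor columns created at import, the column-by-column `as.numeric(as.character(...))` pass is no longer needed for columns that parse cleanly as numbers.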