1

I have a very messy data frame consisting of factor columns with numbers and characters. I need to filter the rows with numeric values above a threshold. However, this is a problem because my columns are factors that cannot be turned to numeric, due to the presence of characters in them.

DF <- data.frame(
  Col1 = c("Egg", "", "3"),
  Col2 = c("", "Flour", ""),
  Col3 = c("2", "", "Bread"),
  Col4 = c("4", "", ""),
  Col5 = c("", "6", "8")
)

The resulting data frame looks like this:

> DF
  Col1  Col2  Col3 Col4 Col5
1  Egg           2    4     
2      Flour               6
3    3       Bread         8

Where each column is a factor:

> class(DF$Col1)
[1] "factor"
> 

In this example, how do I filter rows with numeric values above, say, 5 in at least one column? The desired output in this example, looks like this:

> DF
  Col1  Col2  Col3 Col4 Col5
2      Flour               6
3    3       Bread         8
NBK
  • 887
  • 9
  • 20

3 Answers3

3

You'll get some warnings from dplyr but this works as well:

library(dplyr)

DF %>%
  mutate_all(as.character) %>%
  filter_all(any_vars(if_else(is.na(as.numeric(.)), FALSE, as.numeric(.) > 5)))

  Col1  Col2  Col3 Col4 Col5
1      Flour               6
2    3       Bread         8

Per @Frank's suggestion (a bit cleaner than above):

DF %>%
  filter_all(any_vars(as.numeric(as.character(.)) > 5))

  Col1  Col2  Col3 Col4 Col5
1      Flour               6
2    3       Bread         8
zack
  • 5,205
  • 1
  • 19
  • 25
  • 1
    Or, if the OP for some reason wants them kept as factors: `DF %>% filter_all(any_vars(as.numeric(as.character(.)) > 5))` -- it seems filter will skip rows for which any_vars returns NA. – Frank Jul 06 '18 at 18:24
3

One can pick out only numeric values using gsub from each observation and convert it to numeric. Afterwards, in base-R subset with apply can provide a solution as:

subset(DF, apply(DF, 1, function(x){
                #Get only numeric values and convert to numeric
                val <- as.numeric(gsub("[^[:digit:]]", "",x))
                any(val[!is.na(val)] > 5)
                 })
       )

#   Col1  Col2  Col3 Col4 Col5
# 2      Flour               6
# 3    3       Bread         8
MKR
  • 19,739
  • 4
  • 23
  • 33
0

One way this can be done is:

DF[do.call(function(...) pmax(..., na.rm=TRUE), data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE)) > 5,]

To explain what this is doing, the lapply(DF, as.character) is removing the factors, then lapply(lapply(DF, as.character), as.numeric) is converting the characters to numbers (the text becomes NA), and then data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE) changes it back to a dataframe, e.g.

> data.frame(lapply(lapply(DF, as.character), as.numeric), stringsAsFactors = FALSE)
  Col1 Col2 Col3 Col4 Col5
1   NA   NA    2    4   NA
2   NA   NA   NA   NA    6
3    3   NA   NA   NA    8

The do.call with pmax finds the row maximum (thanks rowwise maximum for R) and then we can easily filter for a maximum value above 5.

Kerry Jackson
  • 1,821
  • 12
  • 20