How can I filter() rows containing text in a column comprised of text and numbers in dplyr?

Question

I have a rather large dataset named e. One continuous covariate e$rad.dose should range 0 - 60, however, I observed that four rows contain text and not numbers.

Question: how can I apply dplyr to remove these four rows?

I know several baseR solutions, but I am trying to improve my dplyr.

> table(e$rad.dose)

                    0         12       12,5         14         15         16       21,6 
       156       3291          4          1          1          6          2          1 
        22         24         25         26       27,5         28       33,3         35 
         1         14          7          1          1          7          1          1 
        36         45         48       49,4        5,4         50       50,4         52 
         1          2          1          1         17         12          9          9 
      52,2       53,2      53,24         54       54,4       54,6         55       55,5 
         1          1          1         94          1          1          1          1 
      55,8         56       56,7         57       57,6         58       59,4         60 
         1         14          1          1          2          3         26         41 
      60,9         64         68 gammaknife         GK 
         1          1          1          2          2

I have tried

filter(simpson %in% 1:3, age>=18, rad.dose!= c("gammaknife","GK"))

But two rows remain

> table(e$rad.dose)

                    0         12       12,5         14         15         16       21,6 
        32       2276          0          0          0          0          0          0 
        22         24         25         26       27,5         28       33,3         35 
         0          7          0          0          0          0          0          1 
        36         45         48       49,4        5,4         50       50,4         52 
         0          1          0          0          5          3          1          9 
      52,2       53,2      53,24         54       54,4       54,6         55       55,5 
         0          0          1         21          0          0          0          0 
      55,8         56       56,7         57       57,6         58       59,4         60 
         0          4          0          0          0          1          7         19 
      60,9         64         68 gammaknife         GK 
         0          1          0          1          1

I also tried str_detect but that did not solve it - or, at least, I have applied it wrongly:

filter(simpson %in% 1:3, age>=18, str_detect(rad.dose, c("gammaknife","GK")==FALSE))

Thank you in advance.

in your filter your data.frame should be your first input, `filter(e,....)` — Mike, Mar 18 '20 at 12:55
you can also use `filter(simpson %in% 1:3 & age>=18 & !(rad.dose %in% c("gammaknife","GK")))` — akash87, Mar 18 '20 at 13:07

JBGruber · Answer 1 · 2020-03-18T13:25:23.397

Let's grab some reproducible data first:

e <- data.frame(rad.dose = c(rnorm(10), LETTERS))
e$rad.dose <- gsub(".", ",", e$rad.dose, fixed = TRUE)
e
#>              rad.dose
#> 1  -0,713359526629519
#> 2    1,17199694030257
#> 3  -0,255927297100446
#> 4   0,904135913625966
#> 5    0,44009842543169
#> 6    2,05720978460597
#> 7  0,0285732465139118
#> 8  -0,030136802990553
#> 9  -0,291809792426389
#> 10   0,30677039655244
#> 11                  A
#> 12                  B
#> 13                  C
#> 14                  D
#> 15                  E
#> 16                  F
#> 17                  G
#> 18                  H
#> 19                  I
#> 20                  J
#> 21                  K
#> 22                  L
#> 23                  M
#> 24                  N
#> 25                  O
#> 26                  P
#> 27                  Q
#> 28                  R
#> 29                  S
#> 30                  T
#> 31                  U
#> 32                  V
#> 33                  W
#> 34                  X
#> 35                  Y
#> 36                  Z

As far as I can see, you have three problems with the rad.dose column: you have text in your column, you have commas in the numbers and your numeric column is formatted as character (the whole thing, otherwise you wouldn't be able to store text in it). I would correct this first before continuing

library(dplyr)
library(stringr)
e %>% 
  filter(str_detect(rad.dose, "[0-9,]+")) %>%                    # only allow values which consist entirly of numbers and comma
  mutate(rad.dose = as.numeric(str_replace(rad.dose, ",", "."))) # replace comma with dot and transform as numeric
#>       rad.dose
#> 1  -0.71335953
#> 2   1.17199694
#> 3  -0.25592730
#> 4   0.90413591
#> 5   0.44009843
#> 6   2.05720978
#> 7   0.02857325
#> 8  -0.03013680
#> 9  -0.29180979
#> 10  0.30677040

Now you can use filter(simpson %in% 1:3, age>=18) on this data.

score 1 · Accepted Answer · answered Mar 18 '20 at 13:19

1

Keep numbers as numeric:

e %>% 
  mutate(
    rad.dose.clean = as.numeric(sub(",", ".", rad.dose, fixed = TRUE))) %>% 
  filter(simpson %in% 1:3, age >= 18, !is.na(rad.dose.clean))

answered Mar 18 '20 at 13:19

zx8754

52,746
12
114
209

1

@cmirian [see related post: df %>% drop_na](https://stackoverflow.com/q/22353633/680068) – zx8754 Mar 18 '20 at 14:36

score 1 · Answer 3 · answered Mar 18 '20 at 13:28

Exclude rows where rad.dose contains non-numeric characters (and comma) works, but is not perfect.

dplyr::filter(e, !grepl('[^0-9,-]', rad.dose))

Above line still does not filter out '---,--'

Other option: replace commas by decimal, and see if it is.numeric

e %>%
  mutate(rad.dose_numeric = as.numeric(gsub(',','.', rad.dose))) %>%  # extra column
  dplyr::filter(!is.na(rad.dose_numeric)) %>%                         # filter out NAs
  select(-rad.dosenumeric)                                            # remove col

How can I filter() rows containing text in a column comprised of text and numbers in dplyr?

3 Answers3