0

I have a rather large dataset named e. One continuous covariate e$rad.dose should range 0 - 60, however, I observed that four rows contain text and not numbers.

Question: how can I apply dplyr to remove these four rows?

I know several baseR solutions, but I am trying to improve my dplyr.

> table(e$rad.dose)

                    0         12       12,5         14         15         16       21,6 
       156       3291          4          1          1          6          2          1 
        22         24         25         26       27,5         28       33,3         35 
         1         14          7          1          1          7          1          1 
        36         45         48       49,4        5,4         50       50,4         52 
         1          2          1          1         17         12          9          9 
      52,2       53,2      53,24         54       54,4       54,6         55       55,5 
         1          1          1         94          1          1          1          1 
      55,8         56       56,7         57       57,6         58       59,4         60 
         1         14          1          1          2          3         26         41 
      60,9         64         68 gammaknife         GK 
         1          1          1          2          2 

I have tried

filter(simpson %in% 1:3, age>=18, rad.dose!= c("gammaknife","GK"))

But two rows remain

> table(e$rad.dose)

                    0         12       12,5         14         15         16       21,6 
        32       2276          0          0          0          0          0          0 
        22         24         25         26       27,5         28       33,3         35 
         0          7          0          0          0          0          0          1 
        36         45         48       49,4        5,4         50       50,4         52 
         0          1          0          0          5          3          1          9 
      52,2       53,2      53,24         54       54,4       54,6         55       55,5 
         0          0          1         21          0          0          0          0 
      55,8         56       56,7         57       57,6         58       59,4         60 
         0          4          0          0          0          1          7         19 
      60,9         64         68 gammaknife         GK 
         0          1          0          1          1 

I also tried str_detect but that did not solve it - or, at least, I have applied it wrongly:

filter(simpson %in% 1:3, age>=18, str_detect(rad.dose, c("gammaknife","GK")==FALSE)) 

Thank you in advance.

cmirian
  • 2,572
  • 3
  • 19
  • 59

3 Answers3

1

Let's grab some reproducible data first:

e <- data.frame(rad.dose = c(rnorm(10), LETTERS))
e$rad.dose <- gsub(".", ",", e$rad.dose, fixed = TRUE)
e
#>              rad.dose
#> 1  -0,713359526629519
#> 2    1,17199694030257
#> 3  -0,255927297100446
#> 4   0,904135913625966
#> 5    0,44009842543169
#> 6    2,05720978460597
#> 7  0,0285732465139118
#> 8  -0,030136802990553
#> 9  -0,291809792426389
#> 10   0,30677039655244
#> 11                  A
#> 12                  B
#> 13                  C
#> 14                  D
#> 15                  E
#> 16                  F
#> 17                  G
#> 18                  H
#> 19                  I
#> 20                  J
#> 21                  K
#> 22                  L
#> 23                  M
#> 24                  N
#> 25                  O
#> 26                  P
#> 27                  Q
#> 28                  R
#> 29                  S
#> 30                  T
#> 31                  U
#> 32                  V
#> 33                  W
#> 34                  X
#> 35                  Y
#> 36                  Z

As far as I can see, you have three problems with the rad.dose column: you have text in your column, you have commas in the numbers and your numeric column is formatted as character (the whole thing, otherwise you wouldn't be able to store text in it). I would correct this first before continuing

library(dplyr)
library(stringr)
e %>% 
  filter(str_detect(rad.dose, "[0-9,]+")) %>%                    # only allow values which consist entirly of numbers and comma
  mutate(rad.dose = as.numeric(str_replace(rad.dose, ",", "."))) # replace comma with dot and transform as numeric
#>       rad.dose
#> 1  -0.71335953
#> 2   1.17199694
#> 3  -0.25592730
#> 4   0.90413591
#> 5   0.44009843
#> 6   2.05720978
#> 7   0.02857325
#> 8  -0.03013680
#> 9  -0.29180979
#> 10  0.30677040

Now you can use filter(simpson %in% 1:3, age>=18) on this data.

JBGruber
  • 11,727
  • 1
  • 23
  • 45
1

Keep numbers as numeric:

e %>% 
  mutate(
    rad.dose.clean = as.numeric(sub(",", ".", rad.dose, fixed = TRUE))) %>% 
  filter(simpson %in% 1:3, age >= 18, !is.na(rad.dose.clean))
zx8754
  • 52,746
  • 12
  • 114
  • 209
1

Exclude rows where rad.dose contains non-numeric characters (and comma) works, but is not perfect.

dplyr::filter(e, !grepl('[^0-9,-]', rad.dose))

Above line still does not filter out '---,--'

Other option: replace commas by decimal, and see if it is.numeric

e %>%
  mutate(rad.dose_numeric = as.numeric(gsub(',','.', rad.dose))) %>%  # extra column
  dplyr::filter(!is.na(rad.dose_numeric)) %>%                         # filter out NAs
  select(-rad.dosenumeric)                                            # remove col