
I have a dataset with a column Disease that contains string values. I also have a list of names with rare diseases rare_disease.

I want to check, for each cell of the column Disease, whether it contains an element from the list rare_disease and if so, to create a new column in my dataframe and give the value 1 to that entry.

I tried using the ifelse function, like so:

FinalData$RareDisease <- ifelse(rare_disease %in% FinalData$Disease,1,0)

But I guess that checks whether the corresponding rows in both variables are the same, so it throws an error. Instead, I want every cell of Disease to be checked against every single element of rare_disease, if that makes sense.

I have also tried match() and is.element(), as suggested in "Test if a vector contains a given element", but they don't work either.

Floran Thovan
  • Shouldn't it be `ifelse(FinalData$Disease %in% rare_disease, 1, 0)`? – Esben Eickhardt Apr 30 '19 at 13:25
  • I think you need to flip it: `FinalData$RareDisease <- ifelse(FinalData$Disease %in% rare_disease, 1, 0)` – zx8754 Apr 30 '19 at 13:25
  • From your previous post, I think you want a regex match, not an exact match: https://stackoverflow.com/q/55907952/680068. Please provide example input and expected output. – zx8754 Apr 30 '19 at 13:27
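
If the cells hold longer strings and a partial match is wanted, as the last comment suggests, one possible sketch uses `grepl`. The data below is made up for illustration, and the approach assumes the names in `rare_disease` contain no regex metacharacters (otherwise they would need escaping).

```r
# Hypothetical data: each cell may mention a disease inside a longer string
FinalData <- data.frame(
  Disease = c("suspected kuru", "common cold", "Progeria (confirmed)"),
  stringsAsFactors = FALSE
)
rare_disease <- c("kuru", "progeria")

# Join the names into one alternation pattern: "kuru|progeria"
pattern <- paste(rare_disease, collapse = "|")

# grepl() returns TRUE for every cell that contains any of the names;
# ignore.case = TRUE is a choice made here, not something the question asked for
FinalData$RareDisease <- as.numeric(
  grepl(pattern, FinalData$Disease, ignore.case = TRUE)
)
FinalData
```

With `%in%`, by contrast, a cell only counts as a match when the whole string is identical to one of the names.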

2 Answers


You are almost right, but you should flip the operands so that the column being tested is on the left of %in%:

FinalData$RareDisease <- ifelse(FinalData$Disease %in% rare_disease, 1, 0)
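
To see why the order of the operands matters, here is a small sketch with made-up data (the disease names are hypothetical, not taken from the question). `x %in% table` returns one logical value per element of `x`, so the data frame column has to be on the left for the result to line up with the rows.

```r
# Hypothetical toy data, only for illustration
FinalData <- data.frame(
  Disease = c("flu", "progeria", "cold", "kuru"),
  stringsAsFactors = FALSE
)
rare_disease <- c("kuru", "progeria", "alkaptonuria")

# One result per element of rare_disease (length 3), so it does not
# line up with the 4 rows of FinalData
wrong_way <- rare_disease %in% FinalData$Disease
length(wrong_way)

# One result per row of FinalData (length 4), which is what the new column needs
FinalData$RareDisease <- ifelse(FinalData$Disease %in% rare_disease, 1, 0)
FinalData
```

Assigning the length-3 vector to a 4-row data frame is what fails in the original attempt, because the replacement does not have one value per row.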
Esben Eickhardt

Here's a reproducible example/solution -- noting you can just use as.numeric instead of ifelse:

df <- data.frame(
  idx = 1:10,
  Disease = letters[1:10]
)
rare_disease <- letters[c(1, 5, 9)]

df
#>    idx Disease
#> 1    1       a
#> 2    2       b
#> 3    3       c
#> 4    4       d
#> 5    5       e
#> 6    6       f
#> 7    7       g
#> 8    8       h
#> 9    9       i
#> 10  10       j
rare_disease
#> [1] "a" "e" "i"

df$RareDisease <- as.numeric(df$Disease %in% rare_disease)
df
#>    idx Disease RareDisease
#> 1    1       a           1
#> 2    2       b           0
#> 3    3       c           0
#> 4    4       d           0
#> 5    5       e           1
#> 6    6       f           0
#> 7    7       g           0
#> 8    8       h           0
#> 9    9       i           1
#> 10  10       j           0

Created on 2019-04-30 by the reprex package (v0.2.1)

JasonAizkalns
  • Thanks, that still gives all 0s in my data so I guess there's probably a problem in the formatting, I'll have a look. That should still work if `rare_disease` is a dataframe, right? – Floran Thovan Apr 30 '19 at 13:45
  • @FloranThovan The above should work if `rare_disease` is a vector. If `rare_disease` is a `data.frame`, you would need to do: `(df$Disease %in% df_rare_disease$rare_disease)`. This is why creating a reproducible example is preferred. – JasonAizkalns Apr 30 '19 at 13:49
  • Ah, got it. Thanks again for the help. – Floran Thovan Apr 30 '19 at 14:03
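
For the follow-up case in the comments above, here is a sketch that assumes the rare diseases sit in a column of a second data frame (named `df_rare_disease$rare_disease`, as in the comment). It also assumes the all-zero result comes from stray whitespace or case differences, which is only a guess about the asker's data. The helper `clean` is a hypothetical name introduced here.

```r
# Hypothetical data with inconsistent whitespace and capitalisation
df <- data.frame(
  Disease = c(" Kuru", "cold ", "progeria"),
  stringsAsFactors = FALSE
)
df_rare_disease <- data.frame(
  rare_disease = c("kuru", "Progeria "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim surrounding whitespace and lower-case the strings
clean <- function(x) tolower(trimws(x))

# Select the column explicitly so that %in% compares against a vector,
# and normalise both sides before comparing
df$RareDisease <- as.numeric(
  clean(df$Disease) %in% clean(df_rare_disease$rare_disease)
)
df
```

Without the `clean()` step, none of these rows would match exactly, which gives the same kind of all-zero column described in the comment.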