I have a very long dataset and a relatively short list of ID values for which my data is wrong. The following works, but my wrong_IDs vector is actually much larger:

wrong_IDs <- c('A1', 'B3', 'B7', 'Z31')
df$var1[df$var2 == 'A1' | df$var2 == 'B3' | df$var2 == 'B7' | df$var2 == 'Z31'] <- 0L

This looks very basic, but I haven't found a compact way of writing it. Thanks for any help.

Antonio

2 Answers

You can compare your data to wrong_IDs with the %in% operator:

df <- data.frame("var1" = 101:120, "var2" = c(1:20))
wrong_ids <- c(3, 5, 7)
df$var1[df$var2 %in% wrong_ids] <- 0

Here df$var2 %in% wrong_ids returns a logical (TRUE/FALSE) vector with one element per row, so the "set to zero" assignment applies only to the selected rows (here rows 3, 5 and 7).
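The same pattern carries over to character IDs like those in the question. The following is a minimal sketch with made-up data (the six rows and their values are assumptions for illustration, not the asker's dataset):

```r
# Hypothetical data mirroring the question: character IDs in var2
df <- data.frame(
  var1 = 101:106,
  var2 = c("A1", "B2", "B3", "B7", "C4", "Z31"),
  stringsAsFactors = FALSE
)
wrong_IDs <- c("A1", "B3", "B7", "Z31")

# Logical vector: TRUE where var2 is one of the wrong IDs
is_wrong <- df$var2 %in% wrong_IDs

# Overwrite var1 only in the flagged rows
df$var1[is_wrong] <- 0L
```

Because the comparison is against the whole vector, the line stays the same length however many entries wrong_IDs holds.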

Stéphane V

Here's a very compact solution using grepl and regex:

Some illustrative data:

set.seed(123)
df <- data.frame(
  ID = paste0(rep(LETTERS[1:3], 2), sample(1:3, 6, replace = TRUE)),
  Var2 = rnorm(6),
  stringsAsFactors = FALSE)
df

wrong_IDs <- c('A1', 'B3', 'B1', 'C3')

To set the matching IDs to 0, collapse the wrong_IDs values into a single string separated by the regex alternation operator |, then have grepl match these alternative patterns against df$ID:

df$ID <- ifelse(grepl(paste0(wrong_IDs, collapse = "|"), df$ID), 0, df$ID)
df
  ID        Var2
1  0  0.07050839
2  0  0.12928774
3 C2  1.71506499
4 A3  0.46091621
5  0 -1.26506123
6 C1 -0.68685285
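
One thing to keep in mind with the pattern above: grepl matches substrings, so an unanchored "A1" would also flag an ID such as "A10". A rough sketch of the difference, using made-up IDs (ids, loose, strict and exact are illustrative names) and assuming the IDs contain no regex metacharacters:

```r
ids <- c("A1", "A10", "B3", "XB3", "C2")   # made-up IDs for illustration
wrong_IDs <- c("A1", "B3")

# Unanchored alternation "A1|B3": also flags "A10" and "XB3"
loose <- grepl(paste0(wrong_IDs, collapse = "|"), ids)

# Anchored alternation "^(A1|B3)$": matches whole IDs only
strict <- grepl(paste0("^(", paste0(wrong_IDs, collapse = "|"), ")$"), ids)

# For literal IDs, %in% gives the same result as the anchored pattern
exact <- ids %in% wrong_IDs
```

If the IDs are plain literals, exact matching is probably the safer default; the regex form earns its keep when the wrong IDs are genuinely patterns.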
Chris Ruehlemann
  • Thank you, but this goes beyond my needs and it is not entirely clear to me. – Antonio May 05 '20 at 14:40
  • If you do `paste0(wrong_IDs, collapse = "|")`, the result is `"A1|B3|B1|C3"`. The function `grepl` then tries to match either `A1` or `B3` or `B1` or `C3` in the `ID` column. It returns `TRUE` for each row where it finds a match and `FALSE` where it does not. Does this help clarify the code? – Chris Ruehlemann May 05 '20 at 14:44
  • Yes, this is now clear, thank you. Does this mean that `df[grepl(paste0(wrong_IDs, collapse = "|"), df$ID),] <- 0L` would allow me to correct exactly those cells? I don't mean to be rude, I'm honestly wondering if this isn't unnecessarily complicated compared to the other solution. – Antonio May 05 '20 at 14:50
  • How to assess this solution is entirely up to you ;) I've updated it anyway. – Chris Ruehlemann May 05 '20 at 14:55
  • Do feel free to accept the other solution which suits your needs better. But maybe this solution teaches you something new and may be useful for future queries. – Chris Ruehlemann May 05 '20 at 14:58
  • It surely helps with understanding grepl and the ifelse statement. Thank you :) – Antonio May 05 '20 at 14:59