1

I have a dataframe with different groups (ID) of varying size. Within each group, I would like to randomly replace a specific percentage of values in the "value" column (let's say 30%) with NA's. Here is a simplified version of my data:

ID<-rep(c("X1","X2"),times=c(3,6))
value<-c(1,2,3,1,2,3,4,5,6)
df1 <- data.frame(ID,value)
df1
ID value
X1     1
X1     2
X1     3
X2     1
X2     2
X2     3
X2     4
X2     5
X2     6

Here is what I would like to have:

ID value
X1     1
X1     NA
X1     3
X2     1
X2     2
X2     NA
X2     4
X2     5
X2     NA

Any idea how I could do this? I have a preference for using tidyverse but if you have other options, that would also be appreciated!

akrun
Cam

3 Answers

2

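We can use dplyr. Grouped by 'ID', use sample to draw the row indices for 30% of the rows in each group (rounded up with ceiling, so 1 of 3 rows for X1 and 2 of 6 rows for X2), then pass those indices to replace to set the matching 'value' entries to NA.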

library(dplyr)
df1 %>%
    group_by(ID) %>%
    mutate(value = replace(value,
                           sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE),
                           NA))
# A tibble: 9 x 2
# Groups:   ID [2]
#  ID    value
#  <chr> <dbl>
#1 X1       NA
#2 X1        2
#3 X1        3
#4 X2       NA
#5 X2        2
#6 X2       NA
#7 X2        4
#8 X2        5
#9 X2        6
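
Because sample() draws at random, the rows that become NA differ on every run. Here is a minimal sketch of a reproducible variant with a per-group check of how many values were blanked. The seed value 123 and the names frac, out and check are arbitrary choices for illustration, not part of the answer above.

```r
library(dplyr)

set.seed(123)  # assumption: any fixed seed makes the draw repeatable

df1 <- data.frame(ID = rep(c("X1", "X2"), times = c(3, 6)),
                  value = c(1, 2, 3, 1, 2, 3, 4, 5, 6))

frac <- 0.3  # share of values to replace within each ID

out <- df1 %>%
  group_by(ID) %>%
  # sample(n(), k) picks k positions from 1..n() within the current group
  mutate(value = replace(value, sample(n(), size = ceiling(frac * n())), NA)) %>%
  ungroup()

# Count the NAs per group to confirm the proportion
check <- out %>%
  group_by(ID) %>%
  summarise(n_na = sum(is.na(value)), n = n())
print(check)
```

With 30% rounded up, X1 (3 rows) gets 1 NA and X2 (6 rows) gets 2; only which rows are hit depends on the seed.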
akrun
  • Hi, I'm having trouble using your answer. When I run this code, I get the following error: "`n()` must only be used inside dplyr verbs." Any idea where this would come from? – Cam Sep 23 '20 at 21:37
  • @Cam Here, `n()` is used within `dplyr` syntax, i.e. in `mutate`. Not sure how that error popped up. Can you show the `packageVersion('dplyr')`? – akrun Sep 23 '20 at 21:40
  • @Cam Your input code had some typos. I just edited those in your post. Can you test now – akrun Sep 23 '20 at 21:43
  • It's version 1.0.1! And I'll try testing it now. – Cam Sep 23 '20 at 22:17
  • @Cam Can you try on a fresh R session with only `dplyr` loaded and the dataset because I can't replicate this – akrun Sep 23 '20 at 22:25
  • yes, that was it! Starting a new session did the trick. Thanks a lot!! – Cam Sep 23 '20 at 23:24
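
The thread does not say what else was loaded in the session that raised the error. One common cause (an assumption here) is another attached package, such as plyr, masking dplyr's mutate(), so that n() ends up being called outside a dplyr verb. A sketch of how one might check for that and sidestep it with explicit namespaces; the names mutate_owner and out are illustrative:

```r
library(dplyr)

df1 <- data.frame(ID = rep(c("X1", "X2"), times = c(3, 6)),
                  value = c(1, 2, 3, 1, 2, 3, 4, 5, 6))

# Namespace that the currently visible mutate() belongs to;
# anything other than "dplyr" means another package is masking it
mutate_owner <- environmentName(environment(mutate))
print(mutate_owner)

# Explicit dplyr:: prefixes pick the dplyr versions regardless of attach order
out <- df1 %>%
  dplyr::group_by(ID) %>%
  dplyr::mutate(value = replace(value,
                                sample(dplyr::n(), size = ceiling(0.3 * dplyr::n())),
                                NA)) %>%
  dplyr::ungroup()
```

Restarting R with only dplyr attached, as was done in the comments, achieves the same thing without the prefixes.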
0

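Assuming the data is in df, this sets a random 30% of all rows to NA. Note that it samples across the whole data frame, not within each ID, and that the row count is rounded because sample needs a whole number:

df[sample(seq_len(nrow(df)), round(nrow(df) * 0.3)), "value"] <- NA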
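
The one-liner above draws from all rows at once, so the share of NAs within each ID can vary from run to run. Here is a base R sketch that samples within each ID instead, assuming the question's df1 and a 30% share rounded up with ceiling. The names frac, rows_by_id and na_rows are illustrative, not from the answer.

```r
df1 <- data.frame(ID = rep(c("X1", "X2"), times = c(3, 6)),
                  value = c(1, 2, 3, 1, 2, 3, 4, 5, 6))

frac <- 0.3  # share of values to replace within each ID

# Row numbers of df1, split into one integer vector per ID
rows_by_id <- split(seq_len(nrow(df1)), df1$ID)

# Within each ID, pick ceiling(frac * group size) of its row numbers.
# Indexing with sample.int() avoids the sample(x) surprise when a group has one row.
na_rows <- unlist(lapply(rows_by_id, function(rows) {
  rows[sample.int(length(rows), size = ceiling(frac * length(rows)))]
}), use.names = FALSE)

df1$value[na_rows] <- NA
print(df1)
```

For the example data this always blanks 1 value in X1 and 2 in X2; which rows are chosen is random.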
Oliver
0

You can use sample() to get random indexes of your data.

You could try this

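# Example data: 10 rows, one ID per row
df <- data.frame(ID = paste0("X", 1:10),
                 value = rnorm(10))

fraction <- 0.30

# Pick round(10 * 0.30) = 3 positions at random and set them to NA
df$value[sample(seq_along(df$value), size = round(length(df$value) * fraction))] <- NA

# 30% of the values in df$value are now NA (across the whole data frame, not per ID)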
brendbech