Randomly replacing a percentage of values per group with NA in an R data frame

Question

I have a dataframe with different groups (ID) of varying size. Within each group, I would like to randomly replace a specific percentage of values in the "value" column (let's say 30%) with NA's. Here is a simplified version of my data:

ID<-rep(c("X1","X2"),times=c(3,6))
value<-c(1,2,3,1,2,3,4,5,6)
df1 <- data.frame(ID,value)
df1
ID value
X1     1
X1     2
X1     3
X2     1
X2     2
X2     3
X2     4
X2     5
X2     6

Here is what I would like to have:

ID value
X1     1
X1     NA
X1     3
X2     1
X2     2
X2     NA
X2     4
X2     5
X2     NA

Any idea how I could do this? I have a preference for using tidyverse but if you have other options, that would also be appreciated!

score 2 · Accepted Answer · answered Sep 23 '20 at 19:30

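We can use dplyr. Grouped by 'ID', use sample to draw the row indices for 30% of each group's rows (ceiling rounds the count up, so X1 with 3 rows gets 1 NA and X2 with 6 rows gets 2), then pass those indices to replace to set 'value' to NA.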

library(dplyr)
df1 %>%
    group_by(ID) %>%
    mutate(value = replace(value,
                           sample(row_number(), size = ceiling(0.3 * n()), replace = FALSE),
                           NA))
# A tibble: 9 x 2
# Groups:   ID [2]
#  ID    value
#  <chr> <dbl>
#1 X1       NA
#2 X1        2
#3 X1        3
#4 X2       NA
#5 X2        2
#6 X2       NA
#7 X2        4
#8 X2        5
#9 X2        6
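
Because sample is random, the rows that become NA will differ from run to run. A minimal sketch for making the result reproducible and checking the per-group counts, assuming an arbitrary seed of 123 and the hypothetical names result and na_counts (neither is part of the answer above):

```r
library(dplyr)

# Data from the question
df1 <- data.frame(ID = rep(c("X1", "X2"), times = c(3, 6)),
                  value = c(1, 2, 3, 1, 2, 3, 4, 5, 6))

set.seed(123)  # arbitrary seed, only so that reruns give the same rows

result <- df1 %>%
    group_by(ID) %>%
    mutate(value = replace(value,
                           sample(row_number(), size = ceiling(0.3 * n())),
                           NA)) %>%
    ungroup()

# Count rows and NAs per group: X1 has 1 NA out of 3, X2 has 2 out of 6
na_counts <- result %>%
    group_by(ID) %>%
    summarise(n_rows = n(), n_na = sum(is.na(value)))
na_counts
```

Which rows are chosen depends on the seed, but the number of NAs per group is fixed by ceiling(0.3 * n()).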

akrun

Hi, I'm having trouble using your answer. When I run this code, I get the following error: "`n()` must only be used inside dplyr verbs." Any idea where this would come from? – Cam Sep 23 '20 at 21:37
@Cam Here, `n()` is used within `dplyr` syntax, i.e. in `mutate`. Not sure how that error popped up. Can you show the `packageVersion('dplyr')`? – akrun Sep 23 '20 at 21:40
@Cam Your input code had some typos. I just edited those in your post. Can you test now? – akrun Sep 23 '20 at 21:43
It's version 1.0.1! And I'll try testing it now. – Cam Sep 23 '20 at 22:17
@Cam Can you try on a fresh R session with only `dplyr` loaded and the dataset, because I can't replicate this? – akrun Sep 23 '20 at 22:25
yes, that was it! Starting a new session did the trick. Thanks a lot!! – Cam Sep 23 '20 at 23:24
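
The thread does not establish why the fresh session helped. One possible cause of this message (an assumption here, not something confirmed above) is another attached package masking `mutate`, so that `n()` ends up being called outside a dplyr verb. A rough sketch of how one might check for that and sidestep it with namespace-qualified calls:

```r
library(dplyr)

# Which package does a bare `mutate` resolve to in this session?
# This is "dplyr" when nothing is masking it.
mutate_source <- environmentName(environment(mutate))
mutate_source

df1 <- data.frame(ID = rep(c("X1", "X2"), times = c(3, 6)),
                  value = c(1, 2, 3, 1, 2, 3, 4, 5, 6))

# Namespace-qualified calls do not depend on the search path,
# so they are unaffected by other attached packages.
result <- df1 %>%
    dplyr::group_by(ID) %>%
    dplyr::mutate(value = replace(value,
                                  sample.int(dplyr::n(), ceiling(0.3 * dplyr::n())),
                                  NA)) %>%
    dplyr::ungroup()
```

If `mutate_source` reports a package other than dplyr, restarting R and attaching dplyr last (or qualifying the calls as above) would be the thing to try.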

score 0 · Answer 2 · answered Sep 23 '20 at 19:29

Assuming the data is in df:

df[sample(seq(nrow(df)), nrow(df) *0.3), "value"] <- NA
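
Note that this one-liner samples across the whole data frame, so the share of NAs inside each ID group can differ from 30%. A per-group variant in base R could look like the sketch below; it assumes the same ceiling rounding as the dplyr answer, and na_rows is a hypothetical name:

```r
df <- data.frame(ID = rep(c("X1", "X2"), times = c(3, 6)),
                 value = c(1, 2, 3, 1, 2, 3, 4, 5, 6))

set.seed(42)  # arbitrary seed, for reproducibility only

# Split the row numbers by ID, then pick ceiling(30%) of them within each group.
# Indexing with sample.int() avoids the sample() pitfall for one-row groups.
na_rows <- unlist(lapply(split(seq_len(nrow(df)), df$ID), function(rows) {
    rows[sample.int(length(rows), ceiling(0.3 * length(rows)))]
}))

df$value[na_rows] <- NA

# NA count per group: 1 for X1, 2 for X2
tapply(is.na(df$value), df$ID, sum)
```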

Oliver

score 0 · Answer 3 · answered Sep 23 '20 at 19:32

You can use sample() to get random indexes of your data.

You could try this

df <- data.frame(ID = paste("X", 1:10),
                 value = rnorm(10))

fraction <- 0.30

df$value[sample(seq_along(df$value), size = round(length(df$value) * fraction))] <- NA

# 30% of the values in df$value (3 of 10 here) are now NA
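
The answers in this thread round the sample size differently (ceiling, round, or passing the product straight to sample), and for small groups the choice changes how many values are replaced. A small illustration for the group sizes in the question, where sizes is a hypothetical name:

```r
fraction <- 0.30
n <- c(3, 6)  # group sizes of X1 and X2 in the question

# Number of values that would be set to NA under each rounding rule
sizes <- data.frame(n       = n,
                    floor   = floor(fraction * n),
                    round   = round(fraction * n),
                    ceiling = ceiling(fraction * n))
sizes
```

For these sizes, round and ceiling agree (1 and 2 values), while floor would replace 0 values in X1 and 1 in X2.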
