Deleting unnecessary rows after column shuffling in a data frame in R

Question

I have a data frame as shown below. The Status of each ID is recorded at different time points: 0 means the person is alive and 1 means the person is dead.

ID   Status
1    0
1    0
1    1
2    0
2    0
2    0
3    0
3    0
3    0
3    1

I want to shuffle the Status column so that each ID has a Status of 1 at most once. After an ID's first 1, I want NA in that ID's remaining rows. For instance, I want my data frame to look like this after shuffling:

ID   Status
1    0
1    0
1    0
2    0
2    1
2    NA
3    0
3    1
3    NA
3    NA

*"I want to shuffle the column `Status`"* I don't understand what you're trying to do. By shuffling the column, do you mean a permutation of the *entries* of `Status`? *"[A]nd each ID can have a status of 1,"* But `ID=1` has no `Status=1` in your expected output. Can you clarify the rules? – Maurits Evers, Mar 28 '18 at 21:20
Why have `ID = 1` and `ID = 3` not been treated the same way in your example? – MKR, Mar 28 '18 at 21:25

Mike H. · Answer 1 · 2018-03-28T21:36:58.000

3

From the data you posted and your example output, it looks like you want to randomly sample df$Status and then do the replacement. To get what you want in one step you could do:

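set.seed(3)  # make the shuffle reproducible
# Shuffle Status across the whole column; then, within each ID, keep everything
# up to and including the first 1 and set the rows after it to NA
df$Status <- ave(sample(df$Status), df$ID,
                 FUN = function(x) replace(x, which(cumsum(x) >= 1)[-1], NA))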

df
#   ID Status
#1   1      0
#2   1      0
#3   1      0
#4   2      1
#5   2     NA
#6   2     NA
#7   3      0
#8   3      0
#9   3      1
#10  3     NA
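
To see why the one-liner works, it may help to run the inner function on a single vector first. The sketch below is only an illustration and not part of the original answer: `keep_first_one` is a hypothetical helper name, and `df` is rebuilt here from the table in the question. The exact shuffle produced by `set.seed(3)` depends on the R version (the default sampling algorithm changed in R 3.6.0), so the shuffled rows may differ from the output shown above.

```r
# Rebuild the data frame from the question (assumed to be named df)
df <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Status = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
)

# Hypothetical helper: keep everything up to and including the first 1,
# and turn every later entry into NA
keep_first_one <- function(x) {
  from_first_one <- which(cumsum(x) >= 1)  # positions at or after the first 1
  replace(x, from_first_one[-1], NA)       # spare the first 1, blank the rest
}

keep_first_one(c(0, 1, 0, 1))  # 0 1 NA NA
keep_first_one(c(0, 0, 0))     # 0 0 0 (no 1, so nothing is replaced)

# Same one-step approach: shuffle the whole column, then apply the helper per ID
set.seed(3)
shuffled <- ave(sample(df$Status), df$ID, FUN = keep_first_one)
data.frame(ID = df$ID, Status = shuffled)
```

Because `which()` returns an empty index when an ID has no 1 at all, `replace()` leaves that group untouched, so no special case is needed.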

edited Mar 28 '18 at 21:36

answered Mar 28 '18 at 21:29


Good solution. In case the OP doesn't want to randomly sample, your solution will still work by just removing the `sample` part. I compared the result of your method with mine using my modified data frame, and they match. – MKR Mar 28 '18 at 21:41
Fabulous! I was doing this using a for loop but the dataset that I have is huge, so it was taking too long. I knew there should be a simple fast way for it! Thanks – Slouei Mar 29 '18 at 01:22

MKR · Answer 2 · 2018-03-28T21:38:26.003

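One option is to use a cumsum of a cumsum to find the first 1 appearing for each ID. The inner cumsum counts the 1s seen so far, so the outer cumsum is 0 before the first 1, exactly 1 at the first 1, and greater than 1 on every row after it.

Note that I have modified the OP's sample data frame so that it represents the data after shuffling.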

library(dplyr)
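df %>% group_by(ID) %>%
  # Sum is 0 before the first 1, exactly 1 at the first 1, and > 1 afterwards
  mutate(Sum = cumsum(cumsum(Status))) %>%
  # set every row after the first 1 within each ID to NA
  mutate(Status = ifelse(Sum > 1, NA, Status)) %>%
  select(-Sum)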
# # A tibble: 10 x 2
# # Groups:   ID [3]
#       ID Status
#    <int>  <int>
#  1     1      0
#  2     1      0
#  3     1      1
#  4     2      0
#  5     2      1
#  6     2     NA
#  7     3      0
#  8     3      1
#  9     3     NA
# 10     3     NA
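
The double cumsum can be traced by hand on a single vector. The values below are made up for illustration and stand for the Status entries of one ID:

```r
status <- c(0, 1, 0, 1, 0)   # made-up example vector for one ID

first  <- cumsum(status)     # 0 1 1 2 2: number of 1s seen so far
second <- cumsum(first)      # 0 1 2 4 6: equals 1 only at the first 1

ifelse(second > 1, NA, status)  # 0 1 NA NA NA
```

The second 1 in `status` is also replaced by NA, which is why each ID ends up with at most one 1.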

Data

df <- read.table(text = 
"ID   Status
1    0
1    0
1    1
2    0
2    1
2    0
3    0
3    1
3    0
3    0", header = TRUE)
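
As the comment under the first answer suggests, dropping the `sample()` call from that answer should give the same result as the double cumsum on this modified data. A minimal base R check, assuming the data above; the names `via_replace` and `via_cumsum` are introduced here only for illustration, and `ave()` stands in for the dplyr pipeline:

```r
# Same modified data as in the Data block above
df <- data.frame(
  ID     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  Status = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L)
)

# Approach 1: replace everything after the first 1 (no sample() call)
via_replace <- ave(df$Status, df$ID,
                   FUN = function(x) replace(x, which(cumsum(x) >= 1)[-1], NA))

# Approach 2: double cumsum, written with base R's ave() instead of dplyr
via_cumsum <- ave(df$Status, df$ID,
                  FUN = function(x) ifelse(cumsum(cumsum(x)) > 1, NA, x))

via_replace                          # 0 0 1 0 1 NA 0 1 NA NA
identical(via_replace, via_cumsum)   # TRUE
```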
