I am working with the R programming language. Suppose I have the following data frame:
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
d = as.factor(sample( LETTERS[1:3], 100, replace=TRUE, prob=c(0.5, 0.3, 0.2) ))
my_data = data.frame(a,b,c,d)
head(my_data)
          a         b          c d
1 12.433326 10.573004   2.586044 A
2  9.985524  8.903590  25.806358 C
3  9.538077 13.875609 -11.572231 C
4  9.342444  6.483715   4.056420 B
5  8.825197  8.633457   6.357470 A
6  9.121292  7.988194  15.999959 B
My question: for each row, I want to randomly replace values with 0, with a probability that depends on the value of "d" and on the column:
Where d = "A": replace column "a" with 0 20% of the time, column "b" 30% of the time, and column "c" 50% of the time.
Where d = "B": replace column "a" with 0 50% of the time, column "b" 60% of the time, and column "c" 50% of the time.
Where d = "C": replace column "a" with 0 20% of the time, column "b" 15% of the time, and column "c" 20% of the time.
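One way to make these rules concrete is to write them as a lookup table with one row per level of "d" and one column per variable. The sketch below is only an illustration: the names p_zero, d_example, and p_row are introduced here and are not part of the original code.

```r
# Hypothetical lookup table (the name p_zero is an assumption for this sketch).
# Rows are the levels of d, columns are the variables to modify,
# and each entry is the probability of replacing the value with 0.
p_zero <- rbind(
  A = c(a = 0.20, b = 0.30, c = 0.50),
  B = c(a = 0.50, b = 0.60, c = 0.50),
  C = c(a = 0.20, b = 0.15, c = 0.20)
)

# Example group labels, standing in for my_data$d
d_example <- factor(c("A", "C", "C", "B", "A", "B"))

# One row of probabilities per data row, looked up by group label
p_row <- p_zero[as.character(d_example), ]
print(p_row)
```

Indexing the matrix by the character version of the factor repeats the matching row of probabilities for every observation, so each cell of the data has its own replacement probability.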
I could do this using base R, but only in a very inefficient way:
# Split the data by group
A <- my_data[which(my_data$d == "A"), ]
B <- my_data[which(my_data$d == "B"), ]
C <- my_data[which(my_data$d == "C"), ]
# For each column, draw "A" (replace with 0) or "B" (keep the value);
# prob = c(P(replace), P(keep))
A$a_new <- sample( LETTERS[1:2], nrow(A), replace=TRUE, prob=c(0.2, 0.8) )
A$b_new <- sample( LETTERS[1:2], nrow(A), replace=TRUE, prob=c(0.3, 0.7) )
A$c_new <- sample( LETTERS[1:2], nrow(A), replace=TRUE, prob=c(0.5, 0.5) )
A$a_new2 <- ifelse(A$a_new == "A", 0, A$a)
A$b_new2 <- ifelse(A$b_new == "A", 0, A$b)
A$c_new2 <- ifelse(A$c_new == "A", 0, A$c)
B$a_new <- sample( LETTERS[1:2], nrow(B), replace=TRUE, prob=c(0.5, 0.5) )
B$b_new <- sample( LETTERS[1:2], nrow(B), replace=TRUE, prob=c(0.6, 0.4) )
B$c_new <- sample( LETTERS[1:2], nrow(B), replace=TRUE, prob=c(0.5, 0.5) )
B$a_new2 <- ifelse(B$a_new == "A", 0, B$a)
B$b_new2 <- ifelse(B$b_new == "A", 0, B$b)
B$c_new2 <- ifelse(B$c_new == "A", 0, B$c)
C$a_new <- sample( LETTERS[1:2], nrow(C), replace=TRUE, prob=c(0.2, 0.8) )
C$b_new <- sample( LETTERS[1:2], nrow(C), replace=TRUE, prob=c(0.15, 0.85) )
C$c_new <- sample( LETTERS[1:2], nrow(C), replace=TRUE, prob=c(0.2, 0.8) )
C$a_new2 <- ifelse(C$a_new == "A", 0, C$a)
C$b_new2 <- ifelse(C$b_new == "A", 0, C$b)
C$c_new2 <- ifelse(C$c_new == "A", 0, C$c)
# Recombine the groups
final <- rbind(A,B,C)
head(final)
           a         b         c d a_new b_new c_new   a_new2    b_new2 c_new2
1  12.433326 10.573004  2.586044 A     A     B     B 12.43333 10.573004      0
5   8.825197  8.633457  6.357470 A     B     B     B  0.00000  8.633457      0
7   9.594164 10.600787 27.190108 A     B     A     B  0.00000  0.000000      0
10  8.441369  1.944389 11.250866 A     B     A     B  0.00000  0.000000      0
11  9.192280 13.970166 -2.829124 A     B     B     A  0.00000 13.970166      0
12  9.916996 12.970319  3.472191 A     B     A     A  0.00000  0.000000      0
Does anyone know of a more efficient way to solve this problem? Perhaps it could be done with the dplyr package and its mutate() function?
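For reference, here is a minimal sketch of what a dplyr version might look like, offered as one possible approach rather than a definitive answer. It assumes dplyr 1.0 or later (for across() and cur_column()). The lookup matrix p_zero, the "_new" column suffix, and the seed are assumptions introduced for illustration. Each cell gets its own uniform draw and is set to 0 when the draw falls below the probability for its group and column.

```r
library(dplyr)

set.seed(42)  # assumption: seed added only so the sketch is reproducible

# Same construction as in the question
my_data <- data.frame(
  a = rnorm(100, 10, 1),
  b = rnorm(100, 10, 5),
  c = rnorm(100, 10, 10),
  d = as.factor(sample(LETTERS[1:3], 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
)

# Hypothetical lookup table: probability of replacing a value with 0,
# by level of d (rows) and by column (columns)
p_zero <- rbind(
  A = c(a = 0.20, b = 0.30, c = 0.50),
  B = c(a = 0.50, b = 0.60, c = 0.50),
  C = c(a = 0.20, b = 0.15, c = 0.20)
)

my_data_new <- my_data %>%
  mutate(across(
    c(a, b, c),
    ~ {
      # Replacement probability for each row, given its group and the current column
      p <- unname(p_zero[as.character(d), cur_column()])
      # One uniform draw per row; zero the value when the draw is below p
      ifelse(runif(n()) < p, 0, .x)
    },
    .names = "{.col}_new"
  ))

head(my_data_new)

# Empirical share of zeros per group, to compare against p_zero
my_data_new %>%
  group_by(d) %>%
  summarise(across(ends_with("_new"), ~ mean(.x == 0)))
```

With only 100 rows the observed shares will fluctuate around the target probabilities, especially for the smaller groups. Dropping the .names argument would overwrite "a", "b", and "c" in place instead of adding new columns.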
Thanks!