A bigish data.table with 750k rows and almost 200 columns, but this will do:
dt <- data.table(id = 1:15,
outcome = factor(c(0, 0, 0, 1, 1, 3, 3, 3, 3, 3, 0, 3, 1, 1, 3),
labels = c("F0","F1","F3")),
var1 = c(0, 0, 0, 1, 1, 2, 2, 2, NA, NA, 0, 0, 0, 0, 0),
var2 = c(0, 0, NA, NA, NA, 0, 0, 0, 0, 0, 6, 6, 4, 4, 4))
I want to change/group the "labels" of the variables var1, var2, (and any other variable) by a condition in the outcome variable. A table explains visually what I want to change
xtabs(~var1+outcome, dt, addNA = TRUE)
xtabs(~var2+outcome, dt, addNA = TRUE)
When outcome==F1 has no occurrences, the labels of var1, var2, and any other variables should be grouped. From the 1st table: change 2; from the 2nd table, group 0 and 6.
If the number of levels and variables were small, I could do this by hand with one liners:
dt$var1[dt$var1==2] <- "nF" #data frame way
dt[, var1 := as.character(var1)][var1 == "2", var1 := "nF"] #data.table way
xtabs(~var1+outcome, dt, addNA = TRUE) #check
outcome
var1 F0 F1 F3
0 4 2 2
1 0 2 0
nF 0 0 3
<NA> 0 0 2
Both of these one liners work, but as you can imagine, with 200 columns and some variables with over a thousand levels, this is impossible.
So I came up with an idea:
- build a table for every variable with respect to the outcome
- get the labels where outcome==F1 has 0 occurrences
- change the variable with an ifelse statement
Step 1
#rebuild dt to try this
(temp1 <- dcast(data = dt,
formula = var2 ~ outcome,
value.var = "outcome",
fun.aggregate = length))
Step 2
tempvar <- temp1[F1==0 & var2!="NA", var2]
Step 3
dt[, var2 := ifelse(var2 %in% tempvar, "nF", var2)]
xtabs(~var2+outcome, dt, addNA = TRUE) #check
outcome
var2 F0 F1 F3
4 0 2 1
nF 3 0 6
<NA> 1 2 0
This also works, and I avoid having to go through all those labels. So I made it a for loop... with a very cool counter =p
# Initialize
tabs <- c()
temp <- c()
counter <- 0
for (i in colnames(dt[, c("var1", "var2")])) {
# counter & progress
counter <- counter + 1
cat("Variable: ", counter, "of", ncol(dt), " ", i, "\n")
# build tables for each variable with dcast
tabs[[i]] <- dcast(data = dt,
formula = dt[[i]] ~ outcome,
value.var = "outcome",
fun.aggregate = length)
# temp: labels to group & set name
temp[[i]] <- data.table(tabs[[i]][F1==0 & dt!="NA", dt],
"nF")
colnames(temp[[i]])[1] <- i
}
# Names of the 1st column for each tabs (for some reason, I couldn't do it inside the loop)
for(i in 1:length(tabs)) {colnames(tabs[[i]])[1] <- names(tabs[i])}
This works fine... so far. Now let us look at temps and tabs:
#temp has the labels to be changed for each variable
temp
#tabs has the tables for each variable with respect to the outcome
tabs
And this is it. I'm stuck, I've been at this for 2 days and almost all Stackoverflow links are purple. Now I don't know how to do the rest.
- Did I overdo it? - Is there a better way?
- Any help you can send my way? would prefer data.table due to speed, but at this point, I won't complain.
Thanks, Aldo