0

A bigish data.table with 750k rows and almost 200 columns, but this will do:

dt <- data.table(id = 1:15,
             outcome = factor(c(0, 0, 0, 1, 1, 3, 3, 3, 3, 3, 0, 3, 1, 1, 3),
                              labels = c("F0","F1","F3")),
             var1 = c(0, 0, 0, 1, 1, 2, 2, 2, NA, NA, 0, 0, 0, 0, 0),
             var2 = c(0, 0, NA, NA, NA, 0, 0, 0, 0, 0, 6, 6, 4, 4, 4))

I want to change/group the "labels" of the variables var1, var2, (and any other variable) by a condition in the outcome variable. A table explains visually what I want to change

xtabs(~var1+outcome, dt, addNA = TRUE)
xtabs(~var2+outcome, dt, addNA = TRUE)

When outcome==F1 has no occurrences, the labels of var1, var2, and any other variables should be grouped. From the 1st table: change 2; from the 2nd table, group 0 and 6.

If the number of levels and variables were small, I could do this by hand with one liners:

dt$var1[dt$var1==2] <- "nF"                                  #data frame way
dt[, var1 := as.character(var1)][var1 == "2", var1 := "nF"]  #data.table way
xtabs(~var1+outcome, dt, addNA = TRUE)                       #check

      outcome
var1   F0 F1 F3
  0     4  2  2
  1     0  2  0
  nF    0  0  3
  <NA>  0  0  2

Both of these one liners work, but as you can imagine, with 200 columns and some variables with over a thousand levels, this is impossible.

So I came up with an idea:

  1. build a table for every variable with respect to the outcome
  2. get the labels where outcome==F1 has 0 occurrences
  3. change the variable with an ifelse statement

Step 1

#rebuild dt to try this
(temp1 <- dcast(data = dt,
           formula = var2 ~ outcome,
           value.var = "outcome",
           fun.aggregate = length))

Step 2

tempvar <- temp1[F1==0 & var2!="NA", var2]

Step 3

dt[, var2 := ifelse(var2 %in% tempvar, "nF", var2)]
xtabs(~var2+outcome, dt, addNA = TRUE)                       #check


      outcome
var2   F0 F1 F3
  4     0  2  1
  nF    3  0  6
  <NA>  1  2  0

This also works, and I avoid having to go through all those labels. So I made it a for loop... with a very cool counter =p

# Initialize
tabs <- c()
temp <- c()
counter <- 0

for (i in colnames(dt[, c("var1", "var2")])) {

# counter & progress
counter <- counter + 1
cat("Variable: ", counter, "of", ncol(dt), " ", i, "\n")

# build tables for each variable with dcast
tabs[[i]] <- dcast(data = dt,
                 formula = dt[[i]] ~ outcome,
                 value.var = "outcome",
                 fun.aggregate = length)

# temp: labels to group & set name
temp[[i]] <- data.table(tabs[[i]][F1==0 & dt!="NA", dt],
                      "nF")
colnames(temp[[i]])[1] <- i
}

# Names of the 1st column for each tabs (for some reason, I couldn't do it inside the loop)
for(i in 1:length(tabs)) {colnames(tabs[[i]])[1] <- names(tabs[i])}

This works fine... so far. Now let us look at temps and tabs:

#temp has the labels to be changed for each variable
temp

#tabs has the tables for each variable with respect to the outcome
tabs

And this is it. I'm stuck, I've been at this for 2 days and almost all Stackoverflow links are purple. Now I don't know how to do the rest.

  • Did I overdo it? - Is there a better way?
  • Any help you can send my way? would prefer data.table due to speed, but at this point, I won't complain.

Thanks, Aldo

FG7
  • 469
  • 4
  • 14
504aldo
  • 57
  • 6

1 Answers1

1

If I understood correctly, this does what you want:

library(data.table)

dt <- data.table(id = 1:15,
                 outcome = factor(c(0, 0, 0, 1, 1, 3, 3, 3, 3, 3, 0, 3, 1, 1, 3),
                                  labels = c("F0","F1","F3")),
                 var1 = as.character(c(0, 0, 0, 1, 1, 2, 2, 2, NA, NA, 0, 0, 0, 0, 0)),
                 var2 = as.character(c(0, 0, NA, NA, NA, 0, 0, 0, 0, 0, 6, 6, 4, 4, 4)))

long <- melt(dt, "outcome", setdiff(names(dt), c("id", "outcome")))

to_group <- long[, .(dummy = .N), by = .(outcome, variable, value)
                 ][, .(value = setdiff(value, c(NA, value[outcome == "F1"]))), by = "variable"]

for (var in unique(to_group$variable)) {
  dt[list(to_group[variable == var, value]), (var) := "nF", on = var]
}

dt[]
    id outcome var1 var2
 1:  1      F0    0   nF
 2:  2      F0    0   nF
 3:  3      F0    0 <NA>
 4:  4      F1    1 <NA>
 5:  5      F1    1 <NA>
 6:  6      F3   nF   nF
 7:  7      F3   nF   nF
 8:  8      F3   nF   nF
 9:  9      F3 <NA>   nF
10: 10      F3 <NA>   nF
11: 11      F0    0   nF
12: 12      F3    0   nF
13: 13      F1    0    4
14: 14      F1    0    4
15: 15      F3    0    4

Changing to long format with melt makes it easier to apply the subsequent logic for each var* column. For your demo data, long looks like this:

> head(long)
   outcome variable value
1:      F0     var1     0
2:      F0     var1     0
3:      F0     var1     0
4:      F1     var1     1
5:      F1     var1     1
6:      F3     var1     2

Therefore, you can think of the [, .(dummy = .N), by = .(outcome, variable, value)] frame as a "distinct" operation. It will create something similar to xtabs but without adding 0 to non-existing combinations.

The next frame simply takes the set of all values for each var* and removes those that appear at the same time as outcome == "F1", as well as NA. This is like computing the 0 counts for values that never appear when outcome is F1.

The code in the for-loop uses secondary indices notation. For each var* column, it searches for rows where the values match what is present in to_group$value, and replaces said values with "nF".

I'm not sure if that's the most efficient, but since you said you wanted to modify the original dt (possibly to keep id), this is what I came up with. You might want to rm(long) in the end.

Alexis
  • 4,950
  • 1
  • 18
  • 37
  • **this one works**. I see you changed to `as.character` before everything; `rm(long)` will be _necessary_ with my original dataset. After a good night sleep,I thought I should try with melt. I'm very close with my method, but your answer works. Also, **secondary indices notation; TIL** – 504aldo Sep 07 '19 at 00:53