2

I want to seperate variables according to a "lead" variable. x3 in the following case:

set.seed(2)
df = data.frame(x1 = sample(4), x2 = sample(4), x3 = sample(letters[1:2], size = 4, replace = TRUE))
df
#   x1 x2 x3
# 1  1  4  a
# 2  3  3  b
# 3  2  1  b
# 4  4  2  a

# Desired output
# x3 x1.a x2.a x1.b x2.b
#  a    1    4   NA   NA
#  b   NA   NA    3    3
#  b   NA   NA    2    1
#  a    4    2   NA   NA

I somehow sense that this could be achieved with reshape2::dcast() but I could only get it to work for two variables in total:

reshape2::dcast(df[,2:3], seq_along(x3) ~ x3, value.var = "x2")[, -1]
#    a  b
# 1  2 NA
# 2 NA  1
# 3 NA  3
# 4  4 NA

But may be this is just a total abuse of dcast. Is there an elegant solution to this problem, without splitting and merging df?


EDIT: Some people mentioned that to do this is a horrible idea and that i probably should not do such a thing. Let me elaborate on when this can make sense.

Imagine x3 is a switch for an specific algorithm. In this case a and b are the options. Furthermore x1 and x2 are parameters both algorithms can take. Unfortunately both algorithms behave really different on the same parameter settings for x1 and x2 so it makes sense to handle them as distinct features to take their uncorrolatedness in to account.

jakob-r
  • 6,824
  • 3
  • 29
  • 47
  • 5
    Well, there's `merge(df[df$x3=="a",], df[df$x3=="b",], by="x3", all=TRUE)`, but it's natural that there would be no elegant way to get such messy output. – Frank Sep 09 '16 at 14:12
  • @agenis That is fixed now. – jakob-r Sep 26 '16 at 14:53
  • @Frank This becomes messy if I don't now the discrete values of `x3`. – jakob-r Sep 26 '16 at 14:54
  • just one remark: you should probably have improved the title of your question to be more specific, if you wanted more "attention"; might save a bounty next time! ;-) – agenis Sep 26 '16 at 16:13
  • Is row and column order important? – Jonno Bourne Sep 26 '16 at 20:32
  • a generalisation of Franks comment... `Reduce(function(...) merge(..., by="x3", all=TRUE), split(df, df$x3))` – user20650 Sep 30 '16 at 11:09
  • The generalization looks really elegant but has some problems with duplicated column names (which can obviously fixed by renaming it after) but well as the solution given by @Frank it reorders the rows by "x3". – jakob-r Sep 30 '16 at 15:38
  • @jakobr; yup, but renaming the new column names would be straight forward as they are added in the order of the split, and you also know how many columns you are applying the transform over. and again trivial to order by the original row order: but this would be fairly slow approach – user20650 Sep 30 '16 at 15:44

4 Answers4

5

Here's a solution using the creation of dummy interaction terms with X3. It's probably possible to fit all this code into one single line by using dplyr or data.table but here it is:

temp <- model.matrix( ~ (x1+x2):x3-1, df)
temp[model.matrix( ~ (I(x1+1)+I(x2+1)):x3-1, df) == temp] <- NA
data.frame(df$x3, temp)
####  df.x3 x1.x3a x1.x3b x3a.x2 x3b.x2
####      a      1     NA      4     NA
####      b     NA      3     NA      3
####      b     NA      2     NA      1
####      a      4     NA      2     NA

The final name and order of the columns slightly differs from you.

Note: (purpose of the second line of code). the model.matrix function creates zeros instead of NAs, so there's no way to tell the difference with pre-existing zeros. The second line is a trick to find the final NAs only (it works by creating a second model matrix while altering its values by +1).

agenis
  • 8,069
  • 5
  • 53
  • 102
  • 1
    I like your model.matrix approach. I wonder if this is any clearer (although slower) : `m <- model.matrix(~ 0 + x3, df) ; m[m==0] <- NA ; data.frame(df$x3, apply(m, 2, "*", df[-3]))` – user20650 Sep 27 '16 at 21:00
  • 1
    @user20650 thanks for your input, that's indeed a nice alternative solution. You might post it as an answer. I was wondering if we could reduce this into a oneliner – agenis Sep 28 '16 at 08:11
  • 1
    well you could use also use `model.matrix(~ 0 + x3, df)^NA`,, rather than explicitly setting zero to NA. So in one line `data.frame(df$x3, apply(model.matrix(~ 0 + x3, df)^NA , 2, "*", df[-3]))`, but really cant be called a one liner as it is several calculations. – user20650 Sep 28 '16 at 13:47
  • The only thing that bothers me about this solution is that it get's a bit messy for a generic input where I only now the name of the categorical columns, because then I have to generate the formulas from strings. – jakob-r Sep 30 '16 at 15:53
  • @jakobr OK feel free to modify your question so I can take this specification into account, ;-) – agenis Sep 30 '16 at 22:24
3

This can be achieved using melt and dcast if you add one more column and do an intermediary melt.

library(reshape2)
library(magrittr)

set.seed(2)
df = data.frame(x1 = sample(4), x2 = sample(4), x3 = sample(letters[1:2], size = 4, replace = TRUE))

df$row <- 1:nrow(df)

melt(df, 
     id.vars = c("row", "x3"), 
     measure.vars = c("x1", "x2")) %>%
  dcast(row ~ x3 + variable, 
        value.var = "value")

However, it runs 2-3 times slower than agenis's solution, even when I push the size of the data frame as high as 10,000 rows. (8 vs. 16 milliseconds).

Benjamin
  • 16,897
  • 6
  • 45
  • 65
2

A basic solution I came up with myself:

cat.var = "x3"
cont.vars = setdiff(colnames(df), cat.var)
categories = unique(df[[cat.var]])
res = lapply(categories, function(x) {
  this.df = df[, cont.vars, drop = FALSE]
  this.df[df[[cat.var]] != x,] = NA
  setNames(this.df, paste0(x,".",colnames(this.df)))
})
res = do.call(cbind, c(list(df[, cat.var, drop=FALSE]), res))
res

#   x3 a.x1 a.x2 b.x1 b.x2
# 1  a    1    4   NA   NA
# 2  b   NA   NA    3    3
# 3  b   NA   NA    2    1
# 4  a    4    2   NA   NA
jakob-r
  • 6,824
  • 3
  • 29
  • 47
  • I clock this at about 3 times as fast as @agenis's solution, and 5 times as fast as mine. Nicely done. – Benjamin Sep 26 '16 at 16:13
1

You could use tidyr

library(tidyr);library(dplyr)
df <- df %>% mutate(rows=rownames(.)) %>% 
gather(., key="vars", value= "val", -x3,-rows) %>%
    mutate(vars= paste(x3,vars, sep=".")) %>%  
spread(., key = vars, value = val) %>%
select(-rows)

It gathers the data set into long form holding the x3 variable separate then, after creating the variable headers you require, spreads the data again.

Jonno Bourne
  • 1,931
  • 1
  • 22
  • 45
  • Interesting solution but you have to note that it reorders the rows and sorts them by "x3". – jakob-r Sep 30 '16 at 15:33
  • True, if row order is important you can add `arrange(rows) %>%` before the select statement. It's still not the fastest though. – Jonno Bourne Sep 30 '16 at 15:51