
I have a dataset (data) that looks like this:

ID,ABC.BC,ABC.PL,DEF.BC,DEF.M,GHI.PL
SB0005,C01,D20,C01a,C01b,D20
BC0013,C05,D5,C05a,NA,D5

I want to reshape it from wide-to-long format to get something like this:

ID,FC,Type,Var
SB0005,ABC,BC,C01
SB0005,ABC,PL,D20
SB0005,DEF,BC,C01a
SB0005,DEF,M,C01b
SB0005,GHI,PL,D20
BC0013,ABC,BC,C05
BC0013,ABC,PL,D5
BC0013,DEF,BC,C05a
# BC0013,DEF,M,NA (This row need not be in the dataset as I will remove it later)
BC0013,GHI,PL,D5

The usual reshape package does not work as the dataset is unbalanced. I also tried Reshape from splitstackshape but it does not give me what I want.

library(splitstackshape)
vary <- grep("\\.BC$|\\.PL$|\\.M$", names(data))
stubs <- unique(sub("\\..*$", "", names(data[vary])))
Reshape(data, id.vars=c("ID"), var.stubs=stubs, sep=".")

ID,time,ABC,DEF,GHI
SB0005,1,C01,C01a,D20
BC0013,1,C05,C05a,D5
SB0005,2,D20,C01b,NA
BC0013,2,D5,NA,NA
SB0005,3,NA,NA,NA
BC0013,3,NA,NA,NA

Appreciate any suggestions, thanks!

Providing the output of dput(data) as requested

structure(list(ID = structure(c(2L, 1L), .Label = c("BC0013", 
"SB0005"), class = "factor"), ABC.BC = structure(1:2, .Label = c("C01", 
"C05"), class = "factor"), ABC.PL = structure(1:2, .Label = c("D20", 
"D5"), class = "factor"), DEF.BC = structure(1:2, .Label = c("C01a", 
"C05a"), class = "factor"), DEF.M = structure(1:2, .Label = c("C01b", 
"NA"), class = "factor"), GHI.PL = structure(1:2, .Label = c("D20", 
"D5"), class = "factor")), .Names = c("ID", "ABC.BC", "ABC.PL", 
"DEF.BC", "DEF.M", "GHI.PL"), row.names = c(NA, -2L), class = "data.frame")
Jaap
phusion
  • Please provide the output of `dput(data)` in your question so we can reproduce your efforts. – Chrisss Sep 27 '16 at 06:04
  • How is it unbalanced? Do you mean the `NA` that you wish to drop? Also, should the last row of expected output have `D5` not `D20`? – mathematical.coffee Sep 27 '16 at 06:12
  • You're right, I corrected the error, thanks. It's unbalanced because BC, PL and M do not appear in every FC, e.g. BC appears in ABC and DEF but not in GHI. – phusion Sep 27 '16 at 06:20

1 Answer


You need to reshape your data into long format first, and then you can split the variable column into two columns. With splitstackshape (the data frame is called mydf here) you could do:

library(splitstackshape) # this will also load 'data.table' from which the 'melt' function is used
cSplit(melt(mydf, id.vars = 'ID'), 
       'variable', 
       sep = '.', 
       direction = 'wide')[!is.na(value)]

which results in:

       ID value variable_1 variable_2
1: SB0005   C01        ABC         BC
2: BC0013   C05        ABC         BC
3: SB0005   D20        ABC         PL
4: BC0013    D5        ABC         PL
5: SB0005  C01a        DEF         BC
6: BC0013  C05a        DEF         BC
7: SB0005  C01b        DEF          M
8: SB0005   D20        GHI         PL
9: BC0013    D5        GHI         PL
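
For comparison, the same two steps (stack the value columns, then split the column names on the dot) can be sketched in base R without any packages. This is only an illustration, not part of the approach above: the data frame is rebuilt by hand, and it assumes the missing DEF.M entry is a real NA. In the dput() output from the question it is stored as the string "NA" inside a factor, which is.na() would not catch, so it would have to be converted first.

```r
# Rebuild the example data by hand (assumption: character columns, real NA)
mydf <- data.frame(
  ID     = c("SB0005", "BC0013"),
  ABC.BC = c("C01", "C05"),
  ABC.PL = c("D20", "D5"),
  DEF.BC = c("C01a", "C05a"),
  DEF.M  = c("C01b", NA),
  GHI.PL = c("D20", "D5"),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(mydf), "ID")

# Step 1: wide to long, stacking the value columns one after another
long <- data.frame(
  ID  = rep(mydf$ID, times = length(value_cols)),
  key = rep(value_cols, each = nrow(mydf)),
  Var = unlist(mydf[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Step 2: split "ABC.BC" into "ABC" and "BC" on the literal dot
parts <- do.call(rbind, strsplit(long$key, ".", fixed = TRUE))

result <- data.frame(
  ID   = long$ID,
  FC   = parts[, 1],
  Type = parts[, 2],
  Var  = long$Var,
  stringsAsFactors = FALSE
)

# Drop the missing rows and group the rows by ID in their original order
result <- result[!is.na(result$Var), ]
result <- result[order(match(result$ID, mydf$ID)), ]
rownames(result) <- NULL
result
```

The column names ID, FC, Type and Var follow the layout requested in the question; the final ordering step is optional and only there to match that layout row for row.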

An alternative with tidyr:

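library(tidyr)
library(dplyr) # provides 'filter'; tidyr alone does not export it
mydf %>% 
  gather(var, val, -ID) %>%          # wide to long
  separate(var, c('FC','Type')) %>%  # split 'ABC.BC' into 'ABC' and 'BC'
  filter(!is.na(val))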
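
In more recent tidyr versions (1.0.0 and later), the reshaping and the splitting of the column names can probably be done in a single call with pivot_longer(). The sketch below is an assumption-laden illustration rather than part of the original answer: the data frame is rebuilt by hand with character columns and a real NA, and the output column names FC, Type and Var are taken from the layout requested in the question.

```r
library(tidyr)

# Rebuild the example data by hand (assumption: character columns, real NA)
mydf <- data.frame(
  ID     = c("SB0005", "BC0013"),
  ABC.BC = c("C01", "C05"),
  ABC.PL = c("D20", "D5"),
  DEF.BC = c("C01a", "C05a"),
  DEF.M  = c("C01b", NA),
  GHI.PL = c("D20", "D5"),
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  mydf,
  cols           = -ID,              # reshape everything except the ID column
  names_to       = c("FC", "Type"),  # the two pieces of each column name
  names_sep      = "\\.",            # split on the literal dot (regex)
  values_to      = "Var",
  values_drop_na = TRUE              # drop rows such as BC0013 / DEF / M
)
long
```

Because pivot_longer() works through the data row by row, the rows for each ID stay together, which matches the ordering in the expected output.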
Jaap