Reshaping data in R (wide -> Long)

Question

I want to convert df1 to df2

Old sample data frame df1

df1 <- structure(list(ID = 1:2,                Group = c(1L, 1L),
                      M1a2hB = c(0.2, 0.3),    M1a3hB = c(0.4, 0.6),
                      M2a2hB = c(0.3, 0.4),    M2a3hB = c(0.6, 0.6),
                      M1r2hB = c(200L, 300L),  M1r3hB = c(400L, 600L),
                      M2r2hB = c(300L, 400L),  M2r3hB = c(600L, 600L)),
                 .Names = c("ID", "Group", "M1a2hB", "M1a3hB", "M2a2hB",
                            "M2a3hB","M1r2hB", "M1r3hB","M2r2hB", "M2r3hB"),
                 class = "data.frame", row.names = c(NA, -2L))

ID Group M1a2hB M1a3hB M2a2hB M2a3hB.... M1r2hB M1r3hB M2r2hB M2r3hB ...
1   1      0.2  0.4    0.3   0.6    ...     200    400   300    600    ...
2   1      0.3  0.6    0.4   0.6    ...     300    600   400    600    ...

Here, df1 has 100 IDs and 1100 Columns. Each outcome measure has two columns for absolute change and two for relative change. There are nearly 270 outcome measures.

M1a2hB is absolute change in first measure from time 2 to baseline and M1a3hB is absolute change for time 3 to baseline. Similarly, M1r2hB is relative change in first outcome from time 2 to baseline and M1r3hB is relative change in an outcome from time 3 to baseline.

New df2:

ID Group time  M1a           M2a        ...  M1r           M2r        ...
1  1     1     0.0           0.0        ...  000           000         ...
1  1     2     0.2           0.3        ...  200           300         ...
1  1     3     0.4           0.6        ...  400           600         ...
2  1     1     0.0           0.0        ...  000           000         ...
2  1     2     0.3           0.4        ...  300           400         ...
2  1     3     0.6           0.6        ...  600           600         ...

Any tips? Feel free to ask for any clarification. Thanks! Looking forward!

p.s. I have tried to run few codes from previous posts (pls see below if interested), but they seem different because df is three dimensional data, and df2 includes additional time column

In R, plotting wide form data with ggplot2 or base plot. Is there a way to use ggplot2 without melting wide form data frame?

Reshaping repeated measures data in R wide to long

It shouldn't be too difficult for you to make some sample data that replicates your actual problem. That would make it much easier for others to suggest answers or to let you know if there's already an answer available at SO. — A5C1D2H2I1M1N2O1R2T1, Mar 12 '16 at 11:17
@AnandaMahto - I don't understand you completely, because I have already posted sample data that replicated the actual question. — Aby, Mar 12 '16 at 11:21

akrun · Accepted Answer · 2016-03-12T11:32:16.537

1

We can extract the patterns from the column names using sub, split the sequence of that vector with 'nm1', use that as measure in melt to convert from 'wide' to 'long' format.

library(data.table)
nm1 <- sub("\\d+[[:alpha:]]+$", '', names(df1)[-(1:2)])
lst <- split(seq_along(nm1)+2, nm1)
melt(setDT(df1), measure = lst, 
       value.name= names(lst), variable.name= 'time')[order(ID)]
#   ID Group time M1a M1r M2a M2r
#1:  1     1    1 0.2 200 0.3 300
#2:  1     1    2 0.4 400 0.6 600
#3:  2     1    1 0.3 300 0.4 400
#4:  2     1    2 0.6 600 0.6 600

data

df1 <- structure(list(ID = 1:2, Group = c(1L, 1L),
  M1a2hB = c(0.2, 0.3
), M1a3hB = c(0.4, 0.6), M2a2hB = c(0.3, 0.4),
 M2a3hB = c(0.6, 
0.6), M1r2hB = c(200L, 300L), M1r3hB = c(400L, 600L), 
M2r2hB = c(300L, 
400L), M2r3hB = c(600L, 600L)), .Names = c("ID", "Group", "M1a2hB", 
"M1a3hB", "M2a2hB", "M2a3hB", "M1r2hB", "M1r3hB",
"M2r2hB", "M2r3hB"
), class = "data.frame", row.names = c(NA, -2L))

edited Mar 12 '16 at 11:32

answered Mar 12 '16 at 11:18

akrun

874,273
37
540
662

many thanks. Glad you understood my question, your code works on sample data :D Could you pls explain three steps of your code after loading library? Learning by trying. As you know, I have ~100 `ID`, and 270 `M` with different columns names, how shall I proceed? Regarding `names(df1)` my first outcome measure (M1a2hB) in on column 3, then (M1a3hB) on 261 column, 519 on M1r2hB and 777 on M1r3hB. Second outcome measure (M2a2hB) in on column 4, then (M2a3hB) on 262 column, 520 on M2r2hB and 778 on M2r3hB, and so on. I think, order by ID (without Group) alone might work – Aby Mar 12 '16 at 11:48
1

@AmitBansal I can only answer for the example you provided (that too with these `...` makes it so difficult). I assume that you have the same pattern. Anyway, this answers the question posted with the example showed. – akrun Mar 12 '16 at 11:50

score 0 · Answer 2 · answered Mar 12 '16 at 17:39

Here is an answer using tidyr:

library(dplyr)
library(tidyr)
library(rex)

string_interpretation = 
  rex(capture("M", 
              digits, 
              or("a", "r")), 
      capture(digits))

result = 
  df1 %>%
  gather(string, value, -ID, -Group) %>%
  extract(string, c("variable", "time"), string_interpretation) %>%
  spread(variable, value)

score 0 · Answer 3 · answered Nov 29 '16 at 21:38

Built-in base::reshape can do this fairly well:

df1 <- structure(list(ID = 1:2,                Group = c(1L, 1L),
                      M1a2hB = c(0.2, 0.3),    M1a3hB = c(0.4, 0.6),
                      M2a2hB = c(0.3, 0.4),    M2a3hB = c(0.6, 0.6),
                      M1r2hB = c(200L, 300L),  M1r3hB = c(400L, 600L),
                      M2r2hB = c(300L, 400L),  M2r3hB = c(600L, 600L)),
                 .Names = c("ID", "Group", "M1a2hB", "M1a3hB", "M2a2hB",
                            "M2a3hB","M1r2hB", "M1r3hB","M2r2hB", "M2r3hB"),
                 class = "data.frame", row.names = c(NA, -2L))

df1

#  ID Group M1a2hB M1a3hB M2a2hB M2a3hB M1r2hB M1r3hB M2r2hB M2r3hB
#   1     1    0.2    0.4    0.3    0.6    200    400    300    600
#   2     1    0.3    0.6    0.4    0.6    300    600    400    600

df2 <- reshape(df1, varying=list(c(3,4),c(5,6),c(7,8),c(9,10)),
        v.names=c("M1a", "M2a", "M1r", "M2r"),
        timevar="time", times=2:3, direction="long")

df2

#   ID Group time M1a M2a M1r M2r id
#    1     1    2 0.2 0.3 200 300  1
#    2     1    2 0.3 0.4 300 400  2
#    1     1    3 0.4 0.6 400 600  1
#    2     1    3 0.6 0.6 600 600  2

If you have n <- 270 measurements at m <- 2 time points(2h, 3h), change the parameters of reshape to

varying=split(1:(n*m*2)+2,rep(1:(n*2), each=m))  # `*2` accounts for doubling by relative and absolute measurements.
                                                 # `+2` accounts for the `ID` and `Group` columns at the beginning 

v.names=c(paste0("M", 1:n, "a"), paste0("M", 1:n, "r"))

I assume time==1 in your example df2 refers to measurements at baseline, not an unmentioned 1h since they appear to be all zeros. For clarity, I'll show baseline as time==0. One way to get baseline to show up in in df2 would be to add the zero-value baseline measurements into df1.

n <- 2  # use n <- 270 for 270 outcomes, measured at each time point, reported both in absolute and relative terms

df1.5 <- data.frame(df1,
    setNames(as.list(rep(0,2*n)), c(paste0("M", 1:n, "a0hB"), paste0("M", 1:n, "r0hB"))))

df2 <- reshape(df1.5, varying=split(1:(n*3*2)+2, c(rep(1:(n*2), each=2), 1:(n*2))),
        v.names=c(paste0("M", 1:n, "a"), paste0("M", 1:n, "r")),
        timevar="time", idvar=c("Group", "ID"), times=c(2,3,0), direction="long")

#  ID Group time M1a M2a M1r M2r
#   1     1    2 0.2 0.3 200 300
#   2     1    2 0.3 0.4 300 400
#   1     1    3 0.4 0.6 400 600
#   2     1    3 0.6 0.6 600 600
#   1     1    0 0.0 0.0   0   0
#   2     1    0 0.0 0.0   0   0

And sort it.

df2.sorted <- df2[order(df2$Group, df2$ID, df2$time),]

score 0 · Answer 4 · edited Jun 20 '20 at 09:12

you can use my r package onetree, which is uploaded to my github yikeshu0611.

install.packages("devtools") #if you didnot have devtools packages in r
library(devtools)
install_github("yikeshu0611/onetree") #install onetree package from github

1. step by step

First, I will teach you how to convert wide to long step by step.

library(onetree)
long1=reshape_toLong(data=df1, 
                      id= "ID", 
                      j="newcolumn", 
       value.var.prefix=c("M1a","M2a","M1r","M2r")

In this command, j is the name of new column. you will get the result long1 below

long1

ID Group newcolumn M1a M2a M1r M2r
1     1       2hB 0.2 0.3 200 300
1     1       3hB 0.4 0.6 400 600
2     1       2hB 0.3 0.4 300 400
2     1       3hB 0.6 0.6 600 600

Further, we can see in data long1, M1a, M2a-------, M1r, M2r-----. the data is still a wide data. We can still convert it to long. We use M1, M2 as prefix. a and r as new column, which is test way. Command is below.

long2=reshape_toLong(data = long1,
                       id = c("ID","newcolumn"),
                        j = "testway",
        value.var.prefix = c("M1","M2"))
long2
   ID newcolumn Group testway    M1    M2
1  1       2hB     1       a   0.2   0.3
2  1       2hB     1       r 200.0 300.0
3  1       3hB     1       a   0.4   0.6
4  1       3hB     1       r 400.0 600.0
5  2       2hB     1       a   0.3   0.4
6  2       2hB     1       r 300.0 400.0
7  2       3hB     1       a   0.6   0.6
8  2       3hB     1       r 600.0 600.0

Here, we use two variable ID and newcolumn as id object. Because in long data, id is treated as a unique variable, if we only use ID, missmatch will happen. Also you can create a new id, ex: idnew.

long1$idnew = 1:nrow(long1)
reshape_toLong(data = long1,
                 id = "idnew",
                 j = "testway",
            value.var.prefix = c("M1","M2"))

Let's going on! In data long2, there may be M1, M2,-------. So long2 is still a wide data. Yeah, we can change is to long data. M as prefix, 1,2,3,-----as new column. But, id should be ID, newcolumn and testway or you can create a new id to long2, which will ensure id unique.

long3=reshape_toLong(data = long2,
                 id = c("ID","newcolumn","testway"),
                 j = "testnumber",
                 value.var.prefix = "M")
long3
   ID newcolumn testway Group testnumber     M
1   1       2hB       a     1          1   0.2
2   1       2hB       a     1          2   0.3
3   1       2hB       r     1          1 200.0
4   1       2hB       r     1          2 300.0
5   1       3hB       a     1          1   0.4
6   1       3hB       a     1          2   0.6
7   1       3hB       r     1          1 400.0
8   1       3hB       r     1          2 600.0
9   2       2hB       a     1          1   0.3
10  2       2hB       a     1          2   0.4
11  2       2hB       r     1          1 300.0
12  2       2hB       r     1          2 400.0
13  2       3hB       a     1          1   0.6
14  2       3hB       a     1          2   0.6
15  2       3hB       r     1          1 600.0
16  2       3hB       r     1          2 600.0

Now, data long3 is an absolutely long data.

prefix is very import, we use prefixes as below

first: M1a, M2a, M1r, M2r
second: M1, M2
third: M

we change id three times, to make it unique

first: ID
second: ID, newcolumn
thrid: ID, newcolumn, testway

j is new column

first: newcolumn
second: testway
third: testnumber

2. A little bit faster

If each measure outcome has 4 outcomes: a2, a3,r2 r3. a: absolut, r: relative, 2: time 2, 3: time 3. Then 1100 columns has 275 measure outcomes(1100/4). So, we have M1a2hB, M2a2hB, M3a2hB------M275a2hB. and M1a3hB, M2a3hB, M3a3hB------M275a3hB, and M3 is like that. IF we use command like that, we will has a much long value.var.prefix. However, we can use faster way to construct prefix by paste0 function.

ma2=paste0("M",1:275,"a")
ma3=paste0("M",1:275,"a")
mr2=paste0("M",1:275,"r")
mr3=paste0("M",1:275,"r")
m=c(ma2,ma3,mr2,mr3)

In df1, we only has 2 measure outcomes, so we can use command below

ma2=paste0("M",1:2,"a")
ma3=paste0("M",1:2,"a")
mr2=paste0("M",1:2,"r")
mr3=paste0("M",1:2,"r")
prefix=c(ma2,ma3,mr2,mr3)

reshape_toLong(data = df1,
                id = "ID",
                 j = "newcolumn",
  value.var.prefix = prefix)

  ID Group newcolumn M1a M2a M1r M2r
1  1     1       2hB 0.2 0.3 200 300
2  1     1       3hB 0.4 0.6 400 600
3  2     1       2hB 0.3 0.4 300 400
4  2     1       3hB 0.6 0.6 600 600

Still, we can use M1, M2----- as prefix, we change a2hB, a3hB, r2hB, r3hB to new column. Then we substring the new column to different columns.

m1=paste0("M",1:2)
m2=paste0("M",1:2)
prefix=c(m1,m2)

long4=reshape_toLong(data = df1,
                id = "ID",
                 j = "newcolumn",
  value.var.prefix = prefix)
long4
  ID Group newcolumn    M1    M2
1  1     1      a2hB   0.2   0.3
2  1     1      a3hB   0.4   0.6
3  1     1      r2hB 200.0 300.0
4  1     1      r3hB 400.0 600.0
5  2     1      a2hB   0.3   0.4
6  2     1      a3hB   0.6   0.6
7  2     1      r2hB 300.0 400.0
8  2     1      r3hB 600.0 600.0

long4$testway=Left(long4$newcolumn,1)
long4$time=Right(long4$newcolumn,3)
long4
  ID Group newcolumn    M1    M2 testway time
1  1     1      a2hB   0.2   0.3       a  2hB
2  1     1      a3hB   0.4   0.6       a  3hB
3  1     1      r2hB 200.0 300.0       r  2hB
4  1     1      r3hB 400.0 600.0       r  3hB
5  2     1      a2hB   0.3   0.4       a  2hB
6  2     1      a3hB   0.6   0.6       a  3hB
7  2     1      r2hB 300.0 400.0       r  2hB
8  2     1      r3hB 600.0 600.0       r  3hB

Last, we can only use M as prefix, to get the absolutely data.

long5=reshape_toLong(data = df1,
                       id = "ID",
                        j = "newcolumn",
         value.var.prefix = "M")
long5
   ID Group newcolumn     M
1   1     1     1a2hB   0.2
2   1     1     1a3hB   0.4
3   1     1     2a2hB   0.3
4   1     1     2a3hB   0.6
5   1     1     1r2hB 200.0
6   1     1     1r3hB 400.0
7   1     1     2r2hB 300.0
8   1     1     2r3hB 600.0
9   2     1     1a2hB   0.3
10  2     1     1a3hB   0.6
11  2     1     2a2hB   0.4
12  2     1     2a3hB   0.6
13  2     1     1r2hB 300.0
14  2     1     1r3hB 600.0
15  2     1     2r2hB 400.0
16  2     1     2r3hB 600.0

Then we can use Left, Mid and Right function in onetree package to substring from left, mid and right to get new columns.

long5$testnumber=Left(long5$newcolumn,1)
long5$testway=Mid(long5$newcolumn,2,1)
long5$time=Right(long5$newcolumn,3)
long5
   ID Group newcolumn     M testnumber testway time
1   1     1     1a2hB   0.2          1       a  2hB
2   1     1     1a3hB   0.4          1       a  3hB
3   1     1     2a2hB   0.3          2       a  2hB
4   1     1     2a3hB   0.6          2       a  3hB
5   1     1     1r2hB 200.0          1       r  2hB
6   1     1     1r3hB 400.0          1       r  3hB
7   1     1     2r2hB 300.0          2       r  2hB
8   1     1     2r3hB 600.0          2       r  3hB
9   2     1     1a2hB   0.3          1       a  2hB
10  2     1     1a3hB   0.6          1       a  3hB
11  2     1     2a2hB   0.4          2       a  2hB
12  2     1     2a3hB   0.6          2       a  3hB
13  2     1     1r2hB 300.0          1       r  2hB
14  2     1     1r3hB 600.0          1       r  3hB
15  2     1     2r2hB 400.0          2       r  2hB
16  2     1     2r3hB 600.0          2       r  3hB

Here, we use different prefix to get different data.

first: use paste0 function to construct
second: M1、M2、M3-------, still paste0 fucntion but more simple
third: we use only M
we did not change id and j

3. Conclusion

In reshape_toLong function:

data: is the data that you want to transform
id: is the unique id variable, which can be one variable or more
j: is new variable name, that you want to stack the time or sequence number
value.var.prefix: is the prefix of value variable