59

[UPDATE: there is now a native transpose() function in data.table package]

I often need to transpose a data.table, every time it takes several lines of code and I am wondering if there's any better solution than mine.

if we take sample table

library(data.table)
mydata <- data.table(col0=c("row1","row2","row3"),
                     col1=c(11,21,31),
                     col2=c(12,22,32),
                     col3=c(13,23,33))

mydata
# col0 col1 col2 col3
# row1   11   12   13
# row2   21   22   23
# row3   31   32   33

and just transpose it with t(), it will be transposed to the matrix with conversion to character type, while applying data.table to such matrix will lose row.names:

t(mydata)
# [,1]   [,2]   [,3]  
# col0 "row1" "row2" "row3"
# col1 "11"   "21"   "31"  
# col2 "12"   "22"   "32"  
# col3 "13"   "23"   "33"  

data.table(t(mydata))
#   V1   V2   V3
# row1 row2 row3
#   11   21   31
#   12   22   32
#   13   23   33

so I had to write a function for this:

tdt <- function(inpdt){
  transposed <- t(inpdt[,-1,with=F]);
  colnames(transposed) <- inpdt[[1]];
  transposed <- data.table(transposed, keep.rownames=T);
  setnames(transposed, 1, names(inpdt)[1]);
  return(transposed);
}

 tdt(mydata)
# col0 row1 row2 row3
# col1   11   21   31
# col2   12   22   32
# col3   13   23   33

is there anything I could optimize here or do it in "nicer" way?

Vasily A
  • 8,256
  • 10
  • 42
  • 76
  • 3
    I'm curious why you'd need to transpose a `data.table`? Transposing is a natural operation for numeric matrices but I have a hard time thinking of problems where it would be a good idea to transpose a by-definition non-uniform object like a `data.table` or a `data.frame`. Care to give an example? – eddi Feb 22 '15 at 03:55
  • 2
    well, most of my experimental data files are tables where main part is indeed numeric matrix, but accompanied by several non-numeric fields, so using `data.table` is quite logical for me. Most typical example would be the table where each column is a sample and each row is a probe; characteristics of probes and/or samples are recorded in the same table. – Vasily A Feb 22 '15 at 04:33
  • can you fread with header = FALSE and then use transpose? – Rivka Mar 07 '19 at 10:31
  • @eddi one example of where transposing a data.frame is useful: in single cell genomics two concepts of storing data are present. one with genes in rows and cells in columns and one vice versa. so if you want to use different software packages you would need to transpose.. – Tapper Aug 23 '19 at 16:07
  • Given that there now is the native `data.table` function `transpose`, it might be worth re-considering what answer to accept. – gofvonx Jan 21 '20 at 11:54
  • thanks @gofvonx, I've added this info. – Vasily A Jan 25 '20 at 06:14

6 Answers6

62

Why not just melt and dcast the data.table?

require(data.table)

dcast(melt(mydata, id.vars = "col0"), variable ~ col0)
#    variable row1 row2 row3
# 1:     col1   11   21   31
# 2:     col2   12   22   32
# 3:     col3   13   23   33
buhtz
  • 10,774
  • 18
  • 76
  • 149
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    hmmm, it looks neat, thanks! (I will keep the question open for a while to see if there are other suggestions) – Vasily A Feb 22 '15 at 04:21
  • @VasilyA, I think that the main question comes back to why you would be using a `data.table` for `matrix`-type data. Are you doing a lot of work with grouping and so on? – A5C1D2H2I1M1N2O1R2T1 Feb 23 '15 at 05:22
  • yes, I use a lot of grouping and "subsetting", often need to select only a part of that big matrix. – Vasily A Feb 23 '15 at 19:52
  • @buhtz, I suppose you know that you should use `library(data.table)` before `dcast` and `melt`? If that doesn't solve your problem, please share the version of "data.table" that you're using. – A5C1D2H2I1M1N2O1R2T1 Dec 20 '18 at 08:58
39

The current docs show a builtin transpose method.

Specifically, you can do:

transpose(mydata, keep.names = "col", make.names = "col0")
##     col row1 row2 row3
## 1: col1   11   21   31
## 2: col2   12   22   32
## 3: col3   13   23   33
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
abalter
  • 9,663
  • 17
  • 90
  • 145
  • 9
    This should be the way to go. But at the moment it doesn't keep the column names as rownames. There is a issue opened https://github.com/Rdatatable/data.table/issues/1886 to ask for it. I guess if we vote there it will be implemented sooner. – skan Mar 08 '17 at 11:46
  • Is there a way to keep numeric as they are instead of converting to characters? – Herman Toothrot May 27 '20 at 17:40
  • 2
    In case anybody else had trouble with it: I had to call it explicitly as `data.table::transpose()` to not get purrr's transpose(). – Tapper Mar 19 '21 at 17:05
21

Here's an alternative solution that only uses data.table and that is closer to the original idea of using t to transpose.

mydata[, data.table(t(.SD), keep.rownames=TRUE), .SDcols=-"col0"]
##      rn V1 V2 V3
## 1: col1 11 21 31
## 2: col2 12 22 32
## 3: col3 13 23 33

If keeping the rownames is important, setnames can be used. Admittedly this becomes a bit clumsy and probably the recast solution is preferable.

setnames(mydata[, data.table(t(.SD), keep.rownames=TRUE), .SDcols=-"col0"], 
         mydata[, c('rn', col0)])[]
##      rn row1 row2 row3
## 1: col1   11   21   31
## 2: col2   12   22   32
## 3: col3   13   23   33
shadow
  • 21,823
  • 4
  • 63
  • 77
  • 3
    This approach is certainly easier to understand. By the way, release notes for v1.9.5 of `data.table` suggest that `reshape2` will soon not be required for `melt` and `dcast`. https://github.com/Rdatatable/data.table#changes-in-v195--in-development-on-github – Frank Feb 23 '15 at 19:51
  • thanks, @shadow! I like that we only use `data.table`, but I should note that column names (i.e. original row names - "row1","row2","row3") - are lost here. – Vasily A Feb 23 '15 at 19:59
8
df <- as.data.frame(t(mydata))

is what I tried and df is a data.frame and the column names on mydata are now row names on df

ascripter
  • 5,665
  • 12
  • 45
  • 68
Japa19
  • 81
  • 1
  • 1
2

Here's a solution that uses a wrapper to tidy up the output of the data.table transpose function.

With really large data sets this seems to be more efficient than the dcast/melt approach (I tested it on a 8000 row x 29000 column data set, the below function works in about 3 minutes but dcast/melt crashed R):

# Function to clean up output of data.table transpose:

transposedt <- function(dt, varlabel) {
  require(data.table)
  dtrows = names(dt)
  dtcols = as.list(c(dt[,1]))
  dtt = transpose(dt)
  dtt[, eval(varlabel) := dtrows]
  setnames(dtt, old = names(dtt), new = c(dtcols[[1]], eval(varlabel)))
  dtt = dtt[-1,]
  setcolorder(dtt, c(eval(varlabel), names(dtt)[1:(ncol(dtt) - 1)]))
  return(dtt)
}

# Some dummy data 
mydt <- data.table(col0 = c(paste0("row", seq_along(1:100))), 
                   col01 = c(sample(seq_along(1:100), 100)),
                   col02 = c(sample(seq_along(1:100), 100)),
                   col03 = c(sample(seq_along(1:100), 100)),
                   col04 = c(sample(seq_along(1:100), 100)),
                   col05 = c(sample(seq_along(1:100), 100)),
                   col06 = c(sample(seq_along(1:100), 100)),
                   col07 = c(sample(seq_along(1:100), 100)),
                   col08 = c(sample(seq_along(1:100), 100)),
                   col09 = c(sample(seq_along(1:100), 100)),
                   col10 = c(sample(seq_along(1:100), 100)))


# Apply the function:
mydtt <- transposedt(mydt, "myvariables")

# View the results:
> mydtt[,1:10]
    myvariables row1 row2 row3 row4 row5 row6 row7 row8 row9
 1:       col01   58   53   14   96   51   30   26   15   68
 2:       col02    6   72   46   62   69    9   63   32   78
 3:       col03   21   36   94   41   54   74   82   64   15
 4:       col04   68   41   66   30   31   78   51   67   26
 5:       col05   49   30   52   78   73   71    5   66   44
 6:       col06   89   35   79   67    6   88   62   97   73
 7:       col07   66   15   27   29   58   40   35   82   57
 8:       col08   55   47   83   30   23   65   48   56   87
 9:       col09   41   10   21   33   55   81   94   25   34
10:       col10   35   17   41   44   21   66   69   61   46

What is also useful is that columns (ex rows) occur in their original order and you can name the variables column something meaningful.

Amy M
  • 967
  • 1
  • 9
  • 19
-1

The tdt function which I provide below should be faster

tdt <- function(DT, transpose.col, ...) {
# The transpose function is efficient, but lacks the keeping of row and colnames
new.row.names <- colnames(DT)
new.row.names <- new.row.names[!new.row.names %in% transpose.col]
new.col.names <- DT[, transpose.col, with = F]
DT <- DT[, !colnames(DT) %in% transpose.col, with = F]
DT <- transpose(DT, ...)
colnames(DT) <- unlist(new.col.names)
DT$var <- new.row.names
# change order of DT after transposing 
setcolorder(DT, c("var", setdiff(names(DT), "var")))
colnames(DT)[1] <- transpose.col
return(DT)
}

library(microbenchmark); library(microbenchmarkCore)
DT <- data.table(x=1:1000, y=paste("name", 1:1000, sep = "_"), z = paste("test", 1:1000, sep = "."))

rbind(microbenchmark(tdt(DT, "y")), 
microbenchmark(dcast(melt(DT, id.vars = "y"), variable ~ y)),
microbenchmark(DT[, data.table(t(.SD), keep.rownames=TRUE), .SDcols=-"y"]))

Unit: milliseconds
expr       min        lq      mean    median        uq        max neval cld

tdt(DT, "y")  3.463842  3.719341  4.308158  3.911599  4.576477  20.406940   100  a 

dcast(melt(DT, id.vars = "y"), variable ~ y)  5.146119  5.496761  5.826647  5.580796  5.870584   9.536541   100  a 

DT[, data.table(t(.SD), keep.rownames = TRUE), .SDcols = -"y"] 29.975567 34.554989 40.807036 36.724430 39.102396 104.242218   100   b

d <- tdt(DT, "y") 

d[1:2, 1:11]
y name_1 name_2 name_3 name_4 name_5 name_6 name_7 name_8 name_9 name_10
1: x      1      2      3      4      5      6      7      8      9      10
2: z test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8 test.9 test.10

DT[1:10, 1:3]
x       y       z
1:  1  name_1  test.1
2:  2  name_2  test.2
3:  3  name_3  test.3
4:  4  name_4  test.4
5:  5  name_5  test.5
6:  6  name_6  test.6
7:  7  name_7  test.7
8:  8  name_8  test.8
9:  9  name_9  test.9
10: 10 name_10 test.10

class(d)
[1] "data.table" "data.frame"