13

I am dealing with a simple table of the form

date         variable   value
1970-01-01   V1         0.434
1970-01-01   V2         12.12
1970-01-01   V3         921.1
1970-01-02   V1         -1.10
1970-01-03   V3         0.000
1970-01-03   V5         312e6
...          ...        ...

The pairs (date, variable) are unique. I would like to transform this table into a wide-form one.

date         V1         V2         V3         V4         V5        
1970-01-01   0.434      12.12      921.1      NA         NA
1970-01-02   -1.10      NA         NA         NA         NA
1970-01-03   0.000      NA         NA         NA         312e6

And I would like to do it in the fastest possible way, since I have to repeat the operation repeatedly over tables with 1e6 records. In R native mode, I believe that both tapply(), reshape() and d*ply() are dominated speed-wise by data.table. I would like to test the performance of the latter against a sqlite-based solution (or other DB). Has this been done before? Are there performance gains? And, how does one convert tall-to-wide in sqlite, when the number of "wide" fields (the date) is variable and not known in advance?

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
gappy
  • 10,095
  • 14
  • 54
  • 73
  • 1
    have you tried reshape2? reshape2 is more than 10x faster than reshape in this simple case. Pretty impressive. Simple indexing by names is much slower than both. – Eduardo Leoni Mar 15 '11 at 11:28
  • 4
    How about backing up beliefs with data and providing a reproducible example? – hadley Mar 15 '11 at 16:48
  • In terms of speed it will need to be tested. I tend to use databases when working with "larger" data i.e. memory issues... That being said SQL will require you to be specific (i.e. for each case) but some databases do have (SQL Server, ...) functionality built for pivoting. In MySQL you can build a stored procedure which internally uses a prepared statement to build and execute the sql statement for you. In general you could use another language to "build" the query and then execute it (R --> SQLite). I would be very interested if you found a different way of doing this, especially in SQLite. – Jay Mar 16 '11 at 17:24

2 Answers2

4

I use an approach that is based on what tapply does, but is about an order of magnitude faster (primarily as there is no per-cell function call).

Timings using tall from Prasad's post:

pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}

> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
  11.31    0.03   11.36 

> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
   user  system elapsed 
    0.9     0.0     0.9 

Regarding possible issues with ordering, there shouldn't be any problem:

> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE
Charles
  • 4,389
  • 2
  • 16
  • 13
  • Very nice! I tried a simpler version with subscripting by name and it was much slower. +1 – Eduardo Leoni Mar 15 '11 at 15:53
  • 1
    This is basically how acast in reshape2 works. It's probably slower because it's a little more general - it looks like `pivot` assumes that there are no missing values and the data is correctly ordered. – hadley Mar 15 '11 at 16:48
  • @hadley - not sure what you're referring to wrt order? I've updated the answer to show it should be agnostic to ordering. As for `NA`s, while they won't cause any problems if present in the values, it will cause issues if they're in the keys. – Charles Mar 16 '11 at 03:42
1

A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in Sql(ite): here and here. I haven't looked at those too deeply but my impression is that doing it in SQL is ugly, as in: your sql query needs to explicitly mention all possible keys in the query! (someone please correct me if I'm wrong). As for data.table, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.

If you want to do it purely in R, I think tapply is the speed champ here, much faster than acast from reshape2:

Create some tall data, with some holes in it just to make sure the code is doing the right thing:

tall <- data.frame( dt = rep(1:100, 100),
                     tkr = rep( paste('v',1:100,sep=''), each = 100),
                     value = rnorm(1e4)) [-(1:5), ]


> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
   user  system elapsed 
   4.73    0.00    4.73 

> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
   user  system elapsed 
   7.93    0.03    7.98 
Community
  • 1
  • 1
Prasad Chalasani
  • 19,912
  • 7
  • 51
  • 73
  • Try with a larger matrix and you´ll probably see dcast (which for some reason is faster than acast) overtaking tapply. Or that was what I saw here. – Eduardo Leoni Mar 15 '11 at 15:51
  • Please read my question carefully as well as the posts you link to. I explictly asked: "And, how does one convert tall-to-wide in sqlite, when the number of "wide" fields (the date) is *variable and not known in advance?*" The SQLite posts don't address this question at all. Data.table can actually transform to wide. And it's much faster than tapply. – gappy Mar 15 '11 at 16:01
  • This answer doesn't assume the number of wide fields are known in advance. – hadley Mar 15 '11 at 16:44
  • Hadley, I was referring to the SQLite answer. My original question was restricted to SQLite capability. I know that tapply, reshape, **ply() all do the job. I am looking for the fastest alternative, and wondering if SQLite (or a DB is it). – gappy Mar 15 '11 at 20:22
  • To my knowledge those posts show you the way you would need to do that task in SQLite (or many other relational databses). See my comment on your question as well. – Jay Mar 16 '11 at 17:25
  • @gappy: "Data.table can actually transform to wide." Could you link to an example? This seems really useful. – Zach Apr 04 '13 at 19:46