I've been through the various reshape questions but don't believe this iteration has been asked before. I am dealing with a data frame of 81K rows and 4188 variables. Columns 161:4188 are measurements stored as separate variables, and the idvar is in column 1. I want to repeat columns 1:160 and create new records for columns 161:4188. The final data frame will have 162 columns and 326,268,000 rows (81K rows * 4028 measurement variables converted to unique records).

Here is what I tried:

    reshapeddf <- reshape(c, idvar = "PID",
                          varying = c(dput(names(c[161:4188]))),
                          v.names = "viewership",
                          timevar = "network.show",
                          times = c(dput(names(c[161:4188]))),
                          direction = "long")

The operation didn't complete; I waited almost 10 minutes. Is this the right way? I am on a Windows 7 PC with 8 GB RAM and an i5 at 3.20 GHz. What is the most efficient way to complete this transpose in R? Both of the answers by BondedDust and Nick are clever, but I run into memory issues. Is there a way any of the three approaches in this thread (reshape, tidyr, or the do.call) can be implemented using ff?

In the example data below, columns 1:4 are the ones I want to repeat and columns 5:9 are the ones to create new records for.

structure(list(PID = c(1003401L, 1004801L, 1007601L, 1008601L, 
1008602L, 1011901L), HHID = c(10034L, 10048L, 10076L, 10086L, 
10086L, 10119L), HH.START.DATE = structure(c(1378440000, 1362974400, 
1399521600, 1352869200, 1352869200, 1404964800), class = c("POSIXct", 
"POSIXt"), tzone = ""), VISITOR.CODE = structure(c(1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("0", "L"), class = "factor"), WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI = c(0, 
0, 305892, 0, 101453, 0), WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH = c(0, 
0, 0, 0, 127281, 0), WEIGHTED.MINUTES.VIEWED..ABC...CASTLE = c(0, 
27805, 0, 0, 0, 0), WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS = c(0, 
679148, 0, 0, 278460, 498972), WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS = c(0, 
316448, 0, 0, 0, 0)), .Names = c("PID", "HHID", "HH.START.DATE", 
"VISITOR.CODE", "WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI", "WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH", 
"WEIGHTED.MINUTES.VIEWED..ABC...CASTLE", "WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS", 
"WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS"), row.names = c(NA, 
6L), class = "data.frame")
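
To make the target shape concrete, here is a sketch of the same reshape() call applied to this example (assuming the dput() output above is assigned to `ex`; columns 5:9 stand in for 161:4188):

    # sketch: the OP's reshape() call applied to the example data (dput() output in ex)
    long <- reshape(ex, idvar = "PID",
                    varying = names(ex)[5:9],
                    v.names = "viewership",
                    timevar = "network.show",
                    times = names(ex)[5:9],
                    direction = "long")
    dim(long)  # 30 rows (6 persons * 5 shows), 6 columns (4 id columns + network.show + viewership)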
vagabond
  • Can I have a sample of your data? It might be possible to solve it using the data.table package – Michele Usuelli Dec 04 '14 at 23:50
  • Please post a simplified version of the data. Should be able to do that with 3 ID-variables and 5 or 6 value columns. Need to know what the column names look like. – IRTFM Dec 04 '14 at 23:52

2 Answers

Might be as easy as something like this:

   dat2 <- cbind(dat[1:4], stack(dat[5:length(dat)]))
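
Applied to the example data in the question (a sketch, assuming the dput() output is assigned to `dat`), with the two stack() columns renamed to match the names the question asks for:

   # stack() returns columns `values` and `ind`; rename them to match the question
   dat2 <- cbind(dat[1:4], stack(dat[5:length(dat)]))
   names(dat2)[names(dat2) == "values"] <- "viewership"
   names(dat2)[names(dat2) == "ind"]    <- "network.show"
   head(dat2)

cbind() recycles the six id rows across each stacked block, so the id columns line up with each measurement column in turn.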
IRTFM
  • I agree about the object named "c"; just a bad habit of using letters. Also just realized the `reshape` code works with the example data. It's just taking too much time. Is the syntax of the `do.call` you've posted correct? (the comma at the end, etc.) – vagabond Dec 05 '14 at 00:06
  • No, it's not correct. I was just about to delete it until I tested the correct version. – IRTFM Dec 05 '14 at 00:07
  • `Error: cannot allocate vector of size 2.4 Gb In addition: Warning messages: 1: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size) 2: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size) 3: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size) 4: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size)` – vagabond Dec 05 '14 at 00:31
  • running into memory issues for both approaches. – vagabond Dec 05 '14 at 00:31
  • Well, my question was centered around the fact that my data set is large, and I think there are ways of working with big data in R, which is what I was hoping would come as suggestions / answers from the experienced R community on SO. – vagabond Dec 05 '14 at 00:40
  • Your object takes up roughly 3.3 GB (`10 * 81000 * 4188` = 3392280000 bytes). There will be one or two intermediate copies of the data as well. I don't think we have enough experience to skirt around the laws of silicon logic. I generally assume I need three times as much RAM as my largest object, and by that rule of thumb yours won't fit in an 8GB machine. – IRTFM Dec 05 '14 at 00:51
  • Can this be implemented using `ff` somehow ? – vagabond Dec 05 '14 at 00:52
  • You are attempting to make a much bigger object than you already have because you are duplicating all the information in those columns 1:160. You would be better off learning to use a database back-end. – IRTFM Dec 05 '14 at 01:02
  • What if my original data was by default 8 GB or larger, with a billion rows? Is the only solution to get more RAM? I think you are skirting the real problem but that's alright. – vagabond Dec 05 '14 at 01:13
  • @vagabond, how is he skirting the issue if your data will not fit with your machine's specs? Do you really need a single long dataset? Maybe it's better to stack chunks of columns, save each chunk, re-read them, and bind them back together? (Or use a database, as BondedDust recommended.) – A5C1D2H2I1M1N2O1R2T1 Dec 05 '14 at 06:29
  • In hindsight, yes I apologize. I can think through the objective better. It's just the first time I've run into memory issues with the data in R. – vagabond Dec 05 '14 at 14:07
  • I read a little about ff and how it creates a map of the original data in memory. Is that a possible solution if I insist on making the data long? – vagabond Dec 05 '14 at 14:16
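
Following up on the chunking idea in the comments above, here is a rough, untested sketch (the column indices, chunk size, output file name, and the name `wide` for the original wide data frame are all assumptions) that stacks blocks of measurement columns and appends each block to disk, so the full 326-million-row result never has to sit in RAM at once:

    # sketch: stack the 4028 measurement columns in blocks and append each block to a file
    id.cols    <- 1:160                      # columns to repeat on every record
    meas.cols  <- 161:4188                   # columns to turn into long records
    block.size <- 250                        # tune to available RAM
    out.file   <- "viewership_long.csv"      # hypothetical output file

    starts <- seq(min(meas.cols), max(meas.cols), by = block.size)
    for (s in starts) {
      cols  <- s:min(s + block.size - 1, max(meas.cols))
      block <- cbind(wide[id.cols], stack(wide[cols]))
      names(block)[names(block) == "values"] <- "viewership"
      names(block)[names(block) == "ind"]    <- "network.show"
      write.table(block, out.file, sep = ",", row.names = FALSE,
                  col.names = (s == starts[1]), append = (s != starts[1]))
      rm(block); gc()                        # release memory before the next block
    }

The file can then be read back in pieces, loaded into a database, or attached via ff/ffdf as needed.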
I think this should work:

library(tidyr)
newdf <- gather(yourdf, program, minutes, -PID:-VISITOR.CODE)
Nick DiQuattro
  • `Error: cannot allocate vector of size 1.2 Gb In addition: Warning messages: 1: In melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Reached total allocation of 8100Mb: see help(memory.size) 2: In melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Reached total allocation of 8100Mb: see help(memory.size) 3: In melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Reached total allocation of 8100Mb: see help(memory.size)` – vagabond Dec 05 '14 at 00:30
  • Running into memory issues. I am on 8 GB RAM. – vagabond Dec 05 '14 at 00:32
  • @vagabond maybe data.table? I don't know it, but it's supposed to be good for larger datasets. – Nick DiQuattro Dec 05 '14 at 00:47
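
For completeness, a sketch of the data.table route suggested in the last comment (again assuming the wide data frame is called `wide`); melt() on a data.table is implemented in C and is typically much faster and more memory-frugal than base reshape(), though the ~326-million-row result still has to fit in memory:

    library(data.table)

    dt   <- as.data.table(wide)              # wide data frame from the question
    long <- melt(dt,
                 id.vars       = names(dt)[1:160],
                 measure.vars  = names(dt)[161:4188],
                 variable.name = "network.show",
                 value.name    = "viewership")

If the melted result still exceeds the available 8 GB, combining this with the chunked write-to-disk sketch under the first answer is an option.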