I've been through the various reshape questions but don't believe this iteration has been asked before. I am dealing with a data frame of 81K rows and 4188 variables. Columns 161:4188 are measurements stored as separate variables, and the idvar is in column 1. I want to repeat columns 1:160 and create new records for columns 161:4188. The final data frame will have 162 columns and 326,268,000 rows (81K rows * 4028 measurement variables converted to unique records).

Here is what I tried:

    reshapeddf <- reshape(c, idvar = "PID",
                          varying = c(dput(names(c[161:4188]))),
                          v.names = "viewership",
                          timevar = "network.show",
                          times = c(dput(names(c[161:4188]))),
                          direction = "long")

The operation didn't complete; I waited almost 10 minutes. Is this the right way? I am on a Windows 7 PC with 8 GB RAM and an i5 at 3.20 GHz. What is the most efficient way to complete this transpose in R? Both of the answers by BondedDust and Nick are clever, but I run into memory issues. Is there a way any of the three approaches in this thread (reshape, tidyr, or the do.call) can be implemented using ff?

In the example data below, columns 1:4 are the ones I want to repeat and columns 5:9 are the ones to create new records for.

structure(list(PID = c(1003401L, 1004801L, 1007601L, 1008601L, 
1008602L, 1011901L), HHID = c(10034L, 10048L, 10076L, 10086L, 
10086L, 10119L), HH.START.DATE = structure(c(1378440000, 1362974400, 
1399521600, 1352869200, 1352869200, 1404964800), class = c("POSIXct", 
"POSIXt"), tzone = ""), VISITOR.CODE = structure(c(1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("0", "L"), class = "factor"), WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI = c(0, 
0, 305892, 0, 101453, 0), WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH = c(0, 
0, 0, 0, 127281, 0), WEIGHTED.MINUTES.VIEWED..ABC...CASTLE = c(0, 
27805, 0, 0, 0, 0), WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS = c(0, 
679148, 0, 0, 278460, 498972), WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS = c(0, 
316448, 0, 0, 0, 0)), .Names = c("PID", "HHID", "HH.START.DATE", 
"VISITOR.CODE", "WEIGHTED.MINUTES.VIEWED..ABC...20.20.FRI", "WEIGHTED.MINUTES.VIEWED..ABC...BLACK.ISH", 
"WEIGHTED.MINUTES.VIEWED..ABC...CASTLE", "WEIGHTED.MINUTES.VIEWED..ABC...CMA.AWARDS", 
"WEIGHTED.MINUTES.VIEWED..ABC...COUNTDOWN.TO.CMA.AWARDS"), row.names = c(NA, 
6L), class = "data.frame")
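
To make the target shape concrete, here is a sketch of the same reshape() call applied to this example (assuming the dput() output above is assigned to `ex`; columns 5:9 stand in for 161:4188):

    # sketch: the OP's reshape() call applied to the example data (dput() output in ex)
    long <- reshape(ex, idvar = "PID",
                    varying = names(ex)[5:9],
                    v.names = "viewership",
                    timevar = "network.show",
                    times = names(ex)[5:9],
                    direction = "long")
    dim(long)  # 30 rows (6 persons * 5 shows), 6 columns (4 id columns + network.show + viewership)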
vagabond
  • Can I have a sample of your data? It might be possible to solve it using the data.table package – Michele Usuelli Dec 04 '14 at 23:50
  • Please post a simplified version of the data. Should be able to do that with 3 ID-variables and 5 or 6 value columns. Need to know what the column names look like. – IRTFM Dec 04 '14 at 23:52

2 Answers

Might be as easy as something like this:

   dat2 <- cbind(dat[1:4], stack(dat[5:length(dat)]))
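
Applied to the example data in the question (a sketch, assuming the dput() output is assigned to `dat`), with the two stack() columns renamed to match the names the question asks for:

   # stack() returns columns `values` and `ind`; rename them to match the question
   dat2 <- cbind(dat[1:4], stack(dat[5:length(dat)]))
   names(dat2)[names(dat2) == "values"] <- "viewership"
   names(dat2)[names(dat2) == "ind"]    <- "network.show"
   head(dat2)

cbind() recycles the six id rows across each stacked block, so the id columns line up with each measurement column in turn.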
IRTFM
  • I agree about the object named "c"; just a bad habit of using letters. Also just realized the `reshape` code works with the example data. It's just taking too much time. Is the syntax of the `do.call` you've posted correct? (the comma at the end, etc.) – vagabond Dec 05 '14 at 00:06
  • No, it's not correct. I was just about to delete it until I tested the correct version. – IRTFM Dec 05 '14 at 00:07
  • `Error: cannot allocate vector of size 2.4 Gb In addition: Warning messages: 1: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size) 2: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size) 3: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size) 4: In unlist(unname(x)) : Reached total allocation of 8100Mb: see help(memory.size)` – vagabond Dec 05 '14 at 00:31
  • running into memory issues for both approaches. – vagabond Dec 05 '14 at 00:31
  • Well, my question was centered around the fact that my data set is large, and I think there are ways of working with big data in R, which is what I was hoping would come as suggestions / answers from the experienced R community on SO. – vagabond Dec 05 '14 at 00:40
  • Your object takes up roughly 3.3 GB (`10 * 81000 * 4188` = 3392280000 bytes). There will be one or two intermediate copies of the data as well. I don't think we have enough experience to skirt around the laws of silicon logic. I generally assume I need three times as much RAM as my largest object, and by that rule of thumb yours won't fit in an 8GB machine. – IRTFM Dec 05 '14 at 00:51
  • Can this be implemented using `ff` somehow ? – vagabond Dec 05 '14 at 00:52
  • You are attempting to make a much bigger object than you already have because you are duplicating all the information in those columns 1:160. You would be better off learning to use a database back-end. – IRTFM Dec 05 '14 at 01:02
  • What if my original data was by default 8 GB or larger, with a billion rows? Is the only solution to get more RAM? I think you are skirting the real problem but that's alright. – vagabond Dec 05 '14 at 01:13
  • @vagabond, how is he skirting the issue if your data will not fit with your machine's specs? Do you really need a single long dataset? Maybe it's better to stack chunks of columns, save each chunk, re-read them, and bind them back together? (Or use a database, as BondedDust recommended.) – A5C1D2H2I1M1N2O1R2T1 Dec 05 '14 at 06:29
  • In hindsight, yes I apologize. I can think through the objective better. It's just the first time I've run into memory issues with the data in R. – vagabond Dec 05 '14 at 14:07
  • I read a little about ff and how it creates a map of the original data in memory. Is that a possible solution if I insist on making the data long? – vagabond Dec 05 '14 at 14:16
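
Following up on the chunking idea in the comments above, here is a rough, untested sketch (the column indices, chunk size, output file name, and the name `wide` for the original wide data frame are all assumptions) that stacks blocks of measurement columns and appends each block to disk, so the full 326-million-row result never has to sit in RAM at once:

    # sketch: stack the 4028 measurement columns in blocks and append each block to a file
    id.cols    <- 1:160                      # columns to repeat on every record
    meas.cols  <- 161:4188                   # columns to turn into long records
    block.size <- 250                        # tune to available RAM
    out.file   <- "viewership_long.csv"      # hypothetical output file

    starts <- seq(min(meas.cols), max(meas.cols), by = block.size)
    for (s in starts) {
      cols  <- s:min(s + block.size - 1, max(meas.cols))
      block <- cbind(wide[id.cols], stack(wide[cols]))
      names(block)[names(block) == "values"] <- "viewership"
      names(block)[names(block) == "ind"]    <- "network.show"
      write.table(block, out.file, sep = ",", row.names = FALSE,
                  col.names = (s == starts[1]), append = (s != starts[1]))
      rm(block); gc()                        # release memory before the next block
    }

The file can then be read back in pieces, loaded into a database, or attached via ff/ffdf as needed.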
I think this should work:

library(tidyr)
newdf <- gather(yourdf, program, minutes, -PID:-VISITOR.CODE)
Nick DiQuattro
  • `Error: cannot allocate vector of size 1.2 Gb In addition: Warning messages: 1: In melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Reached total allocation of 8100Mb: see help(memory.size) 2: In melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Reached total allocation of 8100Mb: see help(memory.size) 3: In melt_dataframe(data, as.integer(id.ind - 1), as.integer(measure.ind - : Reached total allocation of 8100Mb: see help(memory.size)` – vagabond Dec 05 '14 at 00:30
  • Running into memory issues. I am on 8 GB RAM. – vagabond Dec 05 '14 at 00:32
  • @vagabond maybe data.table? I don't know it, but it's supposed to be good for larger datasets. – Nick DiQuattro Dec 05 '14 at 00:47
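
For completeness, a sketch of the data.table route suggested in the last comment (again assuming the wide data frame is called `wide`); melt() on a data.table is implemented in C and is typically much faster and more memory-frugal than base reshape(), though the ~326-million-row result still has to fit in memory:

    library(data.table)

    dt   <- as.data.table(wide)              # wide data frame from the question
    long <- melt(dt,
                 id.vars       = names(dt)[1:160],
                 measure.vars  = names(dt)[161:4188],
                 variable.name = "network.show",
                 value.name    = "viewership")

If the melted result still exceeds the available 8 GB, combining this with the chunked write-to-disk sketch under the first answer is an option.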