
I am trying to restructure an enormous data frame (about 12,000 cases). In the old data frame one person is one row with about 250 columns (e.g. Person 1, testA1, testA2, testB, ...), and I want all of a person's results for one item in one column: there are 10 test sets overall, each with 24 items (A-Y), so one person ends up with 24 columns and 10 rows. There is also a fixed part before the items A-Y start (personal information like age, gender etc.), which I want to keep as it is (fixdata). The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?

restructure <- function(data, firstcol, numcol, numsets){
    # dummy row of zeros so that rbind has something to bind to; dropped at the end
    out <- data.frame(t(rep(0, (firstcol-1) + numcol)))
    names(out) <- names(data)[1:(firstcol+numcol-1)]
    for(i in 1:nrow(data)){
        fixdata <- data[i, 1:(firstcol-1)]    # fixed part: age, gender, ...
        # walk over the test sets, numcol columns at a time
        for (j in seq(firstcol, (firstcol-1) + numcol*numsets, by = numcol)){
            flexdata <- data[i, j:(j+numcol-1)]
            tmp <- cbind(fixdata, flexdata)
            names(tmp) <- names(data)[1:(firstcol+numcol-1)]
            out <- rbind(out, tmp)            # out grows on every iteration
        }
    }
    out <- out[2:nrow(out), ]                 # drop the dummy first row
    return(out)
}

Thanks in advance!

– Elisa
  • This sounds like a reshape problem. Have a look at function `melt` in package `reshape2`. Your `data.frame` is really not that enormous. I'd be very surprised if `melt` couldn't handle this in less than one second. (Similar questions have cropped up regularly on SO. Do a search for `[r] reshape` for some inspiration.) – Andrie Nov 04 '11 at 11:31
  • Andrie is correct. In general, try not to use for loops on big datasets. Meanwhile, you could experiment as follows: `system.time(restructure([30datasets]))`, then `system.time(restructure([300datasets]))`, etc. This will at least give you an idea of how long your existing code would take for N datasets. – Carl Witthoft Nov 04 '11 at 11:47
  • Adding a test case would help... – Joris Meys Nov 04 '11 at 12:05
  • @CarlWitthoft: loops aren't the main problem, memory allocation is. The `cbind` and `rbind` calls inside the inner loop will make the code _much_ slower than if `out` were pre-allocated outside the first loop and results were assigned via indexing. – Joshua Ulrich Nov 04 '11 at 12:08
  • okay, I'll try `melt`. I've never worked with the reshape package before. Can you give me a hint how I would use `melt` in this specific case? – Elisa Nov 04 '11 at 12:08
  • @JoshuaUlrich: too true. I should have caught that problem. – Carl Witthoft Nov 04 '11 at 14:07

3 Answers


Idea why: you `rbind` to `out` in each iteration. This takes longer on each iteration as `out` grows, so you have to expect worse than linear growth in run time as the data set gets bigger.
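
For completeness, here is a minimal sketch of the pre-allocation fix Joshua Ulrich describes in the comments. It keeps the question's argument names; `restructure2` is a hypothetical name, and it assumes the test sets sit in contiguous blocks of `numcol` columns:

restructure2 <- function(data, firstcol, numcol, numsets){
    n <- nrow(data) * numsets
    # allocate the full result once: fixed columns repeated numsets times,
    # plus one block of numcol columns that gets overwritten below
    out <- data[rep(seq_len(nrow(data)), each = numsets), 1:(firstcol + numcol - 1)]
    for (k in seq_len(numsets)) {
        cols <- (firstcol + (k - 1) * numcol):(firstcol + k * numcol - 1)
        rows <- seq(k, n, by = numsets)    # the set-k row of every person
        out[rows, firstcol:(firstcol + numcol - 1)] <- data[, cols]
    }
    rownames(out) <- NULL
    out
}

This loops 10 times instead of calling `rbind` 120,000 times, and it never grows an object.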

Alternatively, as Andrie says, you can look at `melt`.

Or you can do it with core R: `stack`. Then you need to `cbind` the fixed part yourself to the result (you need to repeat the fixed columns, e.g. with `each = n.var.cols`).

A third alternative would be `array2df` from package arrayhelpers.
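
Since the comments ask how `melt` would look here, a hedged sketch: it assumes the test columns are named item letter plus set number ("A1" ... "Y10") and that the fixed part is ID, age and gender, so adjust to your real names:

library(reshape2)

long <- melt(daten, id.vars = c("ID", "age", "gender"))
v <- as.character(long$variable)
long$item <- substr(v, 1, 1)            # item letter A-Y
long$set  <- substr(v, 2, nchar(v))     # test set number 1-10
# one row per person and test set, one column per item
out <- dcast(long, ID + age + gender + set ~ item, value.var = "value")

This gives 10 rows per person and one column per item A-Y, plus the fixed columns.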

– cbeleites unhappy with SX
  • melt seems to be helpful, however I was wondering how I melt down the dataframe in chunks. Let's say I have ID, TestA1, TestA2, TestB1 and TestB2. I want to have something like: `row 1: ID | TestA1 | TestA2` and `row 2: ID | TestB1 | TestB2` and so on. Is that possible with melt? – Elisa Nov 04 '11 at 12:21
  • `melt` melts the data set all the way, `cast` puts it back in the shape you want. – Ben Bolker Nov 04 '11 at 12:37
  • I can't get it to melt down in chunks, though; it only melts down all the variables one by one... does anyone know how to do that? (see example above) – Elisa Nov 04 '11 at 12:58
  • This answer helped me to understand that the loop probably won't work with my data, so I tried it a different way: http://stackoverflow.com/questions/8011786/r-reshape-restructure-dataframe-by-chunks – Elisa Nov 05 '11 at 12:21

I agree with the others: look into reshape2 and the plyr package; `melt`, `cast` and `dcast` in particular might help you. I just want to add a little in another direction. For one, it might help to make use of smart column names, e.g.:

As <- grep("^testA", names(yourdf))
# returns a vector with the column positions of all testA1 through testA10

Besides, if you 'spend' the two dimensions of a data.frame on test number and test type, there is obviously none left for the person. Sure, you can identify persons by an ID, which you could map to an aesthetic when plotting, but depending on what you want to do it might be better to store them in a list, so that you end up with a list of persons with one data.frame per person. I am not sure what you are trying to do, but I hope this helps.
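
As a hedged sketch of that list idea, assuming `yourdf` has one row per person and an ID column (names hypothetical):

person_list <- split(yourdf, yourdf$ID)   # one one-row data.frame per person
person_list[["1"]]                        # everything for the person with ID 1

`split` is base R and fast, and each person can then be reshaped on its own.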

– Matt Bannert

Maybe you're not getting the hang of plyr or the other functions for reshaping data. How about something more direct and low level? If you currently have one row per person that goes A1, A2, A3, ..., A10, B1-B10, etc., then extract that lump of stuff from your data frame (I'm guessing columns 11-250), make that section the shape you want, and put the pieces back together.

yDat <- data[, 11:250]                       # the 240 test columns
# each person's row becomes a 10 x 24 matrix; column-filling matches the
# A1..A10, B1..B10, ... column order assumed above
yL  <- lapply(1:nrow(data), function(i) matrix(unlist(yDat[i, ]), ncol = 24))
yDF <- do.call(rbind, yL)  # combine the list of matrices returned above into one
yDF <- data.frame(yDF)     # get it back into a data.frame
names(yDF) <- LETTERS[1:24]  # might as well name the columns

That's the fastest way to get the bulk of your data into the shape you want. All the `lapply` call did was add dimension attributes to each row so that it had the shape you wanted, and return the results in a list, which the subsequent lines massaged back into a data.frame. But the result doesn't have any of the ID information from the main data.frame yet. You just need to replicate each row of the first 10 columns 10 times, and the convenience function `merge` can help with that. Make sure a column that is already in your first 10 columns (the ID) is also a column of the new data.frame, and then just merge them.

yInfo <- data[, 1:10]               # the fixed columns, including the ID
yDF$ID <- rep(yInfo$ID, each = 10)  # person i fills 10 consecutive rows
newDat <- merge(yInfo, yDF)         # joins on the common ID column

And now you're done... mostly. You might want to add an extra column that labels the new rows:

newDat$condNum <- rep(1:10, nrow(newDat)/10)  # test set number within each person

This code will run very fast. Your data.frame really isn't that big at all, and much of the above will execute in a couple of seconds.

This is how you should be thinking of data in R. It's not that there aren't convenience functions to handle the bulk of this, but you should prefer approaches that avoid looping as much as possible. Technically, what happened above only had one loop, the `lapply` used right at the start, and it had very little inside it (loops should be compact when you use them). You're writing scalar code, and that is very, very slow in R, even before you start abusing memory by growing an object as you go. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems here.

(read this to better understand your problems in this code... you've made most of the big errors in there)

– John