
I would like to convert an R data.frame containing a list column with elements of uneven length to long format, as in the example below.

testdf <- data.frame(a=1:3,b=I(list(letters[1:2],letters[2:7],letters[1:3])))
testdf
  a            b
1 1         a, b
2 2 b, c, d,....
3 3      a, b, c

The resulting format should look like this:

   a b
1  1 a
2  1 b
3  2 b
4  2 c
5  2 d
6  2 e
7  2 f
8  2 g
9  3 a
10 3 b
11 3 c

A for-loop way to achieve this would look something like this:

resultdf <- data.frame()
for (i in seq_len(nrow(testdf))) {
  resultdf <- rbind(resultdf, data.frame(a = testdf$a[i], b = testdf$b[[i]]))
}

However, this proves to be very slow, and I need to do this for a large data.frame (~6 million rows, with an average list length of 10 items). I tried melt as follows:

library(reshape2)
> melt(testdf, id.var="a", value.var="b")
Error: Can't melt data.frames with non-atomic 'measure' columns

but I'm not even sure whether melt is meant to work with lists. What would be the fastest way to do this?


By the way, I produce the initial data.frame using str_extract_all().

Johan
supersambo
  • `library(data.table) ; setDT(testdf)[, .(b = unlist(b)), by = a]` or `library(splitstackshape) ; res <- listCol_l(testdf, "b")` – David Arenburg Aug 25 '15 at 08:22
  • Thanks a lot! This is it, and it's incredibly fast! I'll go for the data.table library. Can you post this as an answer to my question? – supersambo Aug 25 '15 at 08:33
  • I already closed it as a dupe. These answers already appear there. If you have additional columns in your data, I would suggest the `splitstackshape` package route (it uses `data.table` under the hood). Still not sure why `tidyr::unnest` doesn't work here, but don't care really. – David Arenburg Aug 25 '15 at 08:35
  • @DavidArenburg, not quite sure what you mean. `tidyr::unnest(testdf, b)` works perfectly fine for me.. – talat Aug 25 '15 at 09:17
  • @docendodiscimus I've tried `tidyr::unnest(testdf)` as per the dupe and it didn't work. Not sure when you need to specify the column and when you shouldn't – David Arenburg Aug 25 '15 at 09:23
  • @DavidArenburg, the equivalent of the dupe would be `testdf %>% unnest(b)`. It's in the documentation – talat Aug 25 '15 at 09:27
  • @docendodiscimus oh right, I haven't noticed that the column was actually specified there because of the unnecessary pipe – David Arenburg Aug 25 '15 at 09:29
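
For reference, here is a minimal sketch of the vectorised idea behind the suggestions in the comments, applied to the example data from the question. The base R version is an assumption on my part rather than something proposed in the thread: it repeats each value of `a` once per element of the matching list entry, which avoids the repeated `rbind`. The `data.table` call is the one from the first comment, run on a copy and guarded so the block still runs if the package is not installed. The names `long_base` and `long_dt` are only illustrative.

```r
# Example data from the question: `b` is a list column of uneven length
testdf <- data.frame(a = 1:3,
                     b = I(list(letters[1:2], letters[2:7], letters[1:3])))

# Base R sketch (not from the thread): lengths() gives the number of items
# per row, rep() repeats `a` that many times, unlist() flattens the list
long_base <- data.frame(a = rep(testdf$a, lengths(testdf$b)),
                        b = unlist(testdf$b),
                        stringsAsFactors = FALSE)
print(long_base)

# data.table version from the first comment, applied to a copy so that
# testdf itself is not converted by reference
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- copy(testdf)
  setDT(dt)
  long_dt <- dt[, .(b = unlist(b)), by = a]
  print(long_dt)
}
```

Both versions build the result in one pass instead of growing a data.frame row by row, which is presumably why the data.table call felt so much faster than the loop. I have not timed either on data of the size mentioned in the question.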
