RStudio was crashing when I tried to reshape a particular data frame using dcast (from the reshape2 package). I discovered that the crash was actually happening in R itself, so I ran my casting code in R.app and got the type of error that gives this site its name: Error: segfault from C stack overflow. With the help of Google and SO, I learned that this is a memory access error.

Okay, I got that far, but I don't know where to go from here. I can't provide a true reproducible example, because my data frame is about 558,000 rows and the problem doesn't occur on small toy examples. For example, even if I take, say, a 50,000-row subset of the data, dcast works just fine. Could there be a particular row of data that's causing a problem? If so, can anyone suggest what feature(s) to look for that could be causing the type of error I'm getting?

Here is a subset of the data frame I'm casting from (with fake values for some variables), followed by the dcast call I'm using. I've also included this small snippet of data as dput output below, in case it would be helpful to play around with it. The real data set has about 700 values of prog, 15 values of prog1, and 5 values of fa.type.

  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707

fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)

fa = structure(list(id = c(1, 1, 2, 2, 3, 3), term = structure(c(7L, 
8L, 7L, 8L, 1L, 1L), .Label = c("Fall 2007", "Spring 2008", "Summer 2008", 
"Fall 2008", "Spring 2009", "Summer 2009", "Fall 2009", "Spring 2010", 
"Summer 2010", "Fall 2010", "Spring 2011", "Summer 2011", "Fall 2011", 
"Spring 2012", "Summer 2012", "Fall 2012", "Spring 2013"), class = c("ordered", 
"factor")), yr = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L), 
    nslds = structure(c(7L, 7L, 7L, 7L, 7L, 7L), .Label = c("1st Year, Never Attended", 
    "1st Year, Previously Attended", "2nd Year", "3rd Year", 
    "4th Year", "5th Year+", "Graduate"), class = c("ordered", 
    "factor")), acad.lev = structure(c(6L, 6L, 6L, 6L, 6L, 6L
    ), .Label = c("Freshman", "Sophomore", "Junior", "Senior", 
    "PB Undergrad", "Graduate"), class = c("ordered", "factor"
    )), prog = c("loan 1", "loan 1", "loan 2", "loan 2", "loan 3", 
    "grant 1"), prog1 = c("Other Loans", "Other Loans", "Stafford Loan", 
    "Stafford Loan", "Stafford Loan", "University Grant"), fa.type = structure(c(3L, 
    3L, 3L, 3L, 3L, 2L), .Label = c("Athletic", "Grant", "Loan", 
    "Scholarship", "Waiver", "Work/Study"), class = "factor"), 
    amount = c(5000, 5000, 8781, 8781, 4250, 1707)), .Names = c("id", 
"term", "yr", "nslds", "acad.lev", "prog", "prog1", "fa.type", 
"amount"), row.names = c(NA, 6L), class = "data.frame")
eipi10
  • Maybe you can cut your data into smaller pieces, run dcast on each and bind them together again after. – N8TRO Mar 05 '13 at 18:34
  • I will if I have to (or perhaps I'll try a different reshaping function from base R or from the original reshape package), but I'd like to get to the bottom of this for future reference and also to have a solution on SO in case someone else runs into a similar problem. – eipi10 Mar 05 '13 at 18:38
  • You should report this as an issue at https://github.com/hadley/reshape. You could also try aggregating first (using data.table), then reshaping to wide format; this may reduce the size of the problem, if size is what is causing the segfault. – mnel Mar 06 '13 at 00:26
  • I've reported the issue. Thanks for the suggestion. Also, I was able to use the `cast` function in the `reshape` package to get my data reshaped, but I'd still like to know what's causing this error. If @Hadley and his team report anything on GitHub, I will post it here. – eipi10 Mar 06 '13 at 16:53
  • I know it might be a pain, but it would really help if you could try to post some code that simulates some data that reproduces the error. – nograpes Apr 04 '13 at 04:05
  • I've provided Hadley with a reproducible example and will report back here once the issue is resolved. I haven't been able to reproduce the error with simulated data and my real data file is 558,000 rows. – eipi10 Apr 04 '13 at 23:22
  • FYI, this problem has now been fixed. See https://github.com/hadley/reshape/issues/31. – eipi10 Apr 28 '14 at 21:36
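
The "aggregate first, then reshape" workaround suggested in the comments can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the commenters' exact approach: it swaps in base R's `aggregate()` and `reshape()` for data.table and dcast, and the toy data frame (including the extra 1000 row) and the helper column `key` are made up for the example.

```r
# Toy data with the same column names as the question (values are illustrative).
fa <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3, 3),
  term    = c("Fall 2009", "Spring 2010", "Fall 2009", "Spring 2010",
              "Fall 2007", "Fall 2007", "Fall 2007"),
  prog1   = c("Other Loans", "Other Loans", "Stafford Loan", "Stafford Loan",
              "Stafford Loan", "Stafford Loan", "University Grant"),
  fa.type = c("Loan", "Loan", "Loan", "Loan", "Loan", "Loan", "Grant"),
  amount  = c(5000, 5000, 8781, 8781, 4250, 1000, 1707),
  stringsAsFactors = FALSE
)

# Step 1: sum amounts while the data is still long, so the reshape step
# sees at most one row per (id, term, prog1, fa.type) combination.
fa.agg <- aggregate(amount ~ id + term + prog1 + fa.type, data = fa, FUN = sum)

# Step 2: build a single column key (hypothetical helper column) and go wide.
fa.agg$key <- paste(fa.agg$prog1, fa.agg$fa.type, sep = "_")
fa.wide <- reshape(fa.agg[, c("id", "term", "key", "amount")],
                   idvar = c("id", "term"), timevar = "key",
                   v.names = "amount", direction = "wide")

# Combinations that never occur come back as NA; treat them as 0 here,
# mirroring what summing an empty group would give.
fa.wide[is.na(fa.wide)] <- 0

print(fa.wide)
```

Whether pre-aggregating avoids the crash depends on what actually triggers it, so treat this as something to try rather than a guaranteed fix.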

3 Answers

This isn't an answer, but a simple (nonsensical) reproducible example that wouldn't fit in the comments. You can recreate the error with the following code (at least on my MacBook Pro).

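library(reshape2)
n <- 1448
# Two rows per student, so that fun.aggregate = sum has something to aggregate
df <- data.frame(Student = rep(1:n, each = 2), Grade = sample(100, n * 2, replace = TRUE))
# Casting Student against itself asks for an n x n result
df2 <- dcast(df, Student ~ Student, value.var = "Grade", fun.aggregate = sum)
# Error: segfault from C stack overflow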

The error occurs at the boundary n = 1448, i.e. it doesn't occur when n=1447 and below. It seems that the error is coming from split_indices in split-numeric.c from the package plyr. It could have to do with the fact that the number of grouping levels is assigned to an (unsigned?) integer value, and if the number of groups goes over 32767 it causes a memory access error, but TBH I'm clutching at straws now.
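
The 32767 guess can be sanity-checked with plain arithmetic. The sketch below assumes (this is not confirmed anywhere in this thread) that the number of groups passed to split_indices equals the number of cells in the cast result, n × n for `Student ~ Student`, and that each group costs one 4-byte C int. The helper names are made up for illustration; the stack size is the 8192K figure from the crash log further down.

```r
# Hypothetical helpers for illustration; not part of reshape2 or plyr.
# Assumption: casting Student ~ Student produces one group per cell, n * n.
cells <- function(n) n * n

# Assumption: one 4-byte C int is needed per group.
bytes_needed <- function(n) cells(n) * 4

# Main-thread stack size reported in the crash log (8192K).
stack_bytes <- 8192 * 1024

# Smallest n at which the number of cells exceeds 32767.
first_over <- min(which(cells(1:2000) > 32767))
print(first_over)

# Bytes needed on either side of the observed boundary, and what is left
# of an 8192K stack in each case.
for (n in c(1447, 1448)) {
  print(c(n = n, bytes = bytes_needed(n), headroom = stack_bytes - bytes_needed(n)))
}
```

Under these assumptions, a 32767-group limit would already be crossed at n = 182, well below the observed boundary. By contrast, n = 1448 needs 8,386,816 bytes, which is just under the 8,388,608-byte stack shown in the crash log. That points more towards a stack-size limit than a 16-bit counter, but it is only a coincidence check, not a diagnosis.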

My sessionInfo() in case anyone can't recreate this error is:

R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.2

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2

Interestingly, if I run the df2 <- command again after getting the first error, R crashes out completely and I get an OS-generated error report. I include the relevant portion of the crash log here:

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_PROTECTION_FAILURE at 0x00007fff5f3ff120

VM Regions Near 0x7fff5f3ff120:
    JS JIT generated code  00004d431a401000-00004d431a402000 [    4K] ---/rwx SM=NUL  
--> STACK GUARD            00007fff5bc00000-00007fff5f400000 [ 56.0M] ---/rwx SM=NUL  stack guard for thread 0
    Stack                  00007fff5f400000-00007fff5fc00000 [ 8192K] rw-/rwx SM=COW  thread 0

Application Specific Information:
objc[57147]: garbage collection is OFF

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_c.dylib               0x00007fff897c4632 small_free_scan_madvise_free + 41
1   libsystem_c.dylib               0x00007fff897c5f06 szone_free_definite_size + 4186
2   libsystem_c.dylib               0x00007fff897fe789 free + 194
3   libR.dylib                      0x0000000100222dbf R_gc_internal + 7327 (memory.c:952)
4   libR.dylib                      0x0000000100224919 Rf_allocVector + 841 (memory.c:2356)
5   plyr.so                         0x000000010144bd2c split_indices + 204 (split-numeric.c:23)
6   libR.dylib                      0x00000001001b4cc7 do_dotcall + 16311 (dotcode.c:593)
7   libR.dylib                      0x00000001001e4448 Rf_eval + 1672 (eval.c:494)
8   libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
9   libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
10  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
11  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
12  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
13  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
14  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
15  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
16  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
17  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
18  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
19  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
20  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
21  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
22  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
23  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
24  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
25  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
26  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
27  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
28  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
29  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
30  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
31  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
32  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
33  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
34  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
35  libR.dylib                      0x000000010021c761 R_ReplDLLdo1 + 481 (main.c:362)
36  org.R-project.R                 0x0000000100022c24 run_REngineRmainloop + 196
37  org.R-project.R                 0x00000001000159b7 -[REngine runREPL] + 119
38  org.R-project.R                 0x0000000100001f24 main + 852
39  org.R-project.R                 0x0000000100001914 start + 52
Simon O'Hanlon
  • @hadley shall I submit a bug report for this since I can reproduce this crash? – Simon O'Hanlon Apr 21 '13 at 21:26
  • Hi @SimonO101, have you been able to get around this problem? I am running into the exact same problem on R version 3.0.1... – GodinA Aug 13 '13 at 17:56
  • hum... @SimonO101, any suggestions? Split dataframe into several, then run again dcast? thanks! – GodinA Aug 15 '13 at 12:52
  • @GodinA use the `cast` function in the `reshape` package for now. It works basically the same way as `dcast` but doesn't suffer from the segfault bug (as far as I can tell). – eipi10 Aug 29 '13 at 17:10
  • It does not crash any more on `R version 3.0.3`. – djhurio Mar 12 '14 at 07:27
  • @djhurio it's nothing to do with the R version change, rather the bug was fixed in the package. See [**Reshape fix #31**](https://github.com/hadley/reshape/issues/31) – Simon O'Hanlon Mar 12 '14 at 23:47

I had the same problem when pivoting a long table to a wide one using dcast from the reshape2 package. I found a solution in this post: "plyr split_indices function crashes for long vectors". Specifically, you can download split-numeric.c and loop-apply.c from https://github.com/hadley/plyr/tree/master/src, uninstall the plyr package from the R console, and then reinstall it from the local source: install.packages('/path/to/source', repos=NULL, type='source').

This solves my problem, hope it helps.
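
The uninstall and reinstall steps above might be wrapped up like this. It is a sketch under stated assumptions, not a tested recipe: the function name `reinstall_plyr_from_source` and its `src_dir` argument are made up for illustration, and `src_dir` is assumed to point at a local copy of the plyr source tree that already contains the patched C files.

```r
# Hypothetical helper; the name and argument are illustrative only.
# src_dir: path to a local plyr source directory with the patched files in src/.
reinstall_plyr_from_source <- function(src_dir) {
  # Fail early, before removing anything, if the path does not exist.
  stopifnot(dir.exists(src_dir))

  # Remove the installed copy of plyr, if there is one.
  if ("plyr" %in% rownames(installed.packages())) {
    remove.packages("plyr")
  }

  # Build and install from the local source directory
  # (this needs a working C toolchain on the machine).
  install.packages(src_dir, repos = NULL, type = "source")
}

# Example call, left commented out because it changes the installed packages:
# reinstall_plyr_from_source("/path/to/source")
```

Restart R afterwards so that reshape2 picks up the rebuilt plyr.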

X.X

Just to close out this old question: this was a bug, and it was fixed as described in this GitHub issue.

eipi10