I am trying to come up with a reproducible example (RE) for this question: "Errors related to data frame columns during merging". To qualify as having an RE, the question lacks only reproducible data. However, when I tried the fairly standard approach of dput(head(myDataObj)), the output was a 14 MB file. The problem is that my data object is a list of data frames, so head() only truncates the top-level list and is not applied recursively to the data frames inside it.

I haven't found any options for dput() or head() that would allow me to control the data size recursively for complex objects. Unless I am wrong about the above, what other approaches to creating a minimal RE dataset would you recommend in this situation?

Aleksandr Blekh
  • 1
    Does it only not work for certain data values? Can you just create simulated data with something like `replicate(5, data.frame(x=1:10, y=cumsum(runif(10))), simplify=F)`? – MrFlick Aug 04 '14 at 20:40
  • 3
    Or you can take the head of each of the data.frames in the list first `dput(lapply(myDataObj, head))` but it sounds like that will still be large. – MrFlick Aug 04 '14 at 20:44
  • @MrFlick: Appreciate your fast feedback! I will try both of your recommendations and report back. However, simulated data rather than the real data might not help in building an RE, but, on the other hand, it might expose the fact that the data is the cause of the issues I am experiencing. – Aleksandr Blekh Aug 04 '14 at 21:09
  • @MrFlick: Your `dput(lapply(myDataObj, head))` advice seems to have worked perfectly. Thanks again! I would appreciate it if you could give feedback on my original question, referenced above. – Aleksandr Blekh Aug 04 '14 at 21:48
  • 1
    The other option is to dissect the problem first. In many cases, the actual problem is part of a bigger setting, but the bigger setting isn't necessary to solve the particular problem. Without extra information it's difficult to give more concrete advice obviously – Joris Meys Aug 05 '14 at 18:01

1 Answer

Along the lines of @MrFlick's comment about using lapply, you may use the list-oriented functions of the apply family (such as lapply and rapply) to apply head or sample, depending on your needs, in order to reduce the size both for REs and for testing purposes. (I've found that working with subsets or subsamples of large data sets is preferable for debugging and even for charting.)

It should be noted that head and tail return the first or last elements of a structure. Sometimes these don't contain enough variety for RE purposes, and they are certainly not random, which is where sample may be more useful.
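As a minimal sketch of the difference between the two approaches for a list of data frames, the following uses simulated data as a stand-in for the real object. The name `myDataObj` comes from the question; the element names, the sizes, and the choice of 5 rows are assumptions made for illustration only.

```r
set.seed(42)

# Simulated stand-in for the real object: a list of three data frames.
myDataObj <- replicate(
  3,
  data.frame(x = 1:100, y = cumsum(runif(100))),
  simplify = FALSE
)
names(myDataObj) <- c("a", "b", "c")  # hypothetical element names

# First 5 rows of every data frame (deterministic, but not representative).
first_rows <- lapply(myDataObj, head, n = 5)

# Up to 5 randomly chosen rows of every data frame.
random_rows <- lapply(myDataObj, function(df) {
  idx <- sample.int(nrow(df), size = min(nrow(df), 5))
  df[idx, , drop = FALSE]
})

# Either reduced object can then be passed to dput() for posting.
dput(first_rows)
```

Sampling row indices with `sample.int` rather than sampling the data frame directly keeps whole rows together, so relationships between columns survive in the reduced data.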

Suppose we have a hierarchical tree structure (list of lists of...) and we want to subset each "leaf" while preserving the structure and labels in the tree.

x <- list( 
    a=1:10, 
    b=list( ba=1:10, bb=1:10 ), 
    c=list( ca=list( caa=1:10, cab=letters[1:10], cac="hello" ), cb=toupper( letters[1:10] ) ) )

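NOTE: In the following, how="replace" and how="list" give the same result, because every leaf is processed (the default is classes="ANY"). They differ only when classes restricts which leaves are processed: how="replace" leaves the other leaves unchanged, while how="list" replaces them with deflt (NULL by default).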
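According to the documentation for `rapply`, the two `how` settings treat leaves that fall outside `classes` differently. The sketch below, on a small made-up list (the names `y`, `num` and `chr` are illustrative only), restricts processing to integer leaves to make that visible.

```r
# Small illustrative list with one integer leaf and one character leaf.
y <- list(num = 1:10, sub = list(chr = letters[1:10]))

# Only integer leaves are passed to head(); n = 2 is forwarded via "...".
r_replace <- rapply(y, head, classes = "integer", how = "replace", n = 2)
r_list    <- rapply(y, head, classes = "integer", how = "list",    n = 2)

str(r_replace)  # the character leaf is kept as it was
str(r_list)     # the character leaf is replaced by deflt (NULL)
```

For shrinking data for an RE, how="replace" is therefore the safer choice whenever `classes` is restricted, since unprocessed leaves are not dropped.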

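ALSO NOTE: This won't work well for data.frame leaf nodes, because rapply treats a data frame as a list and descends into its columns instead of treating the data frame as a single leaf.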
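For nested lists whose leaves are data frames, one possible workaround is a small hand-written recursive function that stops at data frames. The helper `shrink` below is hypothetical (it is not part of base R), and the example object `nested` is made up for illustration.

```r
# Hypothetical helper: recursively shrink a nested list, treating
# data frames as leaves rather than descending into their columns.
shrink <- function(obj, n = 2) {
  if (is.data.frame(obj)) {
    head(obj, n)                  # keep the first n rows
  } else if (is.list(obj)) {
    lapply(obj, shrink, n = n)    # recurse; lapply keeps the names
  } else {
    head(obj, n)                  # atomic leaf: keep the first n elements
  }
}

# Made-up nested object mixing data frames and atomic vectors.
nested <- list(
  df  = data.frame(x = 1:10, y = letters[1:10]),
  sub = list(df2 = data.frame(z = 1:10), v = 1:10)
)

small <- shrink(nested)
str(small)
```

The `is.data.frame` check has to come before the `is.list` check, because a data frame is itself a list.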

# Set seed so the example is reproducible with randomized methods:
set.seed(1)

You can use the default head in a recursive apply in this way:

rapply( x, head, how="replace" )

Or pass an anonymous function that modifies the behavior:

# Complete anonymous function
rapply( x, function(y){ head(y,2) }, how="replace" )
# Same behavior, but using the rapply "..." argument to pass the n=2 to head.
rapply( x, head, how="replace", n=2 )
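A quick way to check that the tree structure and labels survive, and that the object really shrinks, is to inspect the result with `str` and compare sizes. This is only a sketch that repeats the example list `x` from above so that it runs on its own; the name `small` is an illustrative choice.

```r
x <- list(
  a = 1:10,
  b = list(ba = 1:10, bb = 1:10),
  c = list(
    ca = list(caa = 1:10, cab = letters[1:10], cac = "hello"),
    cb = toupper(letters[1:10])
  )
)

small <- rapply(x, head, how = "replace", n = 2)

str(small)                 # same nesting and names, shorter leaves
print(object.size(x))      # approximate size before
print(object.size(small))  # approximate size after
```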

The following draws a random sample from each leaf:

# This works because min() caps the sample size at the leaf length,
# in case a leaf is shorter than the requested size.
rapply( x, function(y){ sample(y, size=min(length(y),2) ) }, how="replace" )

# Less efficient, but maybe easier to read:
rapply( x, function(y){ head(sample(y)) }, how="replace" )  

# XXX: The following does **not** work, because `sample` fails
# when `size` is greater than the length of the item being
# sampled (when sampling without replacement); here the leaf
# "hello" has length 1.
rapply( x, function(y){ sample(y, size=2) }, how="replace" )
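One further caveat, documented for `sample`: when its first argument is a single number of at least 1, it samples from `1:x` rather than returning that number. A leaf such as `7` could therefore come back as some other value. The sketch below avoids this by sampling positions with `sample.int`; the helper name `sample_leaf` and the example list are hypothetical.

```r
# Illustrative list with a length-one numeric leaf and a length-one string.
x2 <- list(a = 1:10, b = list(single = 7, word = "hello"))

# Hypothetical helper: pick up to n elements by position, so that
# length-one numeric leaves are returned unchanged.
sample_leaf <- function(y, n = 2) {
  y[sample.int(length(y), size = min(length(y), n))]
}

set.seed(1)
res <- rapply(x2, sample_leaf, how = "replace")
str(res)
```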
Kalin
  • +1 and accepted. Wow - I've had this issue quite a while ago. @MrFlick's comment was enough to solve my problem at the time. However, I appreciate your answer and some details that go beyond his advice. – Aleksandr Blekh Apr 06 '15 at 22:11
  • 1
    Saying "any of the apply functions will be suitable" is likely to be incorrect, since `apply` itself will always coerce its inputs to atomic vectors. – IRTFM Oct 22 '15 at 15:46