I am working with rather large datasets (approx. 4 million rows per month, with 25 numeric attributes and 4 factor attributes). I would like to create a graph that shows, for each of the last 36 months, a boxplot of each numeric attribute per product (product being one of the 4 factor attributes).

So as an example for product A:

                    - 
      _             |          -
     _|_            |         _|_
    |   |           |        |   |
    |   |          _|_       |   |
    |   |         |   |      |---|
    |   |         |---|      |   |
    |---|         |   |      |   |
    |_ _|         |   |      |_ _|
      |           |_ _|        |
      |             |          |
      -             |          -
                    -

 --------------------------------------------------------------
    jan '10      feb '10    mar '10 ................... feb '13

Since these are quite large datasets, I would like some advice on how to approach this. My idea (though I am not sure whether it is possible) is to:

  • a) extract the data per month per product
  • b) create a boxplot for that specific month (so let's say jan'10 for product A)
  • c) store the boxplot summary data somewhere
  • d) repeat a-c for all months until feb '13
  • e) combine all the stored boxplot summary data into one
  • f) plot the combined boxplot
  • g) repeat a-f for all other products

So my main question is: is it possible to combine separate boxplot summaries into one and create the combined graph sketched above from them?

Any help would be appreciated,

Thank you

Geoffrey Stoel
  • To get you started, you can do things like `result <- boxplot(1:10,plot=FALSE)` and then `bxp(result)`. See `?boxplot` and `?bxp`. – thelatemail Feb 28 '13 at 23:39

2 Answers

Here's a long-hand example that you can probably cook something up around:

Read in the individual datasets. Given how large your data is, you might want to overwrite the same object each time or wrap this step in a function.

dset1 <- 1:10
dset2 <- 10:20
dset3 <- 20:30

Store the boxplot summaries; note the `plot=FALSE`:

result1 <- boxplot(dset1,plot=FALSE,names="month1")
result2 <- boxplot(dset2,plot=FALSE,names="month2")
result3 <- boxplot(dset3,plot=FALSE,names="month3")

Group up the data and plot with bxp

mylist <- list(result1, result2, result3)
groupbxp <- do.call(mapply, c(cbind, mylist))
bxp(groupbxp)

Result: [image: the three boxplots (month1, month2, month3) drawn side by side in one plot]
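The same idea can be written as a loop over months that keeps only the per-month summaries. The block below is a minimal sketch, not the answer's own method: the `monthly` list and its simulated values are hypothetical stand-ins for the real monthly extracts, and it uses `boxplot.stats()` and builds the list for `bxp()` by hand. One reason to assemble it explicitly is that the `out` and `group` components (outliers and the box each belongs to) have a different length per month, so they are concatenated here rather than column-bound.

```r
# Hypothetical stand-in for the real monthly extracts: a named list of numeric vectors.
set.seed(1)
monthly <- list(
  jan10 = rnorm(100),
  feb10 = rnorm(100, mean = 1),
  mar10 = c(rnorm(100, mean = 2), 10)   # 10 is a deliberate outlier
)

# Summarise each month separately; only these small summaries need to be stored.
stats_list <- lapply(monthly, boxplot.stats)
out_list   <- lapply(stats_list, `[[`, "out")

# Assemble the list structure that bxp() expects.
combined <- list(
  stats = sapply(stats_list, `[[`, "stats"),            # 5 x k matrix
  n     = sapply(stats_list, `[[`, "n"),                # observations per box
  conf  = sapply(stats_list, `[[`, "conf"),             # 2 x k matrix (notch limits)
  out   = unlist(out_list, use.names = FALSE),          # all outliers, concatenated
  group = rep(seq_along(out_list), lengths(out_list)),  # box index of each outlier
  names = names(stats_list)
)

bxp(combined)
```

Adding a new month would then amount to computing one more `boxplot.stats()` result and appending it to `stats_list` before rebuilding `combined`.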

thelatemail
  • this was exactly what I was looking for.... thanks a lot... this way I can monthly update the overview graph by adding the calculated boxplot to the list... great – Geoffrey Stoel Mar 01 '13 at 06:44
  • Ok.... have been playing around with it (sorry quite new to R)... but not yet fully understanding what the do.call(mapply, c(cbind, mylist)) function is doing.... – Geoffrey Stoel Mar 06 '13 at 17:38
  • 1
    @GeoffreyStoel - this: http://stackoverflow.com/questions/15148451/cbind-items-from-multiple-lists-recursively It is joining the individual boxplot results (`result1`,`result2`...) together into one object so that `bxp` can plot everything at once. At a technical level, it is constructing a call to column bind, or `?cbind` each item of each of (`result1`,`result2`...) together. Try comparing what is printed out for `mylist` vs `groupbxp` and it should be a bit clearer. – thelatemail Mar 06 '13 at 19:49
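To make the `do.call(mapply, c(cbind, mylist))` step more concrete, here is a small sketch with two results only. It assumes nothing beyond base R; the names `via_do_call` and `written_out` are made up for illustration.

```r
result1 <- boxplot(1:10,  plot = FALSE, names = "month1")
result2 <- boxplot(10:20, plot = FALSE, names = "month2")
mylist  <- list(result1, result2)

# c(cbind, mylist) is a list whose first element is the function cbind,
# so do.call() builds the call mapply(cbind, result1, result2).
via_do_call <- do.call(mapply, c(cbind, mylist))
written_out <- mapply(cbind, result1, result2)

# Compare the two ways of writing the same call.
print(all.equal(via_do_call, written_out))

# mapply() walks the components in parallel ("stats" with "stats",
# "n" with "n", ...) and cbind()s them, so each component ends up
# with one column per boxplot.
print(dim(result1$stats))
print(dim(via_do_call$stats))
```

The `do.call` form is only a convenience: it lets the same line work for a list of any length, whereas the written-out call has to name every result.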

You will not be able to predict with absolute precision what the "fivenum" values will be for the combined set of values. Think about the situation with two groups for which you have the 75th percentile of each group and the count of observations in each group. Suppose the percentiles are unequal. You cannot just take the weighted mean of the percentiles to get the 75th percentile of the aggregated values. See the help page for ?boxplot.stats. I would think, however, that you might come very close by using the median values of the fivenum collections. This might be a place to start your examinations.

 # Per-month five-number summary plus count, bound into one row per month
 mo.mtx <- do.call(rbind, tapply(dat$values, dat$month, function(mo.dat) c(fivenum(mo.dat), length(mo.dat))))
 matplot(mo.mtx[, 1:5], type="l")
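The point about percentiles not combining by weighted mean can be checked with a quick simulation. This is a sketch on made-up data: the two groups, their sizes, and their means are assumptions chosen so that the effect is easy to see.

```r
set.seed(42)
group_a <- rnorm(1000, mean = 0)   # hypothetical group A
group_b <- rnorm(3000, mean = 5)   # hypothetical group B

q_a <- quantile(group_a, 0.75)
q_b <- quantile(group_b, 0.75)

# Count-weighted mean of the two per-group 75th percentiles ...
weighted_guess <- weighted.mean(c(q_a, q_b),
                                w = c(length(group_a), length(group_b)))

# ... versus the 75th percentile of the pooled values.
pooled <- unname(quantile(c(group_a, group_b), 0.75))

print(c(weighted_guess = weighted_guess, pooled = pooled))
```

With groups this far apart, the two numbers differ noticeably, which is why per-group summaries cannot in general be turned into the exact summary of the pooled data.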
IRTFM