Methodology of high-dimensional data structuring in R vs. MATLAB

Question

What is the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? I don't want to slip back to MATLAB.

Explanation

I like R's analysis functions and syntax (and stunning plots) much better than MATLAB's, and have been working hard to refactor my stuff over. However, I keep getting hung up on the way data is organized in my work.

MATLAB

It's typical for me to work with multivariate time series repeated over many trials, which are stored in a big ~~matrix~~ ~~rank-3 tensor~~ multidimensional array of SERIESxSAMPLESxTRIALS. This lends itself to some nice linear algebra stuff occasionally, but is clumsy when it comes to another variable, namely CLASS. Typically class labels are stored in another vector of dimension 1xTRIALS.

When it comes to analysis I basically plot as little as possible, because it takes so much work to get together a really good plot that teaches you a lot about the data in MATLAB. (I'm not the only one who feels this way).

R

In R I've been sticking as close as I can to the MATLAB structure, but things get annoyingly complex when trying to keep the class labeling separate; I'd have to keep passing the labels into functions even though I'm only using their attributes. So what I've done is separate the array into a list of arrays by CLASS. This adds complexity to all of my apply() functions, but seems to be worth it in terms of keeping things consistent (and bugs out).

On the other hand, R just doesn't seem to be friendly with tensors/multidimensional arrays. Just to work with them, you need to grab the abind library. Documentation on multivariate analysis, like this example, seems to operate under the assumption that you have a huge 2-D table of data points like ~~some long medieval scroll~~ a data frame, and doesn't mention how to get 'there' from where I am.

Once I get to plotting and classifying the processed data, it's not such a big problem, since by then I've worked my way down to data frame-friendly structures with shapes like TRIALSxFEATURES (melt has helped a lot with this). On the other hand, if I want to quickly generate a scatterplot matrix or latticist histogram set for the exploratory phase (i.e. statistical moments, separation, in/between-class variance, histograms, etc.), I have to stop and figure out how I'm going to apply() these huge multidimensional arrays into something those libraries understand.

If I keep pounding around in the jungle coming up with ad-hoc solutions for this, I'm either never going to get better or I'll end up with my own weird wizardly ways of doing it that don't make sense to anybody.

So what's the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? Please, I don't want to slip back to MATLAB.

Bonus: I tend to repeat these analyses over identical data structures for multiple subjects. Is there a better general way than wrapping the code chunks into for loops?

Many R functions expect that your data is in "long format", i.e., what you get when you `melt` your data. There are exceptions, but in general you should store your data in data.frames (or if it's big in data.tables) with value column(s) and factor columns (classifiers). But you also might want to have a look at packages that offer time series objects in R more sophisticated than the base `ts` object (e.g., the xts package). — Roland, Jan 12 '14 at 12:04
I do that for the exploratory part, for sure, but for example, if I am applying some transformation across all the matrices represented by each trial, I'd have to do some logical/categorical indexing down the long table to get the same result. I don't know if that's idiomatic, efficient, or natural. — bright-star, Jan 15 '14 at 07:14
It is idiomatic. Have a look at package data.table, which makes this extremely efficient. — Roland, Jan 15 '14 at 08:04
Btw., I don't say you shouldn't use arrays. They are very useful and you can write some very efficient code with arrays. It's just that many functions expect other input. And a long format data.frame with many factor variables is easier to wrap your head around than a multidimensional array. — Roland, Jan 15 '14 at 10:15
I have put a copy of the core question on top, especially MATLAB users may not be eager to read all the way to the bottom to find out that they cannot contribute. — Dennis Jaheruddin, Jan 15 '14 at 12:41
Regarding your "Bonus": Put everything in one big data.table and use the `by` syntax of the package. — Roland, Jan 15 '14 at 13:24
Looks like I need to take a spare week and sit down with the data.table package. — bright-star, Jan 15 '14 at 21:17
I'd also recommend reading http://vita.had.co.nz/papers/tidy-data.html - it lays out my philosophy of working with data in R. — hadley, Mar 26 '14 at 15:27
@hadley Wow, that was worth it. Since then, have you had any ideas about working with multi-dimensional arrays? So-called "epoched" multivariate data of that shape is pretty common in biomedical fields, where MATLAB reigns. — bright-star, Apr 02 '14 at 10:24
@TrevorAlexander yes, a little, basically `dplyr::tbl_cube`. — hadley, Apr 02 '14 at 20:30
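
A minimal sketch of what the comments above suggest for the bonus question (one big table, grouped with `by`), assuming each subject's data is a SERIES x SAMPLES x TRIALS array held in a named list. The names `make_arr`, `subjects`, `to_long`, `all.dt` and `stats` are made up for illustration, and the data is random.

```r
library(data.table)

set.seed(42)
# Hypothetical data: two subjects, each a 3 x 2 x 4 array (series x samples x trials)
make_arr <- function() {
  array(runif(3 * 2 * 4), dim = c(3, 2, 4),
        dimnames = list(ser = paste("ser", 1:3),
                        smp = paste("smp", 1:2),
                        tr  = paste("tr", 1:4)))
}
subjects <- list(s01 = make_arr(), s02 = make_arr())

# Base-R melt: one row per array cell, dimnames become columns
to_long <- function(a) {
  as.data.table(as.data.frame(as.table(a), responseName = "value",
                              stringsAsFactors = FALSE))
}

# Stack all subjects into one long table; `idcol` records the list names
all.dt <- rbindlist(lapply(subjects, to_long), idcol = "subject")

# One grouped call replaces the for loop over subjects
stats <- all.dt[, list(mean = mean(value), sd = sd(value)),
                by = list(subject, ser)]
```

Adding another subject then only means adding another element to the list; the grouped call stays the same.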

score 19 · Answer 1 · edited Jan 18 '14 at 00:49

As has been pointed out, many of the more powerful analytical and visualization tools rely on data in long format. Certainly for transformations that benefit from matrix algebra you should keep stuff in arrays, but as soon as you want to run parallel analyses on subsets of your data, or plot stuff by factors in your data, you really want to melt.

Here is an example to get you started with data.table and ggplot.

Array -> Data Table

First, let's make some data in your format:

series <- 3
samples <- 2
trials <- 4

trial.labs <- paste("tr", seq_len(trials))
trial.class <- sample(c("A", "B"), trials, replace=TRUE)

arr <- array(
  runif(series * samples * trials), 
  dim=c(series, samples, trials),
  dimnames=list(
    ser=paste("ser", seq(len=series)), 
    smp=paste("smp", seq(len=samples)), 
    tr=trial.labs
  )
)
# , , tr = Trial 1
#        smp
# ser         smp 1     smp 2
#   ser 1 0.9648542 0.4134501
#   ser 2 0.7285704 0.1393077
#   ser 3 0.3142587 0.1012979
#
# ... omitted 2 trials ...
# 
# , , tr = Trial 4
#        smp
# ser         smp 1     smp 2
#   ser 1 0.5867905 0.5160964
#   ser 2 0.2432201 0.7702306
#   ser 3 0.2671743 0.8568685

Now we have a 3 dimensional array. Let's melt it and turn it into a data.table (note that reshape2's melt returns a data.frame, which is basically a data.table sans bells & whistles, so we have to first melt, then convert to data.table):

library(reshape2)
library(data.table)

dt.raw <- data.table(reshape2::melt(arr), key="tr")  # we'll get to what the `key` arg is doing later
#       ser   smp   tr      value
#  1: ser 1 smp 1 tr 1 0.53178276
#  2: ser 2 smp 1 tr 1 0.28574271
#  3: ser 3 smp 1 tr 1 0.62991366
#  4: ser 1 smp 2 tr 1 0.31073376
#  5: ser 2 smp 2 tr 1 0.36098971
# ---                            
# 20: ser 2 smp 1 tr 4 0.38049334
# 21: ser 3 smp 1 tr 4 0.14170226
# 22: ser 1 smp 2 tr 4 0.63719962
# 23: ser 2 smp 2 tr 4 0.07100314
# 24: ser 3 smp 2 tr 4 0.11864134

Notice how easy this was, with all our dimension labels trickling through to the long format. One of the bells & whistles of data.tables is the ability to do indexed merges between data.tables (much like MySQL indexed joins). So here, we will do that to bind the class to our data:

dt <- dt.raw[J(trial.labs, class=trial.class)]  # on the fly mapping of trials to class
#          tr   ser   smp     value class
#  1: Trial 1 ser 1 smp 1 0.9648542     A
#  2: Trial 1 ser 2 smp 1 0.7285704     A
#  3: Trial 1 ser 3 smp 1 0.3142587     A
#  4: Trial 1 ser 1 smp 2 0.4134501     A
#  5: Trial 1 ser 2 smp 2 0.1393077     A
# ---                                    
# 20: Trial 4 ser 2 smp 1 0.2432201     A
# 21: Trial 4 ser 3 smp 1 0.2671743     A
# 22: Trial 4 ser 1 smp 2 0.5160964     A
# 23: Trial 4 ser 2 smp 2 0.7702306     A
# 24: Trial 4 ser 3 smp 2 0.8568685     A

A few things to understand:

J creates a data.table from vectors
attempting to subset the rows of one data.table with another data.table (i.e. using a data.table as the first argument after the bracket in [.data.table) causes data.table to left join (in MySQL parlance) the outer table (dt.raw in this case) to the inner table (the one created on the fly by J). The join is done on the key column(s) of the outer data.table, which as you may have noticed we defined in the melt/data.table conversion step earlier.

You'll have to read the documentation to fully understand what's going on, but think of J(trial.labs, class=trial.class) being effectively equivalent to creating a data.table with data.table(trial.labs, class=trial.class), except J only works when used inside [.data.table.

So now, in one easy step we have our class data attached to the values. Again, if you need matrix algebra, operate on your array first, and then in two or three easy commands switch back to the long format. As noted in the comments, you probably don't want to be going back and forth from the long to array formats unless you have a really good reason to be doing so.
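
For comparison, here is a sketch of the same kind of class lookup written with an explicit lookup table and the `on=` argument, which names the join column instead of relying on a key. Note that `on=` arrived in data.table releases later than this answer, and that `long`, `labels` and `joined` are illustrative names with made-up values.

```r
library(data.table)

# Small long-format table: 2 series x 3 trials
long <- data.table(
  tr    = rep(c("tr 1", "tr 2", "tr 3"), each = 2),
  ser   = rep(c("ser 1", "ser 2"), times = 3),
  value = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
)

# One row per trial, holding that trial's class label
labels <- data.table(tr = c("tr 1", "tr 2", "tr 3"),
                     class = c("A", "B", "A"))

# For each row of `labels`, pull the matching rows of `long` and append `class`
joined <- long[labels, on = "tr"]
```

The shape is the same as the `J()` call: the table inside the brackets drives the lookup, and its extra columns are attached to the matching rows.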

Once things are in data.table, you can group/aggregate your data quite easily (similar in concept to the split-apply-combine style). Suppose we want to get summary statistics for each class-sample combination:

dt[, as.list(summary(value)), by=list(class, smp)]

#    class   smp    Min. 1st Qu. Median   Mean 3rd Qu.   Max.
# 1:     A smp 1 0.08324  0.2537 0.3143 0.4708  0.7286 0.9649
# 2:     A smp 2 0.10130  0.1609 0.5161 0.4749  0.6894 0.8569
# 3:     B smp 1 0.14050  0.3089 0.4773 0.5049  0.6872 0.8970
# 4:     B smp 2 0.08294  0.1196 0.1562 0.3818  0.5313 0.9063

Here, we just give data.table an expression (as.list(summary(value))) to evaluate for every class, smp subset of the data (as specified in the by expression). We need as.list so that the results are re-assembled by data.table as columns.

You could just as easily have calculated moments (e.g. list(mean(value), var(value), mean((value - mean(value))^3))) for any combination of the class/sample/trial/series variables.
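
A small sketch of that moments calculation, on made-up numbers so the results are easy to check by hand. The column names `mean`, `var` and `m3` are my own choice, and `m3` here is the third central moment, not a standardized skewness.

```r
library(data.table)

dt <- data.table(
  class = rep(c("A", "B"), each = 4),
  value = c(1, 2, 3, 4, 5, 6, 7, 12)
)

moments <- dt[, list(
  mean = mean(value),
  var  = var(value),                      # sample variance (n - 1 denominator)
  m3   = mean((value - mean(value))^3)    # third central moment
), by = class]
```

Class A is symmetric around its mean, so its third moment is zero, while the outlier in class B pushes its third moment positive.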

If you want to do simple transformations to the data it is very easy with data.table:

dt[, value:=value * 10]  # modify in place with `:=`, very efficient
dt[1:2]                  # see, `value` now 10x    
#         tr   ser   smp    value class
# 1: Trial 1 ser 1 smp 1 9.648542     A
# 2: Trial 1 ser 2 smp 1 7.285704     A

This is an in-place transformation, so there are no memory copies, which makes it fast. Generally data.table tries to use memory as efficiently as possible and as such is one of the fastest ways to do this type of analysis.
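
The same `:=` operator can be combined with `by`, which may be the more natural fit for per-trial transformations. A minimal sketch with made-up values, where the new column name `centered` is an assumption for illustration:

```r
library(data.table)

dt <- data.table(
  tr    = rep(c("tr 1", "tr 2"), each = 3),
  ser   = rep(c("ser 1", "ser 2", "ser 3"), times = 2),
  value = c(1, 2, 3, 10, 20, 60)
)

# Subtract each trial's own mean; `:=` adds the column without copying `dt`
dt[, centered := value - mean(value), by = tr]
```

Each group is processed separately, but the result is written straight back into the original table, so no join is needed afterwards.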

Plotting From Long Format

ggplot is fantastic for plotting data in long format. I won't get into the details of what's happening, but hopefully the images will give you an idea of what you can do:

library(ggplot2)
ggplot(data=dt, aes(x=ser, y=smp, color=class, size=value)) + 
  geom_point() +
  facet_wrap( ~ tr)

ggplot(data=dt, aes(x=tr, y=value, fill=class)) + 
  geom_bar(stat="identity") +
  facet_grid(smp ~ ser)

ggplot(data=dt, aes(x=tr, y=paste(ser, smp))) + 
  geom_tile(aes(fill=value)) + 
  geom_point(aes(shape=class), size=5) + 
  scale_fill_gradient2(low="yellow", high="blue", midpoint=median(dt$value))

Data Table -> Array -> Data Table

First we need to acast (from package reshape2) our data table back to an array:

arr.2 <- acast(dt, ser ~ smp ~ tr, value.var="value")
dimnames(arr.2) <- dimnames(arr)  # unfortunately `acast` doesn't preserve dimnames properly
# , , tr = Trial 1
#        smp
# ser        smp 1    smp 2
#   ser 1 9.648542 4.134501
#   ser 2 7.285704 1.393077
#   ser 3 3.142587 1.012979
# ... omitted 3 trials ...

At this point, arr.2 looks just like arr did, except with values multiplied by 10. Note we had to drop the class column. Now, let's do some trivial matrix algebra:

shuff.mat <- matrix(c(0, 1, 1, 0), nrow=2) # re-order columns
for(i in 1:dim(arr.2)[3]) arr.2[, , i] <- arr.2[, , i] %*% shuff.mat
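
As a side note, the slice-by-slice loop can be checked against two loop-free forms. This is only a sketch on a fresh random array without dimnames; `looped`, `indexed` and `applied` are illustrative names.

```r
set.seed(1)
arr <- array(runif(3 * 2 * 4), dim = c(3, 2, 4))

# Permutation matrix: right-multiplying by it swaps the two columns
shuff.mat <- matrix(c(0, 1, 1, 0), nrow = 2)

# Loop version: multiply every trial slice by shuff.mat
looped <- arr
for (i in seq_len(dim(looped)[3])) looped[, , i] <- looped[, , i] %*% shuff.mat

# Same result for this particular matrix: index the second dimension in reverse
indexed <- arr[, 2:1, ]

# General case: apply a matrix function to every slice, then restore the shape
applied <- array(apply(arr, 3, function(m) m %*% shuff.mat), dim = dim(arr))
```

The indexing shortcut only works because `shuff.mat` is a pure column swap; the `apply()` form works for any function that maps a slice to a matrix of the same size.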

Now, let's go back to long format with melt. Note the key argument:

dt.2 <- data.table(reshape2::melt(arr.2, value.name="new.value"), key=c("tr", "ser", "smp"))

Finally, let's join back dt and dt.2. Here you need to be careful. The behavior of data.table is that the inner table will be joined to the outer table based on all the keys of the outer table if the inner table has no keys. If the inner table has keys, data.table will join key to key. This is a problem here because our intended outer table, dt, already has a key on only tr from earlier, so our join will happen on that column only. Because of that, we need to either drop the key on the outer table, or reset the key (we chose the latter here):

setkey(dt, tr, ser, smp)
dt[dt.2]
#          tr   ser   smp    value class new.value
#  1: Trial 1 ser 1 smp 1 9.648542     A  4.134501
#  2: Trial 1 ser 1 smp 2 4.134501     A  9.648542
#  3: Trial 1 ser 2 smp 1 7.285704     A  1.393077
#  4: Trial 1 ser 2 smp 2 1.393077     A  7.285704
#  5: Trial 1 ser 3 smp 1 3.142587     A  1.012979
# ---                                             
# 20: Trial 4 ser 1 smp 2 5.160964     A  5.867905
# 21: Trial 4 ser 2 smp 1 2.432201     A  7.702306
# 22: Trial 4 ser 2 smp 2 7.702306     A  2.432201
# 23: Trial 4 ser 3 smp 1 2.671743     A  8.568685
# 24: Trial 4 ser 3 smp 2 8.568685     A  2.671743

Note that data.table carries out joins by matching key columns, that is - by matching the first key column of the outer table to the first column/key of the inner table, the second to the second, and so on, not considering column names (there's a FR here). If your tables / keys are not in the same order (as was the case here, if you noticed), you either need to re-order your columns or make sure that both tables have keys on the columns you want in the same order (what we did here). The reason the columns were not in the correct order is because of the first join we did to add the class in, which joined on tr and caused that column to become the first one in the data.table.
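
In data.table versions released after this answer, the `on=` argument offers a way around the positional matching: it matches columns by name. A minimal sketch with made-up values, where the two tables deliberately list the shared columns in a different order:

```r
library(data.table)

dt <- data.table(
  tr    = rep(c("tr 1", "tr 2"), each = 2),
  ser   = rep(c("ser 1", "ser 2"), times = 2),
  value = c(1, 2, 3, 4)
)
dt.2 <- data.table(
  ser       = rep(c("ser 1", "ser 2"), times = 2),
  tr        = rep(c("tr 1", "tr 2"), each = 2),
  new.value = c(10, 20, 30, 40)
)

# `on=` matches columns by name, so neither keys nor column order matter
res <- dt[dt.2, on = c("tr", "ser")]
```

No `setkey()` call is needed here, which also avoids re-sorting the outer table.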

very nice. I would just point out that it's very easy and efficient to do any averaging you want with `apply()`, to collapse dimensions appropriately, **before** `melt`ing ... — Ben Bolker, Jan 16 '14 at 23:18
@BenBolker, yes absolutely, though I think it gets messy as soon as you need the `class` in this case. Simple margins and the like should definitely be done with `apply`. — BrodieG, Jan 16 '14 at 23:27
This is outstanding! The only thing that I still feel lacking on is how to conceive of the process of efficiently pulling data out to do linear algebra work on, and then re-inserting the result into the table, either in the same column, or in a new column. From where I'm at now I think I'd end up copying values out to a function call, storing them, and joining them again (I can't seem to get [function calls on complex groups](http://stackoverflow.com/questions/21156801/applying-non-trivial-functions-to-ordered-subsets-of-data-table) to work right, but that's another question). — bright-star, Jan 17 '14 at 01:38
@TrevorAlexander I'll add this to the answer tomorrow, but in short: `reshape2::acast` allows you to reconstitute arrays from the long format. So then you do your algebra, `melt` again as shown above, and finally join your result back to the table. — BrodieG, Jan 17 '14 at 01:43
@BenBolker It would be more efficient to use `rowSums`/`colSums` in conjunction with `aperm`. — Roland, Jan 17 '14 at 11:51
One should consider carefully (with large data), if the efficiency gain from working with an array is worth the time needed for the reshape from data.table to array. It could be more efficient to stay in the data.table framework. — Roland, Jan 17 '14 at 14:07
@Roland, agreed, unless the problem is difficult to formulate / slow in long format, and easy / fast in matrix format, you should probably stick to `data.table`. It is pretty darn fast. — BrodieG, Jan 17 '14 at 14:19
@Roland: I don't generally work with super-huge data, and I do things other than summing and taking means across array dimensions, so for me `apply()` works very nicely and transparently. — Ben Bolker, Jan 17 '14 at 14:19
@Roland In general can you get better performance out of a smart call by reference to a data.table than converting to array, doing an operation, and going back? — bright-star, Jan 17 '14 at 22:09
All this reshaping results in quite a few copies. And that costs performance, more so if the data is large. So, there is a trade-off against the performance gain from algebra. The larger your data is the more you need to care about copies. — Roland, Jan 17 '14 at 22:22
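
To make the array-side options from this comment thread concrete, here is a sketch that collapses the trial dimension before any melting. It uses `rowMeans(..., dims = 2)` as a stand-in for the `rowSums`/`aperm` approach mentioned above (`aperm` would come in when the dimension to collapse is not the last one); the labels in `trial.class` are hypothetical.

```r
set.seed(1)
arr <- array(runif(3 * 2 * 4), dim = c(3, 2, 4),
             dimnames = list(ser = paste("ser", 1:3),
                             smp = paste("smp", 1:2),
                             tr  = paste("tr", 1:4)))
trial.class <- c("A", "B", "A", "B")   # hypothetical labels, one per trial

# Mean over trials for every series/sample cell
m.apply <- apply(arr, c(1, 2), mean)

# Same numbers: with dims = 2, rowMeans averages over everything after the 2nd dimension
m.fast <- rowMeans(arr, dims = 2)

# Per-class means need the labels to pick out trials first
m.A <- apply(arr[, , trial.class == "A", drop = FALSE], c(1, 2), mean)
```

The last line shows where the class labels start to get in the way: every per-class operation needs its own logical index into the third dimension.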

Troy · Accepted Answer · 2014-01-20T06:03:43.423

Maybe dplyr::tbl_cube ?

Working on from @BrodieG's excellent answer, I think that you may find it useful to look at the new functionality available from dplyr::tbl_cube. This is essentially a multidimensional object that you can easily create from a list of arrays (as you're currently using), which has some really good functions for subsetting, filtering and summarizing which (importantly, I think) are used consistently across the "cube" view and "tabular" view of the data.

require(dplyr)

Couple of caveats:

It's an early release: all the issues that go along with that
It's recommended for this version to unload plyr when dplyr is loaded

Loading arrays into cubes

Here's an example using arr as defined in the other answer:

# using arr from previous example
# we can convert it simply into a tbl_cube
arr.cube<-as.tbl_cube(arr)

arr.cube  
#Source: local array [24 x 3]  
#D: ser [chr, 3]  
#D: smp [chr, 2]  
#D: tr [chr, 4]  
#M: arr [dbl[3,2,4]]

So note that D means Dimensions and M Measures, and you can have as many as you like of each.

Easy conversion from multi-dimensional to flat

You can easily make the data tabular by returning it as a data.frame (which you can simply convert to a data.table if you need the functionality and performance benefits later):

head(as.data.frame(arr.cube))
#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929

Subsetting

You could obviously flatten all data for every operation, but that has many implications for performance and utility. I think the real benefit of this package is that you can "pre-mine" the cube for the data that you require before converting it into a tabular format that is ggplot-friendly, e.g. simple filtering to return only series 1:

arr.cube.filtered<-filter(arr.cube,ser=="ser 1")
as.data.frame(arr.cube.filtered)
#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 1 smp 2 tr 1 0.9444435
#3 ser 1 smp 1 tr 2 0.4331116
#4 ser 1 smp 2 tr 2 0.3916376
#5 ser 1 smp 1 tr 3 0.4669228
#6 ser 1 smp 2 tr 3 0.8942300
#7 ser 1 smp 1 tr 4 0.2054326
#8 ser 1 smp 2 tr 4 0.1006973

tbl_cube currently works with the dplyr functions summarise(), select(), group_by() and filter(). Usefully you can chain these together with the %.% operator.

For the rest of the examples, I'm going to use the inbuilt nasa tbl_cube object, which has a bunch of meteorological data (and demonstrates multiple dimensions and measures):

Grouping and summary measures

nasa
#Source: local array [41,472 x 4]
#D: lat [dbl, 24]
#D: long [dbl, 24]
#D: month [int, 12]
#D: year [int, 6]
#M: cloudhigh [dbl[24,24,12,6]]
#M: cloudlow [dbl[24,24,12,6]]
#M: cloudmid [dbl[24,24,12,6]]
#M: ozone [dbl[24,24,12,6]]
#M: pressure [dbl[24,24,12,6]]
#M: surftemp [dbl[24,24,12,6]]
#M: temperature [dbl[24,24,12,6]]

So here is an example showing how easy it is to pull back a subset of modified data from the cube, and then flatten it so that it's appropriate for plotting:

plot_data<-as.data.frame(          # as.data.frame so we can see the data
filter(nasa,long<(-70)) %.%        # filter long < (-70) (arbitrary!)
group_by(lat,long) %.%             # group by lat/long combo
summarise(p.max=max(pressure),     # create summary measures for each group
          o.avg=mean(ozone),
          c.all=(cloudhigh+cloudlow+cloudmid)/3)
)

head(plot_data)

#       lat   long p.max    o.avg    c.all
#1 36.20000 -113.8   975 310.7778 22.66667
#2 33.70435 -113.8   975 307.0833 21.33333
#3 31.20870 -113.8   990 300.3056 19.50000
#4 28.71304 -113.8  1000 290.3056 16.00000
#5 26.21739 -113.8  1000 282.4167 14.66667
#6 23.72174 -113.8  1000 275.6111 15.83333

Consistent notation for n-d and 2-d data structures

Sadly the mutate() function isn't yet implemented for tbl_cube but looks like that will just be a matter of (not much) time. You can use it (and all the other functions that work on the cube) on the tabular result, though - with exactly the same notation. For example:

plot_data.mod<-filter(plot_data,lat>25) %.%    # filter out lat <=25
mutate(arb.meas=o.avg/p.max)                   # make a new column

head(plot_data.mod)

#       lat      long p.max    o.avg    c.all  arb.meas
#1 36.20000 -113.8000   975 310.7778 22.66667 0.3187464
#2 33.70435 -113.8000   975 307.0833 21.33333 0.3149573
#3 31.20870 -113.8000   990 300.3056 19.50000 0.3033389
#4 28.71304 -113.8000  1000 290.3056 16.00000 0.2903056
#5 26.21739 -113.8000  1000 282.4167 14.66667 0.2824167
#6 36.20000 -111.2957   930 313.9722 20.66667 0.3376045

Plotting - as an example of R functionality that "likes" flat data

Then you can plot with ggplot() using the benefits of flattened data:

# plot as you like:
ggplot(plot_data.mod) +
  geom_point(aes(lat,long,size=c.all,color=c.all,shape=cut(p.max,6))) +
  facet_grid( lat ~ long ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Using data.table on the resulting flat data

I'm not going to expand on the use of data.table here, as it's done well in the previous answer. Obviously there are many good reasons to use data.table - for any situation here you can return one by a simple conversion of the data.frame:

data.table(as.data.frame(your_cube_name))

Working dynamically with your cube

Another thing I think is great is the ability to add measures (slices / scenarios / shifts, whatever you want to call them) to your cube. I think this will fit well with the method of analysis described in the question. Here's a simple example with arr.cube - adding an additional measure which is itself an (admittedly simple) function of the previous measure. You access/update measures through the syntax yourcube$mets[$...]

head(as.data.frame(arr.cube))

#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929

arr.cube$mets$arr.bump<-arr.cube$mets$arr*1.1  #arb modification!

head(as.data.frame(arr.cube))

#    ser   smp   tr       arr  arr.bump
#1 ser 1 smp 1 tr 1 0.6656456 0.7322102
#2 ser 2 smp 1 tr 1 0.6181301 0.6799431
#3 ser 3 smp 1 tr 1 0.7335676 0.8069244
#4 ser 1 smp 2 tr 1 0.9444435 1.0388878
#5 ser 2 smp 2 tr 1 0.8977054 0.9874759
#6 ser 3 smp 2 tr 1 0.9361929 1.0298122

Dimensions - or not ...

I've played a little with trying to dynamically add entirely new dimensions (effectively scaling up an existing cube with additional dimensions and cloning or modifying the original data using yourcube$dims[$...]) but have found the behaviour to be a little inconsistent. Probably best to avoid this anyway, and structure your cube first before manipulating it. Will keep you posted if I get anywhere.

Persistence

Obviously one of the main issues with having interpreter access to a multidimensional database is the potential to accidentally bugger it with an ill-timed keystroke. So I guess just persist early and often:

tempfilename<-gsub("[ :-]","",paste0("DBX",(Sys.time()),".cub"))
# save:
save(arr.cube,file=tempfilename)
# load:
load(file=tempfilename)
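
One thing to keep in mind is that load() restores objects under the names they were saved with, so it can silently overwrite a newer object of the same name in the workspace. A sketch of an alternative using base R's saveRDS()/readRDS(), with a plain array standing in for the cube and a temporary file as a hypothetical location:

```r
# Any R object works here; a small array stands in for the cube
obj <- array(1:24, dim = c(3, 2, 4))

path <- tempfile(fileext = ".rds")   # hypothetical location

saveRDS(obj, path)                   # writes a single object, without its name
restored <- readRDS(path)            # the caller chooses the name on the way back in
```

Because the restored copy gets its own name, it can be compared with the live object before deciding which one to keep.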

Hope that helps!

This is an excellent answer and super informative! If I were another six months down the road with R, I feel like I'd even be able to make the most out of it :( Do you have any info on comparative speeds/memory usage with this structure? `data.table` is really enticing with its referential calls. — bright-star, Jan 21 '14 at 08:14
@TrevorAlexander it looks like data.table and dplyr are broadly equivalent in terms of performance: some things faster/slower in each. (This is from the docs and some limited benchmarking I've done myself.) I like the syntax in dplyr, but that's personal preference! Also, you can convert any of the output of dplyr to data.table on the fly at any time if you want to use that. I'd be tempted to store it in a cube, and unroll the bits you want for analysis and reporting on the fly, but depends a bit on your data. What's the application? — Troy, Jan 21 '14 at 14:40
@Troy, is it possible for you to share some of your benchmarks where you find `data.table` slower? (perhaps a gist and share the link)? It'd be nice to see if there can be any improvements. — Arun, Jan 21 '14 at 20:16
My main application is epoched multivariate time series. So each epoch is like a square slice, with series as column vectors. This sounds like a good fit, but I'd have to experiment. — bright-star, Jan 21 '14 at 21:01
