As has been pointed out, many of the more powerful analytical and visualization tools rely on data in long format. Certainly for transformations that benefit from matrix algebra you should keep stuff in arrays, but as soon as you want to run parallel analyses on subsets of your data, or plot stuff by factors in your data, you really want to melt.
Here is an example to get you started with data.table and ggplot.
Array -> Data Table
First, let's make some data in your format:
series <- 3
samples <- 2
trials <- 4
trial.labs <- paste("Trial", seq_len(trials))   # labels as they appear in the printed output
trial.class <- sample(c("A", "B"), trials, replace=TRUE)
arr <- array(
  runif(series * samples * trials),
  dim=c(series, samples, trials),
  dimnames=list(
    ser=paste("ser", seq_len(series)),
    smp=paste("smp", seq_len(samples)),
    tr=trial.labs
  )
)
# , , tr = Trial 1
# smp
# ser smp 1 smp 2
# ser 1 0.9648542 0.4134501
# ser 2 0.7285704 0.1393077
# ser 3 0.3142587 0.1012979
#
# ... omitted 2 trials ...
#
# , , tr = Trial 4
# smp
# ser smp 1 smp 2
# ser 1 0.5867905 0.5160964
# ser 2 0.2432201 0.7702306
# ser 3 0.2671743 0.8568685
Now we have a 3 dimensional array. Let's melt it and turn it into a data.table (note that melt from reshape2 returns a data.frame, which is basically a data.table sans bells & whistles, so we have to first melt, then convert to data.table):
library(reshape2)
library(data.table)
# call reshape2's melt explicitly, since data.table also exports a melt() that masks it
dt.raw <- data.table(reshape2::melt(arr), key="tr") # we'll get to what the `key` arg is doing later
# ser smp tr value
# 1: ser 1 smp 1 Trial 1 0.9648542
# 2: ser 2 smp 1 Trial 1 0.7285704
# 3: ser 3 smp 1 Trial 1 0.3142587
# 4: ser 1 smp 2 Trial 1 0.4134501
# 5: ser 2 smp 2 Trial 1 0.1393077
# ---
# 20: ser 2 smp 1 Trial 4 0.2432201
# 21: ser 3 smp 1 Trial 4 0.2671743
# 22: ser 1 smp 2 Trial 4 0.5160964
# 23: ser 2 smp 2 Trial 4 0.7702306
# 24: ser 3 smp 2 Trial 4 0.8568685
Notice how easy this was, with all our dimension labels trickling through to the long format. One of the bells & whistles of data.tables is the ability to do indexed merges between data.tables (much like MySQL indexed joins). So here, we will do that to bind the class to our data:
dt <- dt.raw[J(trial.labs, class=trial.class)] # on the fly mapping of trials to class
# tr ser smp value class
# 1: Trial 1 ser 1 smp 1 0.9648542 A
# 2: Trial 1 ser 2 smp 1 0.7285704 A
# 3: Trial 1 ser 3 smp 1 0.3142587 A
# 4: Trial 1 ser 1 smp 2 0.4134501 A
# 5: Trial 1 ser 2 smp 2 0.1393077 A
# ---
# 20: Trial 4 ser 2 smp 1 0.2432201 A
# 21: Trial 4 ser 3 smp 1 0.2671743 A
# 22: Trial 4 ser 1 smp 2 0.5160964 A
# 23: Trial 4 ser 2 smp 2 0.7702306 A
# 24: Trial 4 ser 3 smp 2 0.8568685 A
A few things to understand:
- J creates a data.table from vectors
- attempting to subset the rows of one data.table with another data.table (i.e. using a data.table as the first argument after the bracket in [.data.table) causes data.table to left join (in MySQL parlance) the outer table (dt.raw in this case) to the inner table (the one created on the fly by J). The join is done on the key column(s) of the outer data.table, which as you may have noticed we defined in the melt/data.table conversion step earlier.
You'll have to read the documentation to fully understand what's going on, but think of J(trial.labs, class=trial.class) as being effectively equivalent to creating a data.table with data.table(trial.labs, class=trial.class), except J only works when used inside [.data.table.
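To make that equivalence concrete, here is a minimal sketch on a toy table. The names x, labs and cls are made up for illustration and are not the tables used in this answer.

```r
library(data.table)

# Hypothetical toy data, keyed on `tr` like dt.raw above
x <- data.table(tr = c("t1", "t1", "t2", "t2"),
                value = c(1, 2, 3, 4),
                key = "tr")
labs <- c("t1", "t2")
cls <- c("A", "B")

# J() builds the lookup table on the fly (only valid inside [.data.table)
via.J <- x[J(labs, class = cls)]

# The same join with an explicitly constructed data.table
via.dt <- x[data.table(labs, class = cls)]

via.J$class    # class looked up for every row of x
via.dt$class   # same result
```

Both calls match the first column of the lookup table against the key of x, so each row of x picks up the class of its trial.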
So now, in one easy step we have our class data attached to the values. Again, if you need matrix algebra, operate on your array first, and then in two or three easy commands switch back to the long format. As noted in the comments, you probably don't want to be going back and forth from the long to array formats unless you have a really good reason to be doing so.
Once things are in a data.table, you can group/aggregate your data (similar to the split-apply-combine style) quite easily. Suppose we want to get summary statistics for each class-sample combination:
dt[, as.list(summary(value)), by=list(class, smp)]
# class smp Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1: A smp 1 0.08324 0.2537 0.3143 0.4708 0.7286 0.9649
# 2: A smp 2 0.10130 0.1609 0.5161 0.4749 0.6894 0.8569
# 3: B smp 1 0.14050 0.3089 0.4773 0.5049 0.6872 0.8970
# 4: B smp 2 0.08294 0.1196 0.1562 0.3818 0.5313 0.9063
Here, we just give data.table an expression (as.list(summary(value))) to evaluate for every class, smp subset of the data (as specified in the by expression). We need as.list so that the results are re-assembled by data.table as columns.
You could just as easily have calculated moments (e.g. list(mean(value), var(value), mean((value - mean(value))^3))) for any combination of the class/sample/trial/series variables.
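As a rough sketch of the moments idea, here is what that looks like on a tiny made-up table (toy, grp and the output column names are assumptions for illustration only):

```r
library(data.table)

# Hypothetical toy data: two groups of three values each
toy <- data.table(grp = c("a", "a", "a", "b", "b", "b"),
                  value = c(1, 2, 6, 2, 4, 6))

# One row per group; each element of the list becomes a column
moments <- toy[, list(mean = mean(value),
                      var  = var(value),
                      m3   = mean((value - mean(value))^3)),  # third central moment
               by = grp]
moments
```

With the real data you would swap grp for something like list(class, smp) or any other combination of the grouping columns.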
If you want to do simple transformations to the data, it is very easy with data.table:
dt[, value:=value * 10] # modify in place with `:=`, very efficient
dt[1:2] # see, `value` now 10x
# tr ser smp value class
# 1: Trial 1 ser 1 smp 1 9.648542 A
# 2: Trial 1 ser 2 smp 1 7.285704 A
This is an in-place transformation, so there are no memory copies, which makes it fast. Generally data.table tries to use memory as efficiently as possible and as such is one of the fastest ways to do this type of analysis.
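One consequence of modifying in place is worth seeing once: plain assignment of a data.table does not copy it, so `:=` shows up through every name bound to the same table. A minimal sketch, with toy, alias and snapshot as made-up names:

```r
library(data.table)

toy <- data.table(id = 1:3, value = c(0.1, 0.2, 0.3))

alias <- toy             # plain assignment: same underlying table, no copy
snapshot <- copy(toy)    # copy() gives an independent table

toy[, value := value * 10]   # modifies `toy` by reference

alias$value      # scaled as well, since it is the same table
snapshot$value   # still the original values
```

So if you need to keep the untransformed values around, take a copy() before using `:=`.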
Plotting From Long Format
ggplot is fantastic for plotting data in long format. I won't get into the details of what's happening, but hopefully the images will give you an idea of what you can do:
library(ggplot2)
ggplot(data=dt, aes(x=ser, y=smp, color=class, size=value)) +
  geom_point() +
  facet_wrap( ~ tr)

ggplot(data=dt, aes(x=tr, y=value, fill=class)) +
  geom_bar(stat="identity") +
  facet_grid(smp ~ ser)

ggplot(data=dt, aes(x=tr, y=paste(ser, smp))) +
  geom_tile(aes(fill=value)) +
  geom_point(aes(shape=class), size=5) +
  scale_fill_gradient2(low="yellow", high="blue", midpoint=median(dt$value))

Data Table -> Array -> Data Table
First we need to acast (from package reshape2) our data table back to an array:
arr.2 <- acast(dt, ser ~ smp ~ tr, value.var="value")
dimnames(arr.2) <- dimnames(arr) # unfortunately `acast` doesn't preserve dimnames properly
# , , tr = Trial 1
# smp
# ser smp 1 smp 2
# ser 1 9.648542 4.134501
# ser 2 7.285704 1.393077
# ser 3 3.142587 1.012979
# ... omitted 3 trials ...
At this point, arr.2 looks just like arr did, except with values multiplied by 10. Note we had to drop the class column. Now, let's do some trivial matrix algebra:
shuff.mat <- matrix(c(0, 1, 1, 0), nrow=2) # re-order columns
for(i in 1:dim(arr.2)[3]) arr.2[, , i] <- arr.2[, , i] %*% shuff.mat
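If it isn't obvious what that multiplication does, here is a small sketch on a made-up 3 x 2 matrix m: right-multiplying by this permutation matrix simply swaps the two columns.

```r
# Hypothetical toy matrix, not arr.2 from above
m <- matrix(1:6, nrow = 3)

shuff.mat <- matrix(c(0, 1, 1, 0), nrow = 2)   # 2 x 2 permutation matrix

swapped <- m %*% shuff.mat    # column 1 becomes old column 2 and vice versa

all(swapped == m[, 2:1])      # same values as indexing the columns in reverse
```

For a plain swap, indexing would do the job too, but the loop above is the pattern to follow when the per-slice operation is real matrix algebra.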
Now, let's go back to long format with melt. Note the key argument:
dt.2 <- data.table(reshape2::melt(arr.2, value.name="new.value"), key=c("tr", "ser", "smp")) # reshape2:: avoids data.table's own melt()
Finally, let's join back dt and dt.2. Here you need to be careful. The behavior of data.table is that the inner table will be joined to the outer table based on all the keys of the inner table if the outer table has no keys. If the inner table has keys, data.table will join key to key. This is a problem here because our intended outer table, dt, already has a key on only tr from earlier, so our join would happen on that column only. Because of that, we need to either drop the key on the outer table or reset the key (we chose the latter here):
setkey(dt, tr, ser, smp)
dt[dt.2]
# tr ser smp value class new.value
# 1: Trial 1 ser 1 smp 1 9.648542 A 4.134501
# 2: Trial 1 ser 1 smp 2 4.134501 A 9.648542
# 3: Trial 1 ser 2 smp 1 7.285704 A 1.393077
# 4: Trial 1 ser 2 smp 2 1.393077 A 7.285704
# 5: Trial 1 ser 3 smp 1 3.142587 A 1.012979
# ---
# 20: Trial 4 ser 1 smp 2 5.160964 A 5.867905
# 21: Trial 4 ser 2 smp 1 2.432201 A 7.702306
# 22: Trial 4 ser 2 smp 2 7.702306 A 2.432201
# 23: Trial 4 ser 3 smp 1 2.671743 A 8.568685
# 24: Trial 4 ser 3 smp 2 8.568685 A 2.671743
Note that data.table carries out joins by matching key columns, that is, by matching the first key column of the outer table to the first column/key of the inner table, the second to the second, and so on, without considering column names (there's a feature request about this). If your tables / keys are not in the same order (as was the case here, if you noticed), you either need to re-order your columns or make sure that both tables have keys on the columns you want in the same order (which is what we did here). The reason the columns were not in the correct order is the first join we did to add the class in, which joined on tr and caused that column to become the first one in the data.table.
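If you would rather not depend on key or column order at all, newer versions of data.table (1.9.6 and later, as far as I know) accept an on= argument that names the join columns explicitly. A minimal sketch on toy tables, where a and b are made-up names and b deliberately has its columns in a different order:

```r
library(data.table)

# Hypothetical toy tables, neither of them keyed
a <- data.table(tr  = c("t1", "t1", "t2", "t2"),
                ser = c("s1", "s2", "s1", "s2"),
                value = 1:4)
b <- data.table(ser = c("s1", "s2", "s1", "s2"),   # columns in a different order
                tr  = c("t1", "t1", "t2", "t2"),
                new.value = c(10, 20, 30, 40))

# Join by column name; no setkey() and no column re-ordering needed
joined <- a[b, on = c("tr", "ser")]
joined
```

The equivalent for the tables in this answer would be something along the lines of dt[dt.2, on=c("tr", "ser", "smp")].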