
I'm in the enviable position of being able to set up the format for my data collection ahead of time, rather than being handed some crazy format and having to struggle with it. I'd like to make sure I'm setting it up in a way that minimizes headaches down the road, but I'm not very familiar with importing into multidimensional arrays so I'd like input. It also seems like a thought exercise that others might get some use from.

I am compiling a large number of data summaries (500+), with 23 single data values for each experiment plus two additional vectors of between 100 and 1500 values (the two vectors always match each other in length within a sample, but that length differs between samples). I'm storing all of these in an Excel sheet which I'm currently building, and I want to set it up in a way that efficiently stores this data for import into an R array.

I'm assuming that the longer dimensions, which vary in length, will be padded to the max length (1500) with NAs at the end, rather than trying to keep track of ragged data in Excel.
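If it helps, that NA padding can be done at import time rather than by hand in Excel; here's a minimal base-R sketch with made-up values and a hypothetical `pad_na` helper:

```r
# Toy vectors standing in for the two long measurements; the real ones
# run 100-1500 values (names and values here are invented)
v1 <- c(2, 3, 5)
v2 <- c(1, 4, 1, 5, 9)
max_len <- 1500

# Pad a vector with NA out to a fixed length
pad_na <- function(v, n) c(v, rep(NA, n - length(v)))

# Every padded vector now has length max_len, NAs trailing
padded <- lapply(list(v1, v2), pad_na, n = max_len)
```

Each element of `padded` is then a fixed-length slice suitable for binding into a rectangular array.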

My current plan would be to store these in long form in Excel, with data labels in the first column (dim1, dim2,...), and the data summaries in each subsequent column (a, b, c...), since this saves the most space. Using a smaller number of dimensions as an example (7 single values, 2 vectors of length 1500), the data would look like this in Excel:

     a b c ...
dim1 2 5 7 ...
dim2 3 6 8 ...
dim3 6 8 2 ...
dim4 5 6 1 ...
dim5 6 2 1 ...
dim6 0 3 8 ...
dim7 8 5 4 ...
dim8 1 1 1 ...
dim8 2 2 2 ...
... (continued x1500 rows)
dim9 4 4 4 ...
dim9 5 5 5 ...
... (continued x1500 rows)

Can I easily import this, using the leftmost column to identify the dimensions of the array in long form? I don't see an easy way to do this using reshape2, but perhaps I'm missing something. Or do I need to have the data in paired columns?
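For what it's worth, base R's split() can already use that leftmost label column to pull the scalar rows apart from the stacked vector rows, with no reshape2 needed; a toy sketch with invented values:

```r
# A toy version of the proposed layout: the label column plus one
# column per experiment (values and names are hypothetical)
df <- data.frame(
  dim = c("dim1", "dim2", "dim8", "dim8", "dim8"),
  a   = c(2, 3, 1, 2, 3),
  b   = c(5, 6, 1, 2, 2)
)

# split() on the label column separates the scalar rows from the
# stacked vector rows
parts   <- split(df[-1], df$dim)
scalars <- parts[c("dim1", "dim2")]   # one-row data frames
vec8    <- parts[["dim8"]]            # the long vector, one value per row
```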

It isn't clear to me whether this format is the most efficient way to organize this data for import into a multidimensional array, or if there is a better way. Eventually there will be a large number of samples so I'd like to think through this now rather than struggle later.

What is the most painless way to import this...or, is there a more efficient way of setting it up for easier import?

JHegg

1 Answer


Hmm... I can't think of a case where you would have to use melt. If you keep the current format and add a heading to the 'dim' column, you should be able to work with that data fairly easily.

If you did transpose the data on 'dim' I think it would make things a lot more difficult.

It might be good to know what variable types a, b, c, etc. are in order to make a better assessment.
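To illustrate the transposing concern: a naive t() on a data frame with mixed column types goes through as.matrix(), so every value collapses to character (toy data):

```r
# Mixed-type frame: one character column, two numeric columns
df  <- data.frame(dim = c("dim1", "dim2"), a = c(2, 3), b = c(5, 6))

# t() coerces the whole thing to a character matrix; numeric types are lost
tdf <- t(df)
is.character(tdf)
```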

cgage1
  • In my data, dim1-dim7 are a mix of factor and numeric, while dim8 and dim9 are all numeric. So for the sake of argument, in the example dim1:dim4 are factors and dim5:dim7 are numeric. a, b, c... are all columns of data for an individual experiment, with the leftmost column identifying which observation it is. If it weren't for dim8 and dim9, which are 1500 cells long, it would just be a simple matrix. – JHegg Feb 18 '16 at 19:10
  • Ah, I see. In that case I think it might be better to transpose the data table, so that each row represents an individual experiment and each column is a variable of that experiment. As for dim8 and dim9, perhaps create an additional table with the details of those, and when the time comes you can join the two. This may be the easiest layout for recording the data. You will have to add some additional lines of code to join the two tables as well, but that shouldn't be difficult as long as you have a unique key in each table to be joined on. – cgage1 Feb 18 '16 at 19:43
  • I was hoping there might be a way to import it all together, but the more I look at it I don't think that's possible. You are probably correct. – JHegg Feb 21 '16 at 19:35
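The two-table join suggested in the comments could be sketched in base R like this (all names and values invented):

```r
# One row per experiment for the scalar summaries
scalars <- data.frame(id   = c("a", "b"),
                      dim1 = c(2, 3),
                      dim2 = c(5, 6))

# A long table for the vector measurements; lengths can differ per id
vectors <- data.frame(id   = rep(c("a", "b"), times = c(3, 2)),
                      dim8 = c(1, 2, 3, 4, 5))

# Base-R join on the shared key; each scalar row is repeated once per
# vector value belonging to its experiment
joined <- merge(scalars, vectors, by = "id")
```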