
I have a vector of information, say:

Info = [10, 20, 10, 30, 500, 400, 67, 350, 20, 105, 15];

and another vector of IDs, say:

Info_IDs = [1, 2, 1, 4, 2, 3, 4, 1, 3, 1, 2];

I would like to obtain a matrix that is defined as follows:

Result =
    10    10   350   105
    20   500    15     0
   400    20     0     0
    30    67     0     0

Every row contains the values of Info that correspond to one ID. As this short example shows, the number of values per ID (and hence per row) differs, so shorter rows are padded with zeros.

I'm working with large amounts of data (Info is 1x1000000, and Info_IDs contains about 25,000 distinct IDs), so I would like to build this Result matrix preferably without loops. One alternative I was considering is to compute a histogram per ID and store that instead (so Result would contain the binned counts rather than the original values).
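
For what it's worth, here is a rough sketch of that binned fallback (the bin edges below are arbitrary placeholders, and values falling outside them would get bin index 0 and need extra handling):

edges = 0:100:500;                             % placeholder bin edges
[~, bin] = histc(Info, edges);                 % bin index of every value (0 if outside the edges)
Binned = accumarray([Info_IDs(:), bin(:)], 1, [max(Info_IDs), numel(edges)]);

Every row of Binned would then hold one ID's counts per bin instead of its raw values.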

Thank you all in advance for your input.

Piwie

3 Answers


Here's a vectorized solution that should be both memory-efficient and fast, even on large inputs:

%// Pad data with zero values and add matching IDs
len = histc(Info_IDs, 1:max(Info_IDs));        %// number of values per ID
padlen = max(len) - len;                       %// zeros needed to even out each ID
padval = zeros(1, sum(padlen));
padval(cumsum([1, padlen(1:end - 1)])) = 1;    %// mark where each ID's padding starts
Info = [Info, zeros(1, sum(padlen))];          %// append the zero padding
Info_IDs = [Info_IDs, cumsum(padval) + 1];     %// ... and the IDs it belongs to

%// Group data into rows
Result = accumarray(Info_IDs(:), Info, [], @(x){x}).';
Result = [Result{:}].';

The second step can also be performed as follows:

%// Group data into rows
[sorted_IDs, sorted_idx] = sort(Info_IDs);
Result = reshape(Info(sorted_idx), [], numel(len)).';   %// one column per ID, then transpose
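
If you need this more than once, the whole procedure can be wrapped up; here's a minimal sketch as a function (the name group_by_id is arbitrary):

function Result = group_by_id(Info, Info_IDs)
%GROUP_BY_ID  Arrange the values of Info into rows, one row per ID, padded with zeros.

    %// Pad data with zero values and add matching IDs
    len = histc(Info_IDs, 1:max(Info_IDs));
    padlen = max(len) - len;
    padval = zeros(1, sum(padlen));
    padval(cumsum([1, padlen(1:end - 1)])) = 1;
    Info = [Info, zeros(1, sum(padlen))];
    Info_IDs = [Info_IDs, cumsum(padval) + 1];

    %// Group data into rows (sort is stable, so each ID keeps its original value order)
    [~, sorted_idx] = sort(Info_IDs);
    Result = reshape(Info(sorted_idx), [], numel(len)).';
end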

Example

%// Sample input data
Info = [10 20 10 30 500 400 67 350 20 105 15];
Info_IDs = [1 2 1 4 2 3 4 1 3 1 2];

%// Pad data with zero values and add matching IDs
len = histc(Info_IDs, 1:max(Info_IDs));
padlen = max(len) - len;
padval = zeros(1, sum(padlen));
padval(cumsum([1, padlen(1:end - 1)])) = 1;
Info = [Info, zeros(1, sum(padlen))]
Info_IDs = [Info_IDs, cumsum(padval) + 1]

%// Group data into rows
Result = accumarray(Info_IDs(:), Info, [], @(x){x}).';
Result = [Result{:}].';

The result is:

Result =
    10    10   350   105
    20   500    15     0
   400    20     0     0
    30    67     0     0
Eitan T
  • Thank you for your suggestion. I just wrote an edit on my code. – Piwie Aug 06 '13 at 08:23
  • @Piwie Why do you say that this code cannot handle large amounts of data? Also, there's no need to copy-paste my answer into your question, it's superfluous. – Eitan T Aug 06 '13 at 08:27
  • I edited my question with the solution I derived from your suggestion. However, it is not the same. I thought that by showing what I used I could highlight where the bottleneck of the solution lies. If you check my code: `Results = zeros(1,max(n)*size(Info,2));` This results in a huge matrix of mainly zeros at the end of the processing. My computer cannot handle initializing this zeros matrix in the first place. – Piwie Aug 06 '13 at 08:51
  • Thank you for the edits to the question. It is much clearer this way! This is the first time I have posted a question here. Thanks. – Piwie Aug 06 '13 at 08:53
  • @Piwie You're welcome :) Regarding your question: if you declare a huge matrix of zeros (not mainly, _entirely_), MATLAB is likely to complain. But there is no need to preallocate variables like that in my solution (MATLAB does that automatically for you). Doesn't it work for you as is? – Eitan T Aug 06 '13 at 09:44

I don't know about not using loops, but this is pretty fast:

Result = [];
n = 4; % i.e. number of classes
for c = 1:n
    row = Info(Info_IDs == c);
    Result(c, 1:size(row,2)) = row;
end

And if speed really is an issue, you can preallocate with `Result = zeros(4, sum(Info_IDs == mode(Info_IDs)))` (the second argument is the length of the longest row).
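
Putting the loop and the preallocation together (deriving the number of classes from the data, rather than hard-coding it, is an assumption on my part):

num_classes = max(Info_IDs);                    % assumes IDs are 1..N
max_len = sum(Info_IDs == mode(Info_IDs));      % length of the longest row
Result = zeros(num_classes, max_len);           % preallocate once
for c = 1:num_classes
    row = Info(Info_IDs == c);
    Result(c, 1:numel(row)) = row;
end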

Dan
  • Thanks! That is what I am doing now... but I find it slow. I have an IDs vector of 25,000 elements, and an Info vector of 1,000,000 elements. – Piwie Aug 05 '13 at 16:02
  • @Piwie but how many classes? Try it with the preallocation I suggested. If you have only a few classes then it should be pretty fast. – Dan Aug 05 '13 at 16:03
  • I have 25,000 unique classes (my IDs) and an Info vector of 1,000,000 elements. I tried to preallocate before, treating the problem as one large matrix (zeros(25000, 1000000)) --> out of memory, of course. I will check your method, it might be better. Thanks! – Piwie Aug 05 '13 at 16:15

If you don't mind having zeros in between:

number_Ids = 4; % set as required
aux = bsxfun(@eq, Info_IDs, (1:number_Ids).');  % aux(k,j) is true where Info_IDs(j) == k
sol = bsxfun(@(x,y) x.*y, Info, aux)            % keeps Info(j) in row k, zeros elsewhere

This gives, in your example:

10     0    10     0     0     0     0   350     0   105     0
 0    20     0     0   500     0     0     0     0     0    15
 0     0     0     0     0   400     0     0    20     0     0
 0     0     0    30     0     0    67     0     0     0     0

Or, if you do mind the zeros but not the order, you can sort each row of the result:

sol2 = sort(sol,2,'descend')

which gives

350   105    10    10     0     0     0     0     0     0     0
500    20    15     0     0     0     0     0     0     0     0
400    20     0     0     0     0     0     0     0     0     0
 67    30     0     0     0     0     0     0     0     0     0

EDIT: the order of the non-zero entries can be preserved using the same trick as here
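
Since that link isn't reproduced here, a sketch of how the trick usually looks (a sort on the zero mask; because sort is stable, the non-zeros keep their original order and the zeros end up on the right):

[~, idx] = sort(double(sol == 0), 2);           % non-zeros (0) sort before zeros (1) in each row
[r, ~] = ndgrid(1:size(sol,1), 1:size(sol,2));  % row index of every element
sol3 = sol(sub2ind(size(sol), r, idx))          % rows rearranged, non-zeros first, order preserved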

Luis Mendo
  • Thank you for this solution. As I mentioned before, this would technically result in a 25,000 x 1,000,000 matrix, mostly filled with zeros. I cannot afford that computational effort, as I have loads of data to process; I process it in chunks of 1,000,000. Thanks for the suggestion. – Piwie Aug 06 '13 at 07:54