1

I have two big matrices. Matrix A which is [4144514 x 3] and matrix B which is [51962 x 17].

The first three columns of A and B have the identifiers. On matrix B the 3 columns make a unique identifier, but this might be repeated on A.

I want to merge the two matrices, in such a away that the resulting matrix A, which should be [4144514 x 20], i.e. I am merging the 17 columns of B with matrix A given the criteria on the first three columns of each matrix.

This is the loop I am doing:

for i=1:size(B,1)
    aux = sum(A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3));
    A(A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3),4:20) = repmat(B(i,:),aux,1);
end

The variable aux tells me how many lines in A match the criteria from B.

The second line of the loop, creates a matrix with the columns of B repeated for aux number of times, and puts them on the correct place in A. However this is extremely inefficient.

Let me give you a toy example below with smaller matrices. I will have A to be [108 x 3] and B to be [27 x 17].

What I am looking to do is the following:

A = [100, 1, 2000 ;100, 1, 2000 ;100, 1, 2000 ;100, 1, 2000 ;100, 2, 2000 ;100, 2, 2000 ;100, 2, 2000 ;100, 2, 2000 ;100, 3, 2000 ;100, 3, 2000 ;100, 3, 2000 ;100, 3, 2000 ;100, 1, 2001 ;100, 1, 2001 ;100, 1, 2001 ;100, 1, 2001 ;100, 2, 2001 ;100, 2, 2001 ;100, 2, 2001 ;100, 2, 2001 ;100, 3, 2001 ;100, 3, 2001 ;100, 3, 2001 ;100, 3, 2001 ;100, 1, 2002 ;100, 1, 2002 ;100, 1, 2002 ;100, 1, 2002 ;100, 2, 2002 ;100, 2, 2002 ;100, 2, 2002 ;100, 2, 2002 ;100, 3, 2002 ;100, 3, 2002 ;100, 3, 2002 ;100, 3, 2002 ;101, 1, 2000 ;101, 1, 2000 ;101, 1, 2000 ;101, 1, 2000 ;101, 2, 2000 ;101, 2, 2000 ;101, 2, 2000 ;101, 2, 2000 ;101, 3, 2000 ;101, 3, 2000 ;101, 3, 2000 ;101, 3, 2000 ;101, 1, 2001 ;101, 1, 2001 ;101, 1, 2001 ;101, 1, 2001 ;101, 2, 2001 ;101, 2, 2001 ;101, 2, 2001 ;101, 2, 2001 ;101, 3, 2001 ;101, 3, 2001 ;101, 3, 2001 ;101, 3, 2001 ;101, 1, 2002 ;101, 1, 2002 ;101, 1, 2002 ;101, 1, 2002 ;101, 2, 2002 ;101, 2, 2002 ;101, 2, 2002 ;101, 2, 2002 ;101, 3, 2002 ;101, 3, 2002 ;101, 3, 2002 ;101, 3, 2002 ;103, 1, 2000 ;103, 1, 2000 ;103, 1, 2000 ;103, 1, 2000 ;103, 2, 2000 ;103, 2, 2000 ;103, 2, 2000 ;103, 2, 2000 ;103, 3, 2000 ;103, 3, 2000 ;103, 3, 2000 ;103, 3, 2000 ;103, 1, 2001 ;103, 1, 2001 ;103, 1, 2001 ;103, 1, 2001 ;103, 2, 2001 ;103, 2, 2001 ;103, 2, 2001 ;103, 2, 2001 ;103, 3, 2001 ;103, 3, 2001 ;103, 3, 2001 ;103, 3, 2001 ;103, 1, 2002 ;103, 1, 2002 ;103, 1, 2002 ;103, 1, 2002 ;103, 2, 2002 ;103, 2, 2002 ;103, 2, 2002 ;103, 2, 2002 ;103, 3, 2002 ;103, 3, 2002 ;103, 3, 2002 ;103, 3, 2002];
B = [100, 1, 2000, 8, 7, 9, 10, 1, 2, 9, 2, 1, 3, 3, 3, 9, 7; 100, 2, 2000, 8, 2, 7, 2, 7, 5, 5, 9, 2, 7, 1, 2, 6, 4; 100, 3, 2000, 8, 8, 7, 3, 2, 8, 1, 10, 9, 8, 6, 1, 5, 7; 100, 1, 2001, 10, 10, 1, 7, 2, 5, 5, 8, 6, 5, 3, 6, 6, 4; 100, 2, 2001, 6, 7, 3, 1, 5, 3, 9, 9, 3, 8, 1, 6, 4, 4; 100, 3, 2001, 1, 5, 7, 1, 5, 4, 2, 10, 5, 4, 6, 5, 1, 10; 100, 1, 2002, 7, 4, 6, 4, 7, 8, 3, 7, 7, 8, 2, 1, 6, 2; 100, 2, 2002, 8, 7, 8, 10, 2, 10, 8, 4, 7, 5, 10, 4, 4, 2; 100, 3, 2002, 4, 9, 4, 10, 2, 4, 2, 1, 4, 10, 9, 2, 6, 9; 101, 1, 2000, 5, 9, 5, 3, 10, 1, 4, 2, 10, 2, 6, 8, 5, 4; 101, 2, 2000, 6, 1, 8, 10, 10, 7, 4, 6, 5, 2, 8, 3, 2, 1; 101, 3, 2000, 7, 7, 5, 6, 2, 8, 8, 8, 1, 6, 1, 1, 9, 7; 101, 1, 2001, 8, 4, 5, 5, 8, 7, 2, 2, 9, 8, 1, 4, 2, 1; 101, 2, 2001, 3, 7, 10, 4, 9, 9, 1, 1, 10, 7, 6, 5, 10, 9; 101, 3, 2001, 10, 8, 2, 4, 6, 1, 6, 4, 8, 10, 7, 9, 4, 7; 101, 1, 2002, 6, 9, 3, 2, 10, 8, 5, 2, 6, 9, 1, 3, 6, 6; 101, 2, 2002, 8, 3, 8, 4, 4, 3, 4, 2, 4, 7, 2, 10, 3, 3; 101, 3, 2002, 10, 2, 10, 10, 6, 5, 3, 5, 10, 1, 3, 4, 8, 5; 103, 1, 2000, 8, 3, 9, 9, 4, 3, 3, 9, 7, 7, 6, 5, 2, 6; 103, 2, 2000, 6, 7, 5, 5, 7, 10, 5, 3, 5, 4, 7, 8, 9, 7; 103, 3, 2000, 3, 4, 9, 10, 3, 10, 7, 2, 10, 3, 3, 3, 6, 6; 103, 1, 2001, 7, 7, 1, 7, 10, 7, 10, 7, 9, 8, 4, 7, 6, 2; 103, 2, 2001, 5, 5, 4, 3, 7, 7, 6, 5, 2, 5, 5, 6, 6, 5; 103, 3, 2001, 6, 3, 9, 9, 2, 10, 10, 10, 10, 7, 10, 9, 9, 8; 103, 1, 2002, 5, 10, 2, 8, 6, 5, 7, 6, 4, 3, 6, 8, 7, 4; 103, 2, 2002, 10, 7, 6, 3, 10, 4, 5, 5, 1, 3, 1, 9, 1, 5; 103, 3, 2002, 2, 1, 5, 5, 2, 8, 6, 2, 6, 6, 10, 1, 4, 9];

for i=1:size(B,1)
        aux = sum(A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3));
        A(A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3),4:20) = repmat(B(i,:),aux,1);
end

With this smaller example the code runs very fast. But as soon as the matrices get as big as the ones I have, it takes ages. Is there any faster way to do this?

Cœur
  • 37,241
  • 25
  • 195
  • 267
phdstudent
  • 1,060
  • 20
  • 41
  • Maybe `par for` would be a good way to go. – Jake0x32 Feb 24 '16 at 21:17
  • Yes, that would be one way to go. But given that I only have 4 cores on my PC, it would reduce the timing by only 1/4. There should be a powerful way to do this is matlab using vectorization or something similar ... – phdstudent Feb 24 '16 at 21:19
  • You are calculating the same expression twice, calculate it once and assign the result to a variable. – Daniel Feb 24 '16 at 21:35
  • I mean this part `A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3)` it is evaluated twice – Daniel Feb 24 '16 at 21:39
  • 1
    Apologies for not being clear on that. The 3 identifiers are unique on matrix B, but matrix A might have them repeated. So I want to repeat the lines of B, to match with A. The toy example above illustrates this. @Daniel, I agree that part is evaluated twice, and I think there is an easy way to change that. Still, the code would be very slow... – phdstudent Feb 24 '16 at 21:40
  • Is there any purpose in duplicating the first three columns? If I see the code right `A(:,1:3)==A(:,4:6)`. I know that such little things don't increase the performance drastically, but simplifying the code also helps finding a vectorisation. – Daniel Feb 24 '16 at 21:47
  • Also, no purpose on that. I was just trying to make the it simpler. – phdstudent Feb 24 '16 at 21:51
  • @Jake0x32: Parfor probably won't help here. It's a typical case of [simple operations and big data](http://stackoverflow.com/questions/3174358/matlab-parfor-is-slower-than-for-what-is-wrong) where the communication overhead would make it slower than a for loop. – Daniel Feb 24 '16 at 22:08

1 Answers1

1

A very simple task to vectorize, ismember does all the work.

%your example
A = [100, 1, 2000 ;100, 1, 2000 ;100, 1, 2000 ;100, 1, 2000 ;100, 2, 2000 ;100, 2, 2000 ;100, 2, 2000 ;100, 2, 2000 ;100, 3, 2000 ;100, 3, 2000 ;100, 3, 2000 ;100, 3, 2000 ;100, 1, 2001 ;100, 1, 2001 ;100, 1, 2001 ;100, 1, 2001 ;100, 2, 2001 ;100, 2, 2001 ;100, 2, 2001 ;100, 2, 2001 ;100, 3, 2001 ;100, 3, 2001 ;100, 3, 2001 ;100, 3, 2001 ;100, 1, 2002 ;100, 1, 2002 ;100, 1, 2002 ;100, 1, 2002 ;100, 2, 2002 ;100, 2, 2002 ;100, 2, 2002 ;100, 2, 2002 ;100, 3, 2002 ;100, 3, 2002 ;100, 3, 2002 ;100, 3, 2002 ;101, 1, 2000 ;101, 1, 2000 ;101, 1, 2000 ;101, 1, 2000 ;101, 2, 2000 ;101, 2, 2000 ;101, 2, 2000 ;101, 2, 2000 ;101, 3, 2000 ;101, 3, 2000 ;101, 3, 2000 ;101, 3, 2000 ;101, 1, 2001 ;101, 1, 2001 ;101, 1, 2001 ;101, 1, 2001 ;101, 2, 2001 ;101, 2, 2001 ;101, 2, 2001 ;101, 2, 2001 ;101, 3, 2001 ;101, 3, 2001 ;101, 3, 2001 ;101, 3, 2001 ;101, 1, 2002 ;101, 1, 2002 ;101, 1, 2002 ;101, 1, 2002 ;101, 2, 2002 ;101, 2, 2002 ;101, 2, 2002 ;101, 2, 2002 ;101, 3, 2002 ;101, 3, 2002 ;101, 3, 2002 ;101, 3, 2002 ;103, 1, 2000 ;103, 1, 2000 ;103, 1, 2000 ;103, 1, 2000 ;103, 2, 2000 ;103, 2, 2000 ;103, 2, 2000 ;103, 2, 2000 ;103, 3, 2000 ;103, 3, 2000 ;103, 3, 2000 ;103, 3, 2000 ;103, 1, 2001 ;103, 1, 2001 ;103, 1, 2001 ;103, 1, 2001 ;103, 2, 2001 ;103, 2, 2001 ;103, 2, 2001 ;103, 2, 2001 ;103, 3, 2001 ;103, 3, 2001 ;103, 3, 2001 ;103, 3, 2001 ;103, 1, 2002 ;103, 1, 2002 ;103, 1, 2002 ;103, 1, 2002 ;103, 2, 2002 ;103, 2, 2002 ;103, 2, 2002 ;103, 2, 2002 ;103, 3, 2002 ;103, 3, 2002 ;103, 3, 2002 ;103, 3, 2002];
B = [100, 1, 2000, 8, 7, 9, 10, 1, 2, 9, 2, 1, 3, 3, 3, 9, 7; 100, 2, 2000, 8, 2, 7, 2, 7, 5, 5, 9, 2, 7, 1, 2, 6, 4; 100, 3, 2000, 8, 8, 7, 3, 2, 8, 1, 10, 9, 8, 6, 1, 5, 7; 100, 1, 2001, 10, 10, 1, 7, 2, 5, 5, 8, 6, 5, 3, 6, 6, 4; 100, 2, 2001, 6, 7, 3, 1, 5, 3, 9, 9, 3, 8, 1, 6, 4, 4; 100, 3, 2001, 1, 5, 7, 1, 5, 4, 2, 10, 5, 4, 6, 5, 1, 10; 100, 1, 2002, 7, 4, 6, 4, 7, 8, 3, 7, 7, 8, 2, 1, 6, 2; 100, 2, 2002, 8, 7, 8, 10, 2, 10, 8, 4, 7, 5, 10, 4, 4, 2; 100, 3, 2002, 4, 9, 4, 10, 2, 4, 2, 1, 4, 10, 9, 2, 6, 9; 101, 1, 2000, 5, 9, 5, 3, 10, 1, 4, 2, 10, 2, 6, 8, 5, 4; 101, 2, 2000, 6, 1, 8, 10, 10, 7, 4, 6, 5, 2, 8, 3, 2, 1; 101, 3, 2000, 7, 7, 5, 6, 2, 8, 8, 8, 1, 6, 1, 1, 9, 7; 101, 1, 2001, 8, 4, 5, 5, 8, 7, 2, 2, 9, 8, 1, 4, 2, 1; 101, 2, 2001, 3, 7, 10, 4, 9, 9, 1, 1, 10, 7, 6, 5, 10, 9; 101, 3, 2001, 10, 8, 2, 4, 6, 1, 6, 4, 8, 10, 7, 9, 4, 7; 101, 1, 2002, 6, 9, 3, 2, 10, 8, 5, 2, 6, 9, 1, 3, 6, 6; 101, 2, 2002, 8, 3, 8, 4, 4, 3, 4, 2, 4, 7, 2, 10, 3, 3; 101, 3, 2002, 10, 2, 10, 10, 6, 5, 3, 5, 10, 1, 3, 4, 8, 5; 103, 1, 2000, 8, 3, 9, 9, 4, 3, 3, 9, 7, 7, 6, 5, 2, 6; 103, 2, 2000, 6, 7, 5, 5, 7, 10, 5, 3, 5, 4, 7, 8, 9, 7; 103, 3, 2000, 3, 4, 9, 10, 3, 10, 7, 2, 10, 3, 3, 3, 6, 6; 103, 1, 2001, 7, 7, 1, 7, 10, 7, 10, 7, 9, 8, 4, 7, 6, 2; 103, 2, 2001, 5, 5, 4, 3, 7, 7, 6, 5, 2, 5, 5, 6, 6, 5; 103, 3, 2001, 6, 3, 9, 9, 2, 10, 10, 10, 10, 7, 10, 9, 9, 8; 103, 1, 2002, 5, 10, 2, 8, 6, 5, 7, 6, 4, 3, 6, 8, 7, 4; 103, 2, 2002, 10, 7, 6, 3, 10, 4, 5, 5, 1, 3, 1, 9, 1, 5; 103, 3, 2002, 2, 1, 5, 5, 2, 8, 6, 2, 6, 6, 10, 1, 4, 9];
%reference code, changed output to C to preserve input
C=A;
for i=1:size(B,1)
        aux = sum(A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3));
        C(A(:,1)==B(i,1) & A(:,2)==B(i,2) & A(:,3) == B(i,3),4:20) = repmat(B(i,:),aux,1);
end
%vectorized version
[~,lia]=ismember(A(:,1:3),B(:,1:3),'rows');
C2=nan(numel(lia),size(B,2)+3);
C2(lia>0,:)=B(lia(lia>0),[1,2,3,1:end]);
toc;
tic;
%simplified vectorized version, assuming there is no need to duplicate the first three rows:
[~,lia]=ismember(A(:,1:3),B(:,1:3),'rows');
C3=nan(numel(lia),size(B,2));
C3(lia>0,:)=B(lia(lia>0),:);
toc;

Comparing the performance with some example data close to your real data (same data was to large):

n=100000 %4144514
m=10000 %51962
B=rand(10000,17);
B=unique(B,'rows');
A=B(randi([1 size(B,1)],100000,1),1:3);

Decreased the execution time from 14s to less 0.05s

Daniel
  • 36,610
  • 3
  • 36
  • 69
  • Thank you for your answer. It should work, I am just having a small issue. Sometimes there are data points in A which have no correspondence in B (e.g. the vector "lia" has some 0 values) which makes the last step (C3=B(lia,:);) have an error: "Subscript indices must either be real positive integers or logicals.". I tried to replace the zeros with NaN, but didn't work. If there is no match, the result should be NaN. – phdstudent Feb 24 '16 at 21:59
  • This was impressive. Thank you very much. – phdstudent Feb 24 '16 at 22:07