1

I have the following data:

A = [1 2 ; 3 2; 4 7; 10 2; 6 7; 10 9]
B = [1 2 3; 4 4 9; 1 8 0; 3 7 9; 3 6 8]
C = [4; 10; 6; 3; 1]

A =
    1     2
    3     2
    4     7
   10     2
    6     7
   10     9

B =
    1     2     3
    4     4     9
    1     8     0
    3     7     9
    3     6     8

C.' =
    4    10     6     3     1

For each unique value in A(:,2) I need to take the corresponding values in A(:,1), look for their value in C, then take the relevant rows in B and compute their mean. The result should be length(unique(A(:,2)) x size(B,2);

The expected result for this example:

  • Value "2": mean of rows 2, 4 and 5 from B Explanation: Indices 1, 3 and 10 that correspond to value "2" in A are at indices 2, 4, 5 in C.

Correspondingly:

  • Value "7": mean of rows 1 and 3 from B.
  • Value "9": mean of row 2 from B.

I compute it now by applying unique on A and iterating each value, searching the right indices. My data set is quite large, so it takes quite a time. How can I avoid the loops?

Mohsen Nosratinia
  • 9,844
  • 1
  • 27
  • 52
matlabit
  • 838
  • 2
  • 13
  • 31
  • 2
    After reading it multiple times I now think I know what you want. However I doubt whether loops can be avoided alltogether. Please show your code and the profiling report(`help profile`). Also describe typical variable sizes and values, then we can probably help you to find weak points. – Dennis Jaheruddin Sep 02 '13 at 08:56
  • 1
    @DennisJaheruddin They _can_ be avoided, see my answer... ;) – Eitan T Sep 02 '13 at 08:57

3 Answers3

6

Let's do what you say in the question step by step:

  1. For each unique value in A(:, 2):

    [U, ia, iu] = unique(A(:, 2));
    
  2. Take the corresponding values in A(:, 1) and look for their value in C:

    [tf, loc] = ismember(A(:, 1), C);
    

    It's also recommended to make sure, just in case, that all values are actually found in C:

    assert(all(tf))
    
  3. Then take the relevant rows in B and compute their mean:

    [X, Y] = meshgrid(1:size(B, 2), iu);
    result = accumarray([Y(:), X(:)], reshape(B(loc, :), 1, []), [], @mean);
    

Hope this helps! :)

Example

%// Sample input
A = [1 2 ; 3 2; 4 7; 10 2; 6 7; 10 9];
B = [1 2 3; 4 4 9; 1 8 0; 3 7 9; 3 6 8];
C = [4; 10; 6; 3; 1];

%// Compute means
[U, ia, iu] = unique(A(:, 2));
[tf, loc] = ismember(A(:, 1), C);
[X, Y] = meshgrid(1:size(B, 2), iu);
result = accumarray([Y(:), X(:)], reshape(B(loc, :), [], 1), [], @mean);

The result is:

result = 
   3.3333   5.6667   8.6667
   1.0000   5.0000   1.5000
   4.0000   4.0000   9.0000
Eitan T
  • 32,660
  • 14
  • 72
  • 109
3

Here is another solution without arrayfun and accumarray using good old-fashion matrix multiplication:

r = bsxfun(@eq, A(:,1), C')*(1:numel(C))'; 
[~,m,n] = unique(A(:,2));
f=histc(n, 1:numel(m));
result = diag(1./f)*bsxfun(@eq, 1:numel(m), n).'*B(r,:);

I ran a benchmark against other two solutions and it appears to be faster than both. For 1000 repetitions:

  • This method takes 0.205650 seconds.
  • Eitan T's solution takes 0.546976 seconds.
  • matlabit's solution takes 1.619039 seconds.

Here is the benchmark code:

N = 1e3;

tic
for k=1:N,
    r = bsxfun(@eq, A(:,1), C')*(1:numel(C))'; % faster than [~,r] = ismember(A(:,1), C)
    [~,m,n] = unique(A(:,2));
    f=histc(n, 1:numel(m));
    result2 = diag(1./f)*bsxfun(@eq, 1:numel(m), n).'*B(r,:);
end
toc

tic
for k=1:N,
    [U, ia, iu] = unique(A(:, 2));
    [tf, loc] = ismember(A(:, 1), C);
    [X, Y] = meshgrid(1:size(B, 2), iu);
    result1 = accumarray([Y(:), X(:)], reshape(B(loc, :), [], 1), [], @mean);
end
toc

tic
for k=1:N,
    D = [arrayfun(@(x) find(C == x,1,'first'), A(:,1) ), A(:,2)];
    data = [B(D(:,1),:), D(:,2)];
    st = grpstats(data(:,1:3),data(:,4:4),{'mean'});
end
toc
Mohsen Nosratinia
  • 9,844
  • 1
  • 27
  • 52
  • +1: Interesting method indeed. Did you happen to benchmark it for larger data sets as well? Also note that with this approach you can only compute the mean value and not apply an arbitrary function as with `accumarray`. – Eitan T Sep 03 '13 at 10:07
1

Thanks, I also thought of:

D = [arrayfun(@(x) find(C == x,1,'first'), A(:,1) ), A(:,2)];
data = [B(D(:,1),:), D(:,2)];
st = grpstats(data(:,1:3),data(:,4:4),{'mean'});
matlabit
  • 838
  • 2
  • 13
  • 31
  • 1
    I fear that for large data sets `arrayfun` might be rather slow. Also, this requires the statistics toolbox, so this solution might not always be applicable. – Eitan T Sep 02 '13 at 09:09
  • Nice that you posted this answer after finding it. However, note that arrayfun can sometimes behave like a loop so it may not be much faster. The only way to really know which is best for you is to try both and find out! – Dennis Jaheruddin Sep 02 '13 at 09:16
  • @matlabit I benchmarked your this in my solution and it is significantly slower. – Mohsen Nosratinia Sep 03 '13 at 07:02