I have an n-by-2 matrix that was formed by concatenating many matrices. Column 1 holds item_ids and column 2 holds similarity values. Because of the concatenation, column 1 may contain duplicate ids, which I do not want. For every value X that occurs more than once in column 1, I would like to remove all rows where column 1 = X, except the row whose column 2 value is the maximum among all rows for X.

Example:

  1    0.85
  1    0.5
  1    0.95
  2    0.5

Required result:
    1 0.95
    2 0.5 

This is obtained by removing every row of the n-by-2 matrix whose column 1 value is duplicated and whose column 2 value is not the maximum for that id.

chappjc
anonuser0428

4 Answers

If you might have gaps in the index, use sparse output:

>> result = accumarray( M(:,1), M(:,2), [], @max, 0, true)
>> uMat = [find(result) nonzeros(result)]
uMat =
    1.0000    0.9500
    2.0000    0.5000

This also simplifies creation of the first column of the output.
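To make the effect of gaps visible, here is a minimal sketch with made-up data (the matrix M below is hypothetical, with ids 3 to 6 missing). It assumes the ids are positive integers, which accumarray requires for its subscripts. Note that an id whose maximum is exactly 0 would not be reported by find, since only nonzero entries are returned:

```matlab
% Hypothetical sample data: item_ids 1, 2 and 7 (ids 3..6 never occur)
M = [1 0.85; 7 0.30; 1 0.50; 2 0.50; 1 0.95; 7 0.60];

% Sparse accumulation: fill value 0, issparse = true
result = accumarray(M(:,1), M(:,2), [], @max, 0, true);

% find gives the row indices (the ids), nonzeros gives the stored maxima
uMat = [find(result) nonzeros(result)];
disp(uMat)
```

With a full (non-sparse) result, the missing ids would appear as zero-filled rows that have to be filtered out separately.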


A couple of other ways to do it with unique.

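First way, use sort with 'descend' ordering, then keep the first occurrence of each id (the 'first' option is the default in recent MATLAB releases, but older releases return the last occurrence unless it is given):

>> [~,IS] = sort(M(:,2),'descend');
>> [C,ia] = unique(M(IS,1),'first');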
>> M(IS(ia),:)
ans =
    1.0000    0.9500
    2.0000    0.5000

Second, use sortrows (ascending sort by the second column), and unique with the 'last' occurrence option:

>> [Ms,IS] = sortrows(M,2);
>> [~,ia] = unique(Ms(:,1),'last');
>> M(IS(ia),:)
ans =
    1.0000    0.9500
    2.0000    0.5000
chappjc
  • Yep - after posting my original answer I had a chance to try some stuff and discovered that using sparse is the way to go. But you posted that before I could come back and update my answer... – Floris Feb 26 '14 at 04:21
  • I don't like the expanded answer as much as I liked the original. – Floris Feb 26 '14 at 05:20
  • @Floris Me neither, to be honest. I just wanted to put some other options out there. Would it have been smarter to put the extra stuff in a different answer? Sometimes clutter can be a negative. – chappjc Feb 26 '14 at 05:30
  • @chappjc, Floris, the sparse approach seems to be the way to go for me as well; it's much more concise than my answer. Thanks to both of you for your answers. – anonuser0428 Feb 26 '14 at 05:57

You can try

result = accumarray( M(:,1), M(:,2), [max(M(:,1)) 1], @max);

According to the documentation, that should work.

Apologies I can't try it out right now...

update - I did try the above, and it gave me the max values correctly. However it doesn't give you the indices corresponding to the max values. For that, you need to do a bit more work (since the identifiers probably aren't sorted).

result = accumarray( M(:,1), M(:,2), [], @max, 0, true);  % fill value 0, issparse = true to create a sparse matrix
c1 = find(result);     % to get the indices of nonzero values
c2 = full(result(c1)); % to get the values corresponding to the indices
answer = [c1 c2];      % to put them side by side
Floris
  • I don't think the solution works as written above but I will try looking into the accumarray function. Thanks for your guidance. – anonuser0428 Feb 26 '14 at 03:50
  • @user1009091 - as written my solution gives "half an answer". Using the `issparse` parameter as shown in chappjc's answer gets you the rest of the way. – Floris Feb 26 '14 at 04:23
  • I decided to finish the "update" I had started (I found it half-finished in another browser window) just to clear things up a bit. No need for sorting or `unique` after you use the sparse matrix approach. I think this is the most "matlab like" solution now. – Floris Feb 26 '14 at 15:45
result = accumarray( M(:,1), M(:,2), [max(M(:,1)) 1], @max);

finalResult = [sort(unique(M(:,1))),nonzeros(result)]

This reattaches the item_ids, in sorted order, to the corresponding maximum similarity values in the second column. As a result, each value in column 1 of finalResult is unique, and the corresponding value in column 2 is the maximum similarity value for that item_id. @Floris, thanks for your help; I couldn't have solved this without it.
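A possible variant of the same idea, sketched here under assumptions: instead of using the ids directly as subscripts, take the group index from the third output of unique. This is not the approach from the answers above, only an illustration. It avoids relying on nonzeros, so an id whose maximum similarity is exactly 0 is kept. The sample matrix M and the names ids, grp and maxSim are hypothetical:

```matlab
% Hypothetical sample data; note the 0 maximum for id 25
M = [10 0.85; 10 0.50; 10 0.95; 25 0.00; 25 -0.20];

% ids: sorted unique item_ids; grp: for each row, the position of its id in ids
[ids, ~, grp] = unique(M(:,1));

% One maximum per group; grp(:) guarantees a column of positive integer subscripts
maxSim = accumarray(grp(:), M(:,2), [], @max);

finalResult = [ids maxSim];
disp(finalResult)
```

Because the subscripts are 1..numel(ids), the result has no gaps regardless of how the ids are spaced.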

anonuser0428

Yet another approach: use sortrows and then diff to select the last row for each value of the first column:

M2 = sortrows(M);
result = M2(diff([M2(:,1); inf])>0,:);

This also works if the indices in the first column have gaps.
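The two lines above can be traced step by step on a small made-up matrix (the data and the name isLast are only for illustration):

```matlab
% Hypothetical sample data with a gap in the ids (no 2 or 4)
M = [5 0.2; 1 0.85; 5 0.9; 1 0.95; 3 0.5];

% Sort by column 1, ties broken by column 2 (both ascending),
% so the last row of each id group carries that id's maximum
M2 = sortrows(M);

% A row is the last of its group when the next id is larger;
% the appended inf makes the final row always qualify
isLast = diff([M2(:,1); inf]) > 0;

result = M2(isLast, :);
disp(result)
```

Here M2 is [1 0.85; 1 0.95; 3 0.5; 5 0.2; 5 0.9], so rows 2, 3 and 5 are selected.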

Luis Mendo