remove duplicates in column 1 of array by retaining only that entry in column 1 that has maximum value in column 2

Question

I have an n x 2 matrix that was formed by appending many matrices together. Column 1 holds numbers that indicate item_ids and column 2 holds similarity values. Because the matrix was built by concatenation, column 1 may contain duplicate values, which I do not want. For every value X that appears more than once in column 1, I would like to remove all rows where column 1 = X, except the row whose column 2 value is the maximum among all rows for X.

Example:

  1    0.85
  1    0.5
  1    0.95
  2    0.5

result required:
    1 0.95
    2 0.5

obtained by removing all the rows in the n X 2 matrix where the duplicate values in column 1 did not have the maximum value in column 2.

chappjc · Accepted Answer · 2014-02-26T04:58:27.213

2

If you might have gaps in the index, use sparse output:

>> result = accumarray( M(:,1), M(:,2), [], @max, 0, true)
>> uMat = [find(result) nonzeros(result)]
uMat =
    1.0000    0.9500
    2.0000    0.5000

This also simplifies creation of the first column of the output.
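As a minimal sketch of how the sparse output behaves when ids are missing, the snippet below runs the same two lines on made-up data (the matrix M here is hypothetical and not from the question) with ids 1, 2 and 5. It assumes every per-id maximum is non-zero, because find and nonzeros would otherwise drop that id.

```matlab
% Hypothetical example data: ids 1, 2 and 5 occur, ids 3 and 4 do not
M = [1 0.85; 5 0.30; 1 0.50; 2 0.50; 1 0.95; 5 0.70];

% Maximum of column 2 per id; fill value 0, sparse output (6th argument)
result = accumarray(M(:,1), M(:,2), [], @max, 0, true);

% The non-zero rows of the sparse column are the ids that actually occur
% (assumption: no id has a maximum of exactly 0)
uMat = [find(result) nonzeros(result)];
```

With this input, uMat should contain one row per id that is present, with no rows for the missing ids 3 and 4.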

A couple of other ways to do it with unique.

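First way, use sort with 'descend' ordering and keep the first occurrence of each id (passing 'first' explicitly, since releases before R2013a return the last occurrence by default):

>> [~,IS] = sort(M(:,2),'descend');
>> [~,ia] = unique(M(IS,1),'first');
>> M(IS(ia),:)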
ans =
    1.0000    0.9500
    2.0000    0.5000

Second, use sortrows (ascending sort by second column), and unique with the 'last' occurrence option:

>> [Ms,IS] = sortrows(M,2);
>> [~,ia] = unique(Ms(:,1),'last');
>> M(IS(ia),:)
ans =
    1.0000    0.9500
    2.0000    0.5000

edited Feb 26 '14 at 04:58

answered Feb 26 '14 at 04:15


Yep - after posting my original answer I had a chance to try some stuff and discovered that using sparse is the way to go. But you posted that before I could come back and update my answer... – Floris Feb 26 '14 at 04:21
I don't like the expanded answer as much as I liked the original. – Floris Feb 26 '14 at 05:20
@Floris Me neither, to be honest. I just wanted to put some other options out there. Would it have been smarter to put the extra stuff in a different answer? Sometimes clutter can be a negative. – chappjc Feb 26 '14 at 05:30
@chappjc, Floris, the sparse approach seems to be the way to go for me as well; it's much more concise than my answer. Thanks to both of you for your answers. – anonuser0428 Feb 26 '14 at 05:57

Floris · Answer 2 · 2014-02-26T15:43:13.840

1

You can try

result = accumarray( M(:,1), M(:,2), [max(M(:,1)) 1], @max);

According to the documentation, that should work.

Apologies I can't try it out right now...

update - I did try the above, and it gave me the max values correctly. However it doesn't give you the indices corresponding to the max values. For that, you need to do a bit more work (since the identifiers probably aren't sorted).

result = accumarray( M(:,1), M(:,2), [], @max, 0, true);  % fill value 0, issparse = true: create a sparse matrix
c1 = find(result);     % to get the indices of nonzero values
c2 = full(result(c1)); % to get the values corresponding to the indices
answer = [c1 c2];      % to put them side by side

edited Feb 26 '14 at 15:43

answered Feb 26 '14 at 03:35


I don't think the solution works as written above but I will try looking into the accumarray function. Thanks for your guidance. – anonuser0428 Feb 26 '14 at 03:50
@user1009091 - as written my solution gives "half an answer". Using the `issparse` parameter as shown in chappjc's answer gets you the rest of the way. – Floris Feb 26 '14 at 04:23
I decided to finish the "update" I had started (I found it half-finished in another browser window) just to clear things up a bit. No need for sorting or `unique` after you use the sparse matrix approach. I think this is the most "matlab like" solution now. – Floris Feb 26 '14 at 15:45

anonuser0428 · Answer 3 · 2014-02-26T04:21:22.403

result = accumarray( M(:,1), M(:,2), [max(M(:,1)) 1], @max);

finalResult = [unique(M(:,1)), nonzeros(result)]  % unique already returns the ids in sorted order

This basically reattaches the required item_ids, in sorted order, to the corresponding max_similarity values in the second column. As a result, in the finalResult matrix each value in column 1 is unique and the corresponding value in column 2 is the maximum similarity value for that item_id. @Floris, thanks for your help; I couldn't have solved this without it.
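The reattachment step above depends on nonzeros, so it would misalign if some item_id had a maximum of exactly 0. One possible variant, sketched here on hypothetical data that is not from the question, uses the third output of unique as the grouping index, so neither gaps in the ids nor zero maxima matter.

```matlab
% Hypothetical example data: non-contiguous ids, one id whose maximum is 0
M = [10 0.85; 10 0.95; 40 0; 40 -0.2; 25 0.5];

% ids: sorted unique item ids; ic: group number (1..numel(ids)) of each row
[ids, ~, ic] = unique(M(:,1));

% Maximum of column 2 within each group, placed next to its id
finalResult = [ids accumarray(ic, M(:,2), [], @max)];
```

Because the groups are numbered 1 to numel(ids), accumarray returns exactly one value per id and no sparse output or fill value is needed.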

Luis Mendo · Answer 4 · 2014-02-26T11:57:55.630

0

Yet another approach: use sortrows and then diff to select the last row for each value of the first column:

M2 = sortrows(M);
result = M2(diff([M2(:,1); inf])>0,:);

This also works if the indices in the first column have gaps.
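To illustrate the mask that diff produces, here is a small sketch on made-up data with a gap in the ids (the matrix M and the variable name isLast are hypothetical, introduced only for this example):

```matlab
% Hypothetical example data: ids 1, 2 and 5 occur, ids 3 and 4 do not
M = [1 0.85; 5 0.30; 1 0.50; 2 0.50; 1 0.95; 5 0.70];

M2 = sortrows(M);                   % sort by column 1, ties broken by column 2
isLast = diff([M2(:,1); inf]) > 0;  % true where the next row has a different id
result = M2(isLast, :);             % last row of each id = its maximum
```

The appended inf guarantees that the final row is always treated as the end of a group.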

edited Feb 26 '14 at 11:57

answered Feb 26 '14 at 10:16

