
I wrote this function to remove duplicates from a TList descendant. Now I was wondering whether this could give me problems under certain conditions, and how it does performance-wise.

It seems to work with object pointers.

function TListClass.RemoveDups: integer;
var
  total, i, j: integer;
begin
  total := 0;
  i := 0;
  while i < count do begin
    j := i + 1;
    while j < count do begin
      if items[i] = items[j] then begin
        Delete(j); // delete by index; Remove(items[j]) would search for and remove the *first* occurrence
        inc(total);
      end
      else
        inc(j);
    end;
    inc(i);
  end;
  result := total;
end;

Update: Does this work faster?

function TDrawObjectList.RemoveDups: integer;
var
  total, i: integer;
  templist: TList;
begin
  templist := TList.Create;
  try
    total := 0;
    i := 0;
    while i < count do
      if templist.IndexOf(items[i]) = -1 then begin
        templist.Add(items[i]); // store the item pointer itself, not the loop index
        inc(i);
      end else begin
        Delete(i);              // the current entry is a duplicate of one already seen
        inc(total);
      end;
    result := total;
  finally
    templist.Free;
  end;
end;

You do need another list, though.

r_j
  • The runtime is quadratic, so unless your lists are small (<1000 elements), it will be *really slow*. – Niklas B. Oct 23 '12 at 09:01
  • @NiklasB. Items will not exceed 1000 in the current use, but they might in the future. How would you go about making it non-quadratic? – r_j Oct 23 '12 at 09:06
  • Your update is no better. `IndexOf` is O(n) in time. – David Heffernan Oct 23 '12 at 09:21
  • Is it possible to check for duplicates while adding new values to the list? – teran Oct 23 '12 at 10:07
  • Just a thought: if you prevent adding duplicates in the first place (same as `TStringList.Duplicates = dupIgnore`), you will not have such a problem. – kobik Oct 23 '12 at 10:07
  • Why so? It would still be quadratic, just moved to the population stage. – Arioch 'The Oct 23 '12 at 10:09
  • First sort the list with List.Sort, which uses quicksort; then go through the list in reverse order and compare the current item to the one before it in the list. If they are the same, remove the current item. You need to write your own compare function, though (see the sketch after these comments). – Pieter B Oct 23 '12 at 11:03
  • @PieterB: Why don't you make that an answer? :) – Kromster Oct 23 '12 at 11:50
  • That would require me to write the code. And I don't have time for that at the moment. – Pieter B Oct 23 '12 at 11:57
  • @Arioch'The Not quadratic. Sorting can be done in n log n. – David Heffernan Oct 23 '12 at 12:47
  • @David teran and kobik did not speak about sorting, only about duplicate elimination during Insert calls, which, on an unsorted list, is again linear for each element. So in effect that spreads the load rather than reducing it. – Arioch 'The Oct 23 '12 at 13:20
  • @Arioch'The OK, I see what you mean. – David Heffernan Oct 23 '12 at 13:26
  • @Arioch'The, even if the list is not sorted, checking for duplicates when adding eliminates the need for *removing* duplicate elements from the list (which is quite painful). – kobik Oct 23 '12 at 13:28
  • @kobik - I agree. Yet that still does not reduce the load but redistributes it; the task would still be quadratic, just with a smoother load. OTOH it is possible that early elimination of dups would break some other logic, who knows. – Arioch 'The Oct 23 '12 at 13:40
  • @Arioch'The, note that `Duplicates = dupIgnore` does nothing if the list is not sorted. Also, as I said, removing elements from a list costs about twice as much as checking for duplicates while constructing it (tested). – kobik Oct 23 '12 at 13:55
  • That probably depends on the TList implementation. With a naive array-based implementation and deleting in a forward loop (instead of `downto`) there would be a lot of data copying, up to the point where reconstructing a clone of the list might be better than deleting items :-) – Arioch 'The Oct 23 '12 at 14:15
  • @Arioch'The, when you address someone via comments, please use the @[username] so that the user gets notified. – kobik Oct 23 '12 at 14:25
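
A rough sketch of the sort-then-scan approach Pieter B describes, assuming a plain TList of pointers; ComparePointers and RemoveDupsSorted are illustrative names, the comparison simply orders by pointer value, and the original list order is lost:

// Compare function for TList.Sort: orders items by their pointer value.
function ComparePointers(Item1, Item2: Pointer): Integer;
begin
  if NativeUInt(Item1) < NativeUInt(Item2) then
    Result := -1
  else if NativeUInt(Item1) > NativeUInt(Item2) then
    Result := 1
  else
    Result := 0;
end;

function RemoveDupsSorted(List: TList): Integer;
var
  i: Integer;
begin
  Result := 0;
  List.Sort(@ComparePointers);         // O(n log n)
  for i := List.Count - 1 downto 1 do  // reverse loop so Delete is safe
    if List[i] = List[i - 1] then      // after sorting, duplicates are adjacent
    begin
      List.Delete(i);
      Inc(Result);
    end;
end;

The total cost is the O(n log n) sort plus one linear pass, at the price of losing the original order (Pieter B's later comment notes that storing an index before sorting would let you restore it).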

2 Answers


As noted, the solution is O(N^2), which makes it really slow on a big set of items (thousands), but as long as the count stays low it's the best bet because of its simplicity and ease of implementation, whereas pre-sorting and other solutions need more code and are more prone to implementation errors.

This may be the same code written in a different, more compact form. It runs through all elements of the list and, for each one, removes duplicates to the right of the current element. Removal is safe as long as it's done by index in a reverse loop.

function TListClass.RemoveDups: Integer;
var
  I, K: Integer;
begin
  Result := 0;
  for I := 0 to Count - 1 do            //Compare to everything on the right
    for K := Count - 1 downto I + 1 do  //Reverse loop allows removing items safely
      if Items[K] = Items[I] then
      begin
        Delete(K); //Delete by index; Remove would search for (and remove) the first occurrence instead
        Inc(Result);
      end;
end;

I would suggest leaving optimizations for later, if you really do end up with a 5000-item list. Also, as noted above, if you check for duplicates when adding items to the list you gain the following (see the sketch after this list):

  • The duplicate check gets distributed in time, so it won't be as noticeable to the user
  • You can hope to quit the search early when a dupe is found
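
A minimal sketch of that add-time check, using a hypothetical AddUnique wrapper on the question's TDrawObjectList; IndexOf is still O(n), so populating the list remains quadratic overall, it just spreads the cost and avoids a separate removal pass:

// Hypothetical wrapper: only appends the item if it is not in the list yet.
function TDrawObjectList.AddUnique(Item: Pointer): Integer;
begin
  Result := IndexOf(Item);   // linear scan; -1 means the item is not in the list yet
  if Result = -1 then
    Result := Add(Item);     // append and return the new index; duplicates are silently ignored
end;
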
Kromster
  • The number of comparisons is (N-1) + (N-2) + ... + 1 = N*(N-1)/2 = N^2/2 - N/2, a classic quadratic asymptote. – Arioch 'The Oct 23 '12 at 10:58
  • Please stop providing quadratic functions to remove duplicates from a list. They're hidden predators: they compare each item to all other items in the list, so at 100 items you're looking at 10,000 comparisons, at 1,000 items at a million, and at 5,000 items at 25 million comparisons, which is when your program will slow down to a crawl. – Pieter B Oct 23 '12 at 11:33
  • Thanks Arioch, fixed my phrase – Kromster Oct 23 '12 at 11:33
  • @PieterB: We have a job to handle 100 items and we need to solve it now: we can spend 5 minutes and do O(N^2), or we can spend 2 hours and do O(N log N). The question is, do we really need to spend 2 hours on each case with 10..100..500 items, to get a less maintainable and more bug-prone result? Probably not. What we do need is to rely on profiling results and optimize *what matters* and *when it matters*. – Kromster Oct 23 '12 at 11:38
  • @Krom It's far easier and a lot safer to implement it the right way the first time than to hack around it in a year's time, when your customer asks why his application is slowing down. Especially in such clear-cut and well-known cases. – Pieter B Oct 23 '12 at 11:54
  • @PieterB: When you get a call from a customer, you profile the code and optimize that single isolated `RemoveDups` function in any fashion you like. Short lists are much more common than lists with thousands of items. Sadly, the OP does not mention how many items he has in the list or whether performance is really a problem. – Kromster Oct 23 '12 at 12:05
  • @David if you want to preserve order, add an index to the items before you sort, and afterwards you can order them by that index again. This does get a bit more complex than needed for very small lists, though. – Pieter B Oct 23 '12 at 12:42
  • @Pieter OK. That's n log n, isn't it? – David Heffernan Oct 23 '12 at 12:45
  • Thanks for the answer and interesting discussion. The list will hold between 5 and 30 items, with an estimated maximum of 300, so for now this solution is best. But the question was answered: once my list grows beyond 1000 items it will lose performance fast. – r_j Oct 23 '12 at 12:47
  • @David, that depends on your choice of sorting algorithm, but with quicksort it is most of the time. – Pieter B Oct 23 '12 at 12:49

Just hypothetical:

Interfaces

If you have interfaced objects in a TInterfaceList and they are referenced only by that list, you could check the reference count of each object: just loop through the list backwards and delete every entry whose object has a refcount > 1.

Custom counter

If you can edit these objects, you could do the same without interfaces: increment a counter on the object when it is added to the list and decrement it when it is removed.

Of course, this only works if you can actually add a counter to these objects, but the boundaries weren't exactly clear in your question, so I don't know if this is allowed.

The advantage is that you never need to search for other items, neither when inserting nor when removing duplicates. Finding a duplicate in a sorted list can be fast (as mentioned in the comments), but not having to search at all beats even the fastest lookup.
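
A minimal sketch of the custom-counter idea, assuming the list holds objects of a hypothetical TCountedObject class, that the corresponding methods are declared on the question's TDrawObjectList, and that every addition and removal goes through wrappers that keep the counter in sync:

type
  // Hypothetical item class: InListCount tracks how many times the object
  // is currently held by the list.
  TCountedObject = class
  public
    InListCount: Integer;
  end;

// Wrapper that keeps the counter in sync when adding.
procedure TDrawObjectList.AddCounted(Obj: TCountedObject);
begin
  Inc(Obj.InListCount);
  Add(Obj);
end;

function TDrawObjectList.RemoveDups: Integer;
var
  i: Integer;
  Obj: TCountedObject;
begin
  Result := 0;
  for i := Count - 1 downto 0 do   // reverse loop so Delete is safe
  begin
    Obj := TCountedObject(Items[i]);
    if Obj.InListCount > 1 then    // held more than once: drop this occurrence
    begin
      Dec(Obj.InListCount);
      Delete(i);
      Inc(Result);
    end;
  end;
end;

No searching is involved at any point, so the whole pass is a single O(n) loop.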

GolezTrol