13

I have this code that generates a HashSet and calls removeAll() on it. I made a class A which is just a wrapper of an int, which records the number of times equals is called - the program outputs that number.

import java.util.*;

class A {
    int x;
    static int equalsCalls;

    A(int x) {
        this.x = x;
    }

    @Override
    public int hashCode() {
        return x;
    }

    @Override
    public boolean equals(Object o) {
        equalsCalls++;
        if (!(o instanceof A)) return false;
        return x == ((A)o).x;
    }

    public static void main(String[] args) {
        int setSize = Integer.parseInt(args[0]);
        int listSize = Integer.parseInt(args[1]);
        Set<A> s = new HashSet<A>();
        for (int i = 0; i < setSize; i ++)
            s.add(new A(i));
        List<A> l = new LinkedList<A>();
        for (int i = 0; i < listSize; i++)
            l.add(new A(i));
        s.removeAll(l);
        System.out.println(A.equalsCalls);
    }
}

It turns out that the removeAll operation is not linear as I would have expected:

$ java A 10 10
55
$ java A 100 100
5050
$ java A 1000 1000
500500

In fact, it seems to be quadratic. Why?

Replacing the line s.removeAll(l); with for (A b : l) s.remove(b); fixes it to act the way I'd expect:

$ java A 10 10
10
$ java A 100 100
100
$ java A 1000 1000
1000

Why?

Here's a graph showing the quadratic curve:

enter image description here

The x and y axes are the two arguments to the Java program. The z axis is the number of A.equals calls.


The graph was generated by this Asymptote program:

import graph3;
import grid3;
import palette;

currentprojection=orthographic(0.8,1,1);

size(400,300,IgnoreAspect);

defaultrender.merge=true;

real[][] a = new real[][]{
  new real[]{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
  new real[]{0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1},
  new real[]{0,1,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3},
  new real[]{0,1,2,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6},
  new real[]{0,1,2,3,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10},
  new real[]{0,1,2,3,4,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15},
  new real[]{0,1,2,3,4,5,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21},
  new real[]{0,1,2,3,4,5,6,28,28,28,28,28,28,28,28,28,28,28,28,28,28},
  new real[]{0,1,2,3,4,5,6,7,36,36,36,36,36,36,36,36,36,36,36,36,36},
  new real[]{0,1,2,3,4,5,6,7,8,45,45,45,45,45,45,45,45,45,45,45,45},
  new real[]{0,1,2,3,4,5,6,7,8,9,55,55,55,55,55,55,55,55,55,55,55},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,66,66,66,66,66,66,66,66,66,66},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,78,78,78,78,78,78,78,78,78},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,91,91,91,91,91,91,91,91},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,105,105,105,105,105,105,105},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,120,120,120,120,120,120},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,136,136,136,136,136},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,153,153,153,153},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,171,171,171},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,190,190},
  new real[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,210},
};


surface s=surface(a,(-1/2,-1/2),(1/2,1/2),Spline);
draw(s,mean(palette(s.map(zpart),Rainbow())),black);
//grid3(XYZgrid);

The array a was generated by this Haskell program:

import System.Process
import Data.List

f :: Integer -> Integer -> IO Integer
f x y = fmap read $ readProcess "/usr/bin/java" ["A", show x, show y] ""

g :: [[Integer]] -> [String]
g xss = map (\xs -> "new real[]{" ++ intercalate "," (map show xs) ++ "},") xss

main = do
  let n = 20
  xs <- mapM (\x -> mapM (\y -> f x y) [0..n]) [0..n]
  putStrLn $ unlines $ g xs
Dog
  • 7,707
  • 8
  • 40
  • 74

1 Answers1

11

The time it takes for removeAll to work depends on what kind of collection you pass. When you look at the implementation of removeAll:

public boolean removeAll(Collection<?> c) {
    boolean modified = false;

    if (size() > c.size()) {
        for (Iterator<?> i = c.iterator(); i.hasNext(); )
            modified |= remove(i.next());
    } else {
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}

one can see that when the HashSet and the collection have the same size, it iterates over the HashSet and calls c.contains for each element. Since you are using a LinkedList as the argument, this is an O(n) operation for each element. Since it needs to do this for each of the n elements being removed, the result is an O(n2) operation.

If you replace the LinkedList with a collection that offers a more efficient implementation of contains, then the performance of removeAll should improve accordingly. If you make the HashSet larger than the collection, performance should also improve dramatically, since it will then iterate over the collection.

Ted Hopp
  • 232,168
  • 48
  • 399
  • 521
  • 2
    Why is `removeAll` doing all this? I thought `s.removeAll(c)` was just shorthand for `for (T x : c) s.remove(x);`. Is this documented or implementation dependent? – Dog Jul 25 '13 at 02:59
  • @Dog - It's not documented, so I'm sure it is implementation dependent. My guess is that checking the relative sizes of the collections is a heuristic that works well under the assumption that the two collections (the argument and `this`) have similar complexity for `contains`. The heuristic makes a lot of sense when the two collections have very different sizes. Your test happens to use a collection type that violates both the assumption and the conditions for the heuristic to be useful. – Ted Hopp Jul 25 '13 at 03:56
  • Actually, I think it might be specified behavior. In the HashSet API docs, it has `removeAll` in "Methods inherited from class java.util.AbstractSet", and in `AbstractSet`, it says what you said. Does the "Methods inherited from class" count as part of the spec? Or was that automatically generated from the code of an implementation? – Dog Jul 26 '13 at 14:59
  • @Dog - Wow. I never noticed that this was actually documented behavior. It certainly seems to be part of the specification for how `AbstractSet` behaves and, since `HashSet` does not override that method, it inherits the requirement to behave like that. However, the documentation starts with "_This_ implementation determines..." (emphasis added), so I'll leave it to the more legally minded to decide whether this is part of the specification of the method or a (non-binding) side comment. I think this was written before there was an awareness of how important the exact wording of the spec is. – Ted Hopp Jul 26 '13 at 15:33
  • @TedHopp the _specification_ for `removeAll()` is given by the `Set` interface, and is just "Removes from this set all of its elements that are contained in the specified collection". The comment on `AbstractSet` then explains how that specification is implemented by this particular class. – Ian Roberts Jul 27 '13 at 10:05
  • @IanRoberts - Well, it's a matter of debate whether it's a comment or an extension of the spec. A similar problem occurs in `String.hashCode()`: the "spec" is given by `Object.hashCode()`, but `String.hashCode()` specifies _how_ the hash code is computed for `String` objects. This is generally considered as an extension of the spec; people have relied on being able to reproduce the hash code value of a string in their own code. It's also widely considered to have been a mistake to document the internal operation of `String.hashCode()` that way, but it's too late to do anything about it now. – Ted Hopp Jul 28 '13 at 01:34