
I've got a primitive byte[] whose elements can have arbitrary values. I'm trying to count how often each value occurs in the array in the fastest way possible. Currently I'm using:

HashMap<Byte, Integer> dataCount = new HashMap<>();
for (byte b : data) dataCount.put(b, dataCount.getOrDefault(b, 0) + 1);

This one-liner takes ~500ms to process a byte[] of length 24883200. Using a regular for loop takes at least 600ms.

I've been thinking of constructing a set (since they only contain one of each element) then adding it to a HashMap using Collections.frequency(), but the methods to construct a Set from primitives require several other calls, so I'm guessing it's not as fast.

What would be the fastest way to accomplish counting of occurrences of each item?

I'm using Java 8 and I'd prefer to avoid using Apache Commons if possible.

Cœur
user_4685247

2 Answers


If it's just bytes, use an array, don't use a map. You do have to use masking to deal with the signedness of bytes, but that's not a big deal.

int[] counts = new int[256];
for (byte b : data) {
   counts[b & 0xFF]++;
}

Arrays are just so massively compact and efficient that they're almost impossible to beat when you can use them.

Louis Wasserman
  • This would work, but it'd also allocate memory for values that don't occur. I'll need a HashMap of these values later on, without the 0 values of course. – user_4685247 May 06 '15 at 17:33
  • Seems the speedup is so big that I can afford copying the counts into a HashMap afterwards! – user_4685247 May 06 '15 at 17:36
  • An `int[]` is _so_ much more compact than a `HashMap` that the "memory for values that don't occur" is almost certainly paid for by not using a `HashMap`. Depending on how large the counts are, the `int[256]` is better if you have even ~20 distinct bytes, and all the other 236 values are 0. – Louis Wasserman May 06 '15 at 17:36
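If a HashMap without the zero entries is still needed later, as the comments above discuss, one possible approach is to count into the array first and then copy only the non-zero slots into a map. The following is a minimal sketch that should work on Java 8; the class and method names (ByteCounts, count, toMap) are made up for illustration and do not come from the answer.

```java
import java.util.HashMap;
import java.util.Map;

final class ByteCounts {
    // Counts occurrences of each byte value using a 256-slot array.
    static int[] count(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask to get an index in [0, 255]
        }
        return counts;
    }

    // Copies only the non-zero counts into a map keyed by the original signed byte.
    static Map<Byte, Integer> toMap(int[] counts) {
        Map<Byte, Integer> result = new HashMap<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] != 0) {
                result.put((byte) i, counts[i]); // the cast restores the signed value
            }
        }
        return result;
    }
}
```

The copy step performs at most 256 boxing operations and insertions regardless of the input length, so its cost should be negligible next to the counting loop over millions of bytes.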

I would create an array instead of a HashMap, given that you know exactly how many counts you need to keep track of:

int[] counts = new int[256];
for (byte b : data) {
    counts[b & 0xff]++;
}

That way:

  • You never need to do any boxing of either the keys or the values
  • Nothing needs to take a hash code, check for equality etc
  • It's about as memory-efficient as it gets

Note that the & 0xff is used to get a value in the range [0, 255] instead of [-128, 127], so it's suitable as the index into the array.
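To make the effect of the mask concrete, here is a small sketch of the mapping from signed byte values to array indices. The MaskDemo class and its toIndex method are hypothetical names used only for this illustration; Byte.toUnsignedInt is a standard library method available since Java 8 that gives the same result.

```java
final class MaskDemo {
    // Maps a signed byte in [-128, 127] to an index in [0, 255].
    static int toIndex(byte b) {
        return b & 0xff; // b is widened to int, then the upper 24 bits are cleared
    }

    // Equivalent mapping using the standard library (Java 8+).
    static int toIndexViaLibrary(byte b) {
        return Byte.toUnsignedInt(b);
    }
}
```

With this mapping, non-negative bytes keep their value (0 stays 0, 127 stays 127), while negative bytes land in the upper half of the array (-128 becomes 128, -1 becomes 255).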

Jon Skeet