
I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).

Currently, I am adding all items of one set to the other with:

setOne ++= setTwo

This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).

Any ideas how to speed things up?

Alexandros
  • These are mutable sets, right? – Daniel C. Sobral Aug 03 '11 at 12:49
  • What do you do with the merged set afterwards? What operations and how many? (I'm thinking you could take a lazy approach and not bother merging the sets at all if there are only a small number of things you will do with it - just do the op on one or both sets as appropriate) – The Archetypal Paul Aug 03 '11 at 12:53
  • Do you know if the performance is impacted by memory heap size? Sometimes when the JVM runs out of heap, performance is degraded as the garbage collector spends all its time reclaiming memory. – huynhjl Aug 03 '11 at 14:07
  • @huynhjl Your point is valid. I did run into severe degradation initially, but fixed it by applying the flyweight pattern to the set members. – Alexandros Aug 04 '11 at 04:53
  • @Paul: in this particular case I just save the sets to a text file, so this will work. However, the point of this question is to actually find ways to improve the merging of large sets.... – Alexandros Aug 04 '11 at 04:59
  • @Daniel: yes, they are scala.collection.mutable.HashSet – Alexandros Aug 04 '11 at 05:00

3 Answers


You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:

setOne.par ++ setTwo

or

(setOne.par /: setTwo)(_ + _)
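
For completeness, a minimal sketch of how this might look end to end. It assumes Scala 2.9–2.12, where .par is built into the standard library; on 2.13+ the separate scala-parallel-collections module is needed, along with import scala.collection.parallel.CollectionConverters._. The names below are illustrative only:

    import scala.collection.mutable

    // Toy stand-ins for the two large sets (illustrative data only).
    val setOne = mutable.HashSet("a", "b", "c")
    val setTwo = mutable.HashSet("c", "d", "e")

    // Parallel union; the result is a parallel set, so .seq converts it back
    // to an ordinary (sequential) set if that is what the rest of the code expects.
    val merged = (setOne.par ++ setTwo).seq
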
Vasil Remeniuk

There are a few things you might want to try:

  • Use the sizeHint method to keep your sets at the expected size.
  • Call useSizeMap(true) on it to get better hash table resizing.

It seems to me that the latter option gives better results, though both showed improvements in my tests (a sketch of how these might be applied follows below).
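
A minimal sketch of how these hints might be applied, assuming a mutable.HashSet whose hash table exposes sizeHint and useSizeMap as described above (both methods are version-dependent; the names are taken from the suggestions, not verified against every Scala release), and using toy data in place of the real sets:

    import scala.collection.mutable

    // Toy stand-ins for the two large sets (illustrative data only).
    val setOne = mutable.HashSet(1, 2, 3)
    val setTwo = mutable.HashSet(3, 4, 5)

    // Pre-size the target table to an upper bound on the merged size,
    // so it is not resized repeatedly during the merge.
    setOne.sizeHint(setOne.size + setTwo.size)

    // Ask the hash table to maintain a size map for cheaper resizing
    // (2.9-era API; not available in every Scala version).
    setOne.useSizeMap(true)

    setOne ++= setTwo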

Daniel C. Sobral
  • That's generally useful. Unfortunately I'm doing a brute-force search and have no idea what the size of the individual sets will be; at least not until I've calculated them... – Alexandros Aug 03 '11 at 20:18
  • @Alexandros You could always call `size` on each collection and estimate the size of the merge. Or use `useSizeMap`, which does not require you to tell it anything. – Daniel C. Sobral Aug 03 '11 at 22:17

Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here are a few things that can be done:

  • If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done in merge sort (a sketch of this appears below). This operation is pretty trivially parallelizable, since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
  • If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
  • If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.

I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
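
A minimal sketch of the first strategy, under assumed preconditions: both inputs are already sorted, deduplicated Int arrays, and the merge walks a pointer through each. The method name and element type are illustrative, not from the answer:

    // Merge two sorted, deduplicated Int arrays into one sorted, deduplicated array.
    def mergeSorted(a: Array[Int], b: Array[Int]): Array[Int] = {
      val out = Array.newBuilder[Int]
      out.sizeHint(a.length + b.length)
      var i = 0
      var j = 0
      while (i < a.length && j < b.length) {
        if (a(i) < b(j))      { out += a(i); i += 1 }
        else if (a(i) > b(j)) { out += b(j); j += 1 }
        else                  { out += a(i); i += 1; j += 1 } // element in both sets: emit once
      }
      while (i < a.length) { out += a(i); i += 1 }
      while (j < b.length) { out += b(j); j += 1 }
      out.result()
    }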

Joshua Hartman