0

I have a set of half a million items stored in the database and need the following operations:

  • union(x, y) just like in Union-Find
  • findAll(x) finding all y such that find(x) == find(y)
  • ununion(x, y) reverting a former union operation

This is a practical problem, for which the following is known

  • The partitions will be typically small (less than 100 elements), but there's no guarantee.
  • The speed of union operation doesn't matter much.
  • findAll has to be fast and needs to be implemented in SQL (without recursion / CONNECT BY).
  • Sometimes, we find out, that some union was actually wrong and need to undo it, while keeping all the previous and following unions. This operation is rare enough, so the speed doesn't matter.
  • It's not necessary that findAll sees changes done by the other operations immediately. Some post-processing would be OK.

The classical Union-Find algorithm needs path compression (or a variant) for efficiency and allows no edge deletions (even without path compression). I'm aware of Dynamic connectivity, but it looks like non-applicable to my use case.

I guess, we can't use it, as the speed of findAll is the most important. Probably, we should link all nodes to the root directly.

Concerning ununion, my only idea is to store all union operations separately, and on ununion, remove all links from the corresponding partition and redo all related unions.

This sounds rather brute-force like...


Before starting implementing anything, I'm asking if there's smarter algorithm?

maaartinus
  • 44,714
  • 32
  • 161
  • 320
  • From the 3 operations you require, which do not include `find`, the obvious implementation is to have `union(x,y)` create an edge between `x` and `y`, and `findAll(x)` do a BFS. Are you really sure you don't want to do any kind of recursion in the SQL? – Matt Timmermans Dec 16 '18 at 23:23
  • @MattTimmermans I'm using MySQL which has no CONNECT BY and Hibernate, so I guess, I really want no recursion. – maaartinus Dec 17 '18 at 00:24

1 Answers1

0

I would recommend that you create a hash function for sets, and store it with each x in the database in an indexed column. On constructing a union(x, y) you can calculate the new hash value and store it. On findAll the use of this index means that you will be only comparing sets with a good chance of being the same, which makes it reasonably fast. And there is nothing fancy in the lookup, so you can reasonably implement it in naive SQL.

btilly
  • 43,296
  • 3
  • 59
  • 88