Efficient algorithm for pairwise comparison of elements

Question

Given an array with some key-value pairs:

[
  {'a': 1, 'b': 1},
  {'a': 2, 'b': 1},
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

I want to find an "intersection" of these pairs. By intersection I mean keeping only those elements that cover others or are unique, and dropping every element that is covered by another. For example, {'a': 1, 'b': 1, 'c': 1} and {'a': 1, 'b': 1, 'c': 2} each fully cover {'a': 1, 'b': 1}, while {'a': 2, 'b': 2} is unique. So, in

[
  {'a': 1, 'b': 1},
  {'a': 2, 'b': 1},
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

the following should remain after finding the intersection:

[
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

I tried iterating over all pairs and comparing them with each other to find covering pairs, but the time complexity is O(n^2). Is it possible to find all covering or unique pairs in linear time?

Here is my code example (O(n^2)):

public Set<Map<String, Integer>> find(Set<Map<String, Integer>> allPairs) {
  var results = new HashSet<Map<String, Integer>>();
  for (Map<String, Integer> stringToValue: allPairs) {
    results.add(stringToValue);
    var mapsToAdd = new HashSet<Map<String, Integer>>();
    var mapsToDelete = new HashSet<Map<String, Integer>>();
    for (Map<String, Integer> result : results) {
      var comparison = new MapComparison(stringToValue, result);
      if (comparison.isIntersected()) {
        mapsToAdd.add(comparison.max());
        mapsToDelete.add(comparison.min());
      }
    }
    results.removeAll(mapsToDelete);
    results.addAll(mapsToAdd);
  }
  return results;
}

where MapComparison is:

public class MapComparison {

    private final Map<String, Integer> left;
    private final Map<String, Integer> right;
    private final ComparisonDecision decision;

    public MapComparison(Map<String, Integer> left, Map<String, Integer> right) {
        this.left = left;
        this.right = right;
        this.decision = makeDecision();
    }

    private ComparisonDecision makeDecision() {
        var inLeftOnly = new HashSet<>(left.entrySet());
        var inRightOnly = new HashSet<>(right.entrySet());

        inLeftOnly.removeAll(right.entrySet());
        inRightOnly.removeAll(left.entrySet());

        if (inLeftOnly.isEmpty() && inRightOnly.isEmpty()) {
            return EQUALS;
        } else if (inLeftOnly.isEmpty()) {
            return RIGHT_GREATER;
        } else if (inRightOnly.isEmpty()) {
            return LEFT_GREATER;
        } else {
            return NOT_COMPARABLE;
        }
    }

    public boolean isIntersected() {
        return Set.of(LEFT_GREATER, RIGHT_GREATER).contains(decision);
    }

    public boolean isEquals() {
        return Objects.equals(EQUALS, decision);
    }

    public Map<String, Integer> max() {
        if (!isIntersected()) {
            throw new IllegalStateException();
        }
        return LEFT_GREATER.equals(decision) ? left : right;
    }

    public Map<String, Integer> min() {
        if (!isIntersected()) {
            throw new IllegalStateException();
        }
        return LEFT_GREATER.equals(decision) ? right : left;
    }

    public enum ComparisonDecision {
        EQUALS,
        LEFT_GREATER,
        RIGHT_GREATER,
        NOT_COMPARABLE,

        ;
    }
}

I'm not sure this can be done in linear time but if you first sort your data it might be doable in O(n*log(n)) — Thomas, Sep 06 '21 at 13:44
Put it another way, are you looking to remove all elements that are fully a subset of another element, or something else? — Mad Physicist, Sep 06 '21 at 13:47
@MadPhysicist, yes, in other words, I need to remove all elements that are a full subset of any other element. Nothing else. — Andrei Levin, Sep 06 '21 at 13:55
@MadPhysicist, `[{a:1, b:2}, {a:1, c:3}, {b:2, c:3}]` - nothing will be removed, because they are all unique. — Andrei Levin, Sep 06 '21 at 13:56
My first idea would be to find a way to **sort** the list, in a way that would guarantee that if an element covers another, then they are adjacent. (I don't know for sure that such a way to sort the list exists, but that would be convenient) — Stef, Sep 06 '21 at 14:00
Sadly, the accepted answer to the almost-duplicate [Efficient algorithm to find the maximal elements of a partially ordered set](https://stackoverflow.com/questions/21560659/efficient-algorithm-to-find-the-maximal-elements-of-a-partially-ordered-set) says *"It seems the worst case is O(n^2) no matter what you do."* — Stef, Sep 06 '21 at 14:10
Relevant keywords: the sublist you are trying to compute is called the **pareto front**, in the domain of multi-objective optimization. — Stef, Sep 06 '21 at 14:16
I wonder if treating each element as a polynomial (assuming each key-value pairing can be uniquely hashed) would allow one to find intersections with polynomial arithmetic. Each pairing in the element is the nth order coefficient. However, more clarity on the problem set is required - e.g. is `{a:1, b:2}` equivalent to `{b:2, a:1}` - does `{a:1, c:1, d:1, b:1}` contain `{a:1, b:1}`. I recommend making your input set more comprehensive. — , Sep 06 '21 at 14:17
I feel like union-find might actually be a close approximation of this problem. (Well at least the find part of the algorithm) which is O(log*(n)). One could start by using Sets with the lowest amount of elements and use these as elements for the "Find" algorithm. This would imo result in the same time complexity as @Thomas answer. I don't think one can go any faster, tho this might be up for debate. Upvoting the question tho because algorithms are always fun. Edit: According to https://cstheory.stackexchange.com/a/41388/62830 it is impossible to do this in O(n) — SirHawrk, Sep 06 '21 at 14:22
Perhaps https://javadoc.io/doc/io.jenetics/jenetics.ext/latest/io/jenetics/ext/moea/ParetoFront.html might be fast enough for the OP's purpose? — Stef, Sep 06 '21 at 14:26
I don't know about java, but the accepted answer for [Fast calculation of Pareto front in Python](https://stackoverflow.com/a/40239615/3080723) solves the problem with 10,000 arrays and 15 key-values per array, in 4 seconds. Would that be efficient enough for you? — Stef, Sep 06 '21 at 14:34
In your first example, why is {a:1, b:1} not in the output? Isn’t that the intersection of {a:1, b:1} and {a:1, b:1, c:1}? Or do intersections specifically require some existing value shared between the two to be different? — templatetypedef, Sep 06 '21 at 16:12
@templatetypedef The OP is not asking to produce intersections; it's asking to filter out subsets of other sets. — Stef, Sep 06 '21 at 17:43

kaya3 · Answer 1 · 2021-09-07T10:14:20.507


Here's an algorithm which may be better or worse, depending on the shape of the data. Let's simplify the problem by representing the input rows as sets instead of maps, because essentially you're only treating those maps as sets of pairs/entries. The problem is equivalent if the sets are like [a1, b1] and so on. The goal is to make a linear time algorithm assuming the lengths of the input rows are short. Let n be the number of input rows, and k be the maximum length of a row; our assumption is that k is much smaller than n.

Use a counting sort to sort the rows by length.
Initialise an empty HashSet for the result, where the members of the set will be rows (you will need an immutable, hashable class to represent the rows).
For each row:
- Remove each subset in the row's power set from the result, if it is present.
- Add the row to the result.

Since the rows are sorted by length, it is guaranteed that if row i is a subset of row j then row i would have been added before row j, and hence will later be correctly removed from the result set. Once the algorithm terminates, the result set contains exactly those input rows which are not subsets of any other input row.
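
A minimal Python sketch of the steps above, under the answer's assumption that k is small. The names `proper_subsets` and `maximal_rows` are made up for illustration, and rows are represented as frozensets of (key, value) entries. Only proper subsets are enumerated, since removing a row right before re-adding it has no effect.

```python
from itertools import combinations


def proper_subsets(row):
    """Yield every proper subset of `row` (a frozenset) as a frozenset."""
    items = tuple(row)
    for size in range(len(items)):  # sizes 0 .. len(row) - 1
        for combo in combinations(items, size):
            yield frozenset(combo)


def maximal_rows(maps):
    """Return the maps that are not strict subsets of any other map."""
    # Immutable, hashable representation of each row; duplicates collapse.
    rows = {frozenset(m.items()) for m in maps}
    k = max((len(row) for row in rows), default=0)

    # Counting sort by row length: bucket i holds the rows of length i.
    buckets = [[] for _ in range(k + 1)]
    for row in rows:
        buckets[len(row)].append(row)

    result = set()
    for bucket in buckets:  # shortest rows first
        for row in bucket:
            # Up to 2^k lookups, each hashing a set of at most k entries.
            for subset in proper_subsets(row):
                result.discard(subset)
            result.add(row)
    return [dict(row) for row in result]
```

On the question's seven-row example this keeps the five expected rows, in no particular order because the result is built from a set.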

The time complexity of the counting sort is O(n + k). Each power set has size at most 2^k, and each member of the power set has length at most k so that each HashSet operation is O(k) time. So the time complexity of the rest of the algorithm is O(2^k·kn), and this dominates the counting sort.

So the overall time complexity is O(n) if we treat k as a constant. If not, then this algorithm will still be asymptotically better than the naive O(n²·k) algorithm* when k < log₂ n.

*Note that the naive algorithm is O(n²·k) and not O(n²), because each comparison between two rows takes O(k) time.
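
The k < log₂ n threshold can be checked by comparing the two bounds directly, ignoring constant factors:

```latex
\[
2^k \cdot k \cdot n \;<\; n^2 \cdot k
\iff 2^k < n
\iff k < \log_2 n
\]
```

For instance, with n = 1024 rows the power-set approach is asymptotically better only for rows of fewer than 10 entries.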

edited Sep 07 '21 at 10:14

answered Sep 07 '21 at 00:58


Technically, the maps are treated as multisets. – Stef Sep 07 '21 at 07:12
And the distinction does matter if you make the assumption k << n (for a multiset, would k be the number of distinct elements or the total number of elements? ie, the length or the sum?) – Stef Sep 07 '21 at 07:27
@Stef I don't follow - how can a map be like `{a: 1, a: 1}`? I've never seen such a map and the question doesn't suggest the input could be like this. – kaya3 Sep 07 '21 at 09:16
What? I have no idea what you're talking about in your last comment? – Stef Sep 07 '21 at 09:21
My comment *"the maps are treated as multisets"* was in reaction to *"Let's simplify the problem by representing the input rows as sets instead of maps, because essentially you're only treating those maps as sets of entries. "*. Actually the maps are not treated as sets, but as multisets. For instance, `{'a': 2, 'b': 1}` is the multiset that contains twice 'a' and once 'b'. Multisets are not very different from sets, and in particular they also have an "is subset" relation. – Stef Sep 07 '21 at 09:23
@Stef The maps are treated as sets like `{a2, b1}`, i.e. sets of pairs, sets of map entries. Note how in the OP's example, `{'a': 1, 'b': 1, 'c': 1}` is not "covered" by `{'a': 2, 'b': 1, 'c': 2}` according to the expected output. – kaya3 Sep 07 '21 at 09:29
Oh. Ooooooh. I had completely misunderstood the problem. – Stef Sep 07 '21 at 09:31
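
The two readings discussed in this thread can be contrasted with a short Python sketch. The helper names `covers_as_entries` and `covers_as_multiset` are hypothetical, and the multiset reading assumes the values are counts.

```python
def covers_as_entries(big, small):
    """Set-of-entries reading: every (key, value) pair of `small` is in `big`."""
    return small.items() <= big.items()


def covers_as_multiset(big, small):
    """Multiset reading: each key occurs in `big` at least as often as in `small`."""
    return all(count <= big.get(key, 0) for key, count in small.items())


small = {'a': 1, 'b': 1, 'c': 1}
big = {'a': 2, 'b': 1, 'c': 2}

print(covers_as_entries(big, small))   # False: ('a', 1) is not an entry of big
print(covers_as_multiset(big, small))  # True: 1 <= 2, 1 <= 1, 1 <= 2
```

Only the first reading matches the expected output in the question.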

גלעד ברקן · Answer 2 · 2021-09-08T15:21:08.170

Assume each element in the list is unique. (An element is the object with key-value pairs.) For each unique key-value pair, store the set of list elements that contain it. Iterate over the elements in order of increasing size. For each element, go through its key-value pairs, look up the set of elements that contain each pair, and intersect that set with the running intersection. If the intersection size drops below 2 (the intersection always contains at least one element, namely the one we're investigating), exit early. Depending on the data, we could possibly use bitsets for those sets (each bit would represent the index of the map element in the sorted list), which could perform intersections with parallel comparisons. Also depending on the data, the intersections can reduce the search space significantly.

Python code:

import collections

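def f(lst):
  # Map each (key, value) pair to the indices of the elements containing it.
  pairs_to_elements = collections.defaultdict(set)

  for i, element in enumerate(lst):
    for k, v in element.items():
      pairs_to_elements[(k, v)].add(i)

  lst_sorted_by_size = sorted(lst, key=len)

  result = []

  # Assumes every element is unique and has at least one pair.
  for element in lst_sorted_by_size:
    pairs = list(element.items())
    # Indices of the elements that contain every pair seen so far.
    intersection = pairs_to_elements[pairs[0]]

    for i in range(1, len(pairs)):
      if len(intersection) < 2:
        break
      intersection = intersection.intersection(pairs_to_elements[pairs[i]])

    # Fewer than 2 means only the element itself holds all of its pairs.
    # Checking after the loop also handles elements with a single pair.
    if len(intersection) < 2:
      result.append(element)

  return result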

Output:

lst = [
  {'a': 1, 'b': 1},
  {'a': 2, 'b': 1},
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

for element in f(lst):
  print(element)

"""
{'a': 2, 'b': 2}
{'a': 1, 'b': 1, 'c': 1}
{'a': 1, 'b': 1, 'c': 2}
{'a': 2, 'b': 1, 'c': 1}
{'a': 2, 'b': 1, 'c': 2}
"""

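The bitset variant mentioned in this answer could be sketched as follows, using Python integers as arbitrary-length bitsets. The function name `maximal_maps_bitset` is made up for illustration, and the sketch keeps the answer's assumptions that all elements are distinct and non-empty. Bit i stands for the element at index i of the input list, so no sorting is needed here.

```python
from collections import defaultdict


def maximal_maps_bitset(maps):
    """Keep the maps whose full set of pairs occurs in no other map."""
    # (key, value) pair -> bitmask of the indices of the maps containing it.
    masks = defaultdict(int)
    for i, m in enumerate(maps):
        for pair in m.items():
            masks[pair] |= 1 << i

    result = []
    for i, m in enumerate(maps):
        own_bit = 1 << i
        common = -1  # all bits set; narrowed by each pair's mask
        for pair in m.items():
            common &= masks[pair]
            if common == own_bit:  # only this map is left: exit early
                break
        if common == own_bit:
            result.append(m)
    return result
```

Each `&` intersects the candidate sets for all indices at once, which is the parallel comparison the answer alludes to. The output follows the input order.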