
I have a vector of vectors, each representing a set (in the mathematical sense). For example:

{{1, 3}, {4, 9, 14}, {1, 3}, {1, 4, 8, 9, 10, 14, 16}, {1, 3, 9}, {4, 9, 17, 22}}

I want to write the most efficient C++ function possible for filtering the vector (in place, if possible) so as to remove every item that contains another.

For example, here:

  • {1, 3} is contained by {1, 3} and {1, 3, 9}
  • {4, 9, 14} is contained by {1, 4, 8, 9, 10, 14, 16}

The resulting vector would then be:

{{1, 3}, {4, 9, 14}, {4, 9, 17, 22}}

As I'm just beginning with C++, I don't really have any clue how to do this efficiently. In other answers here I found the erase / remove idiom, which doesn't seem very appropriate here, except by passing a closure as the predicate, which doesn't seem really idiomatic in C++.

Please note that keeping the original ordering doesn't matter, nor does the ordering of values inside each set.

Pierre
  • 2
    If you keep each vector in sorted order, then you should be able to do that fairly efficiently. – Kerrek SB Sep 24 '13 at 15:25
  • 1
    {1, 3, 9} or {1, 3, 18}? – CS Pei Sep 24 '13 at 15:25
  • 1
    I'm thinking you could do what Kerrek has said: sort the vectors, and then use something like std::unique with an appropriate comparison function. –  Sep 24 '13 at 15:27
  • 2
    Sort each vector, and then sort the whole collection lexicographically. Once that's done, you can traverse and only need to look forward to see if there's proper containment. – Kerrek SB Sep 24 '13 at 15:28
  • In your real application do you store integers inside the sets or is it some other class? – David Grayson Sep 24 '13 at 15:35
  • I'll try filtering with `std::unique` first, thanks. However, about @KerrekSB idea (which I had, at first), I don't see any simple / idiomatic way to filter the vector that way while traversing it. Nor did I find any answer pointing in that direction here. Any clues? Thanks. – Pierre Sep 24 '13 at 15:35
  • @DavidGrayson: directly inside the vectors. – Pierre Sep 24 '13 at 15:36
  • @KerrekSB Still O(n^2), isn't it? – Bernhard Barker Sep 24 '13 at 15:43
  • 1
    "Contains" is a partial ordering. Perform topological sort, then discard everything but the leaves. Topological sort is linear in the number of vertices plus the number of edges; in your case, the number of edges could be quadratic (e.g. `{{1}, {1, 2}, {1, 2, 3} ...}` ), so the algorithm is quadratic worst case, but could be better if the graph is more sparse. – Igor Tandetnik Sep 24 '13 at 16:05
  • 1
    @IgorTandetnik Wouldn't you have to do O(n^2) work just to determine the edges? – Bernhard Barker Sep 24 '13 at 16:06
  • Igor is right, a simple "lexicographical" sort won't work because {4, 9, 14} would be after {1, 4, 8, 9, 10, 14, 16} even though {1, 4, 8, 9, 10, 14, 16} contains {4, 9, 14} –  Sep 24 '13 at 16:24
  • I'm trying (hard) to do this by lexicographically sorting each vector of same size. This *should* work. – Pierre Sep 24 '13 at 16:28
  • I don't think it will work. Consider: {1, 3}, {1, 6}, {1, 3, 9} –  Sep 24 '13 at 16:35
  • @KevinCadieux : Yes. This is already sorted according to what I said. `{1, 3, 9}` will be removed, and that's what I want. The base assumption is : if a vector `V1` contains another vector `V2`, then `|V1| >= |V2|`, and `|V1| = |V2| <=> V1 = V2`. Which IMO makes the solution quite simple. – Pierre Sep 24 '13 at 16:41
  • Yes but it won't work (only) with std::unique in that case because {1,3} and {1,3,9} are not adjacent. For a given set of numbers, you will have to check all the sets that have more elements. –  Sep 24 '13 at 16:45
  • @KevinCadieux : that's true. But I don't really see any better (and simple) solution. Still better than nothing... – Pierre Sep 24 '13 at 16:48
  • Why not use std::set instead of vector and the corresponding set algorithms set_union, set_intersection, set_difference? – Mike Makuch Sep 24 '13 at 20:48
  • @koodawg This has to work with legacy code using vectors. Wouldn't `std::set` imply some overhead, given that the only operation I need on these is well covered by `std::includes`, which takes advantage of the fact that these vectors are guaranteed to be sorted by nature? – Pierre Sep 24 '13 at 20:53

1 Answer


Given what I learnt so far, thanks to your very helpful comments, the solution I came up with is:

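#include <algorithm>  // std::includes
#include <cstddef>    // std::size_t
#include <iterator>   // std::next
#include <list>
#include <vector>

// One set, stored as a sorted vector.
typedef std::vector<std::size_t> colset;

// Orders sets by size, smallest first.
bool less_colsets(const colset& a, const colset& b) {
  return a.size() < b.size();
}

void sort_colsets(std::list<colset>& l) {
  l.sort(less_colsets);
}

// Removes every set that contains another set of the list.
void strip_subsets(std::list<colset>& l) {
  sort_colsets(l);
  for (std::list<colset>::iterator i = l.begin(); i != l.end(); ++i) {
    std::list<colset>::iterator j = std::next(i);
    while (j != l.end()) {
      // *j is at least as large as *i, so only *j can contain *i.
      if (std::includes(j->begin(), j->end(), i->begin(), i->end())) {
        j = l.erase(j);
      }
      else {
        ++j;
      }
    }
  }
}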

Note that I replaced the outermost std::vector with std::list, which supports constant-time removal of an element at any position.
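If the outer container has to stay a std::vector (the question asked for in-place filtering), the same idea can be sketched without std::list. This is only an illustration, not part of the solution above: `strip_supersets` is a hypothetical name, the code needs C++11, and it assumes every inner vector is already sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<std::size_t> colset;

// Hypothetical helper: removes, in place, every set that contains another
// set of the collection. Assumes each inner vector is sorted ascending,
// which std::includes requires.
void strip_supersets(std::vector<colset>& sets) {
  // Smallest sets first: a set can only contain sets of equal or smaller size.
  std::stable_sort(sets.begin(), sets.end(),
                   [](const colset& a, const colset& b) {
                     return a.size() < b.size();
                   });

  std::size_t kept = 0;  // sets[0, kept) are the survivors so far
  for (std::size_t i = 0; i < sets.size(); ++i) {
    assert(std::is_sorted(sets[i].begin(), sets[i].end()));
    // It is enough to test against survivors: a removed set itself
    // contains a survivor, so anything containing it does too.
    bool contains_kept = false;
    for (std::size_t k = 0; k < kept && !contains_kept; ++k) {
      contains_kept = std::includes(sets[i].begin(), sets[i].end(),
                                    sets[k].begin(), sets[k].end());
    }
    if (!contains_kept) {
      if (kept != i) sets[kept] = std::move(sets[i]);  // compact towards the front
      ++kept;
    }
  }
  sets.erase(sets.begin() + kept, sets.end());  // drop the leftover tail
}
```

Survivors are compacted towards the front and the tail is erased once at the end, so no element is ever removed from the middle of the vector. The worst case is still quadratic in the number of sets.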

This seems to work as expected, though I'd need some more tests to confirm it. The next step will be to use a more efficient comparison function than includes, one that takes into account the fact that each vector is sorted (which the program guarantees). I'll try this tomorrow.

Edit: Looks like std::includes already takes care of this fact. YAY!
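As a small illustration of that point, here is a sketch of a subset test built on std::includes. The name `is_subset` is made up for this example, and the asserts only spell out the precondition that both ranges are sorted in ascending order (`std::is_sorted` needs C++11).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when every element of `small` also appears in `big`.
// Both vectors must be sorted ascending; std::includes relies on that.
bool is_subset(const std::vector<std::size_t>& small,
               const std::vector<std::size_t>& big) {
  assert(std::is_sorted(small.begin(), small.end()));
  assert(std::is_sorted(big.begin(), big.end()));
  // Note the argument order: the containing range comes first.
  return std::includes(big.begin(), big.end(), small.begin(), small.end());
}
```

With the sets from the question, `is_subset({4, 9, 14}, {1, 4, 8, 9, 10, 14, 16})` holds, while `is_subset({1, 3}, {4, 9, 14})` does not.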

Thanks everybody.

Pierre
  • `std::includes` requires that both ranges be ordered, and is using this property. Its complexity is linear in the total length of two ranges. – Igor Tandetnik Sep 24 '13 at 18:46
  • I didn't see that property... Seems even better than I thought then. Great! – Pierre Sep 24 '13 at 20:25