
An important part of automated theorem proving is cutting down redundancy by figuring out when one clause subsumes another.

Intuitively, a clause (first-order logic formula in CNF) C subsumes another clause D when it is at least as general. The specific definition is that there must be a substitution of variables to terms that turns C into a sub-multiset of D. (Not subset; that would let a clause subsume its own factors, which would break completeness of some of the best saturation calculi such as superposition.)
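To see why the multiset condition matters, here is a small self-contained sketch (my own illustration, representing literals as strings and using `collections.Counter` as a multiset): applying the substitution {y → x} to p(x) | p(y) yields the multiset {p(x), p(x)}, which is not a sub-multiset of the factor {p(x)}, so under this definition a clause does not subsume its own factor.

```python
from collections import Counter

# Hypothetical string representation of literals; Counter stands in for a multiset.
clause = Counter(["p(x)", "p(y)"])
factor = Counter(["p(x)"])            # the factor p(x) of the clause above

# Apply the substitution {y -> x} to every literal of the clause.
substituted = Counter(lit.replace("y", "x") for lit in clause.elements())
print(substituted)                    # Counter({'p(x)': 2})

# Sub-multiset test: every literal must occur at least as often in the target.
is_submultiset = all(substituted[lit] <= factor[lit] for lit in substituted)
print(is_submultiset)                 # False: {p(x), p(x)} is not contained in {p(x)}
```

With a plain subset test instead, the two occurrences of p(x) would collapse into one and the clause would wrongly subsume its factor.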

There are indexing techniques that greatly reduce the number of subsumption attempts that must be made, but even so, subsumption can consume a great deal of CPU time, so it is important to optimize it. It is known to be NP-complete in the general case, but it is still both possible and necessary to make the cases that arise in practice run fast.

The following pseudocode is correct but inefficient. (In practice, negative and positive literals have to be handled separately, and then there are issues like trying equations oriented both ways, but here I am just considering the core algorithm for matching two bags of literals.)

def match_clauses(c, d, map):
    # c, d: lists of literals standing in for multisets; map: the substitution built so far
    if not c:
        return True
    for i, ci in enumerate(c):
        for j, dj in enumerate(d):
            trial = dict(map)  # copy, so a failed branch cannot pollute the substitution
            if match_terms(ci, dj, trial):
                if match_clauses(c[:i] + c[i+1:], d[:j] + d[j+1:], trial):
                    map.update(trial)
                    return True
    return False

Why is it inefficient? Consider the two clauses p(x) | q(y) | r(42) and p(x) | q(y) | r(54). The above algorithm will first successfully match p(x), then successfully match q(y), then notice r(42) does not match r(54). Okay, but then it will try it the other way around: first successfully match q(y), then successfully match p(x), then again notice r(42) does not match r(54). If there are N literals that do match, the wasted work will be N factorial, a crippling inefficiency in some practical cases.
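To make the blowup concrete, here is a runnable instrumented sketch (my own, with a minimal `match_terms` for unary literals that, as an assumed convention, treats the argument names x and y as variables) that counts matching attempts on the two example clauses:

```python
calls = [0]          # counts match_terms invocations
VARS = {"x", "y"}    # assumed convention: these argument names are variables

def match_terms(ci, di, map):
    # Minimal matcher for literals represented as (predicate, argument) pairs.
    calls[0] += 1
    (pc, ac), (pd, ad) = ci, di
    if pc != pd:
        return False
    if ac in VARS:                      # variable: bind it, or check an existing binding
        if ac in map:
            return map[ac] == ad
        map[ac] = ad
        return True
    return ac == ad                     # constant: must match exactly

def match_clauses(c, d, map):
    # Same structure as the pseudocode above, with lists standing in for multisets.
    if not c:
        return True
    for i, ci in enumerate(c):
        for j, dj in enumerate(d):
            trial = dict(map)           # failed branches must not pollute the substitution
            if match_terms(ci, dj, trial) and \
               match_clauses(c[:i] + c[i+1:], d[:j] + d[j+1:], trial):
                return True
    return False

c = [("p", "x"), ("q", "y"), ("r", "42")]
d = [("p", "x"), ("q", "y"), ("r", "54")]
result = match_clauses(c, d, {})
print(result, calls[0])  # False 19: both orderings of p and q were explored
```

Of the 19 attempts, a whole branch (matching q first, then p, then failing on r again) repeats work the first branch already proved futile, and with N interchangeable literals before the mismatch the number of such branches grows factorially.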

I could doubtless figure out a better algorithm given enough time, but other people must've done this before me, so it seems worth asking: What is the best known algorithm for this?

rwallace
  • I think that CS may be a better place to ask for "best known algorithm". As a practical suggestion, though, I would suggest a search for subsumption that can be paused and then returned to, and then have your theorem proving algorithm have breakpoints where you consider whether it has been running long enough that it is worth trying to analyze further for subsumption and start over if you find what you think should be a faster search. So you do little analysis on easy problems, and a lot on hard ones. – btilly Jan 04 '19 at 21:35
  • Subsumption is indeed the most time-consuming part of resolution-type loops, since the number of subsumption attempts can grow as fast as the number of pairs of clauses in the working set. It seems that term indexing techniques might help make individual subsumption attempts more efficient; I remember reading about it in the _Handbook of Automated Reasoning_, but I am afraid I am unable to answer from memory without looking it up again. – Dima Chubarov Jan 06 '19 at 07:59
  • @DmitriChubarov I don't see any mention of indexing in the handbook, only of subsumption of individual clauses (and that only at the level of sets, not multisets). Are you sure you didn't read it somewhere else? My current understanding is that term indexing only reduces the number of individual subsumption attempts, not the time taken for each, though I will freely grant that my understanding could be incomplete. – rwallace Jan 07 '19 at 18:46

0 Answers