For an application I'm considering, there would be a large (100,000+) 'database' of trees (think expressions in a programming language, or S-expressions), and I would need to query that database for expressions that match a specific given expression.

Before giving the details of what I'd like to have, note that I'd appreciate any information related to indexing a large set of trees for optimizing lookup by a subtree.

In my specific situation (which would be for a backend to be used by Metamath proof assistants), expressions have the following structure (in Haskell-like notation):

data Expression = Placeholder Id | VarName Id | ConstName Id [Expression]

or as a BNF for an S-expression form:

Expression = '?' Id | Id | '(' Id Expression* ')'

where Id is some kind of identifier.

For example, I could have a database with expressions like

(equiv ?ph ?ps)
(not (in (appl (sqrt) (2)) (Q)))
(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))

In this context, two expressions match if they can be made equal by substitution of expressions for placeholders. So looking up (equiv (eq A (emptyset)) ?ph) in the above mini-database would result in the first and last expressions.
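
To make the matching rule concrete, here is a minimal Python sketch that reads "match" as unifiability, with the placeholders of the query and of the database entry kept in separate namespaces. That reading is an assumption, and all names (`parse`, `unify`, `matches`) are made up for illustration. It is a naive linear scan, not an index.

```python
import re

def parse(text, tag):
    """Parse one S-expression; placeholders carry `tag` so the two sides never share names."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0
    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            head = tokens[pos]
            pos += 1
            args = []
            while tokens[pos] != ")":
                args.append(read())
            pos += 1  # skip the closing ")"
            return ("const", head, tuple(args))
        if tok.startswith("?"):
            return ("?", tag, tok[1:])
        return ("var", tok)
    return read()

def walk(t, subst):
    # Follow placeholder bindings until reaching an unbound placeholder or a non-placeholder.
    while t[0] == "?" and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    # True if placeholder v appears inside t (prevents binding ?x to a term containing ?x).
    t = walk(t, subst)
    if t == v:
        return True
    return t[0] == "const" and any(occurs(v, a, subst) for a in t[2])

def unify(a, b, subst):
    """Return an extended substitution making a and b equal, or None if there is none."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if a[0] == "?":
        return None if occurs(a, b, subst) else {**subst, a: b}
    if b[0] == "?":
        return unify(b, a, subst)
    if a[0] == "const" and b[0] == "const" and a[1] == b[1] and len(a[2]) == len(b[2]):
        for x, y in zip(a[2], b[2]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

def matches(query, entry):
    return unify(parse(query, "query"), parse(entry, "entry"), {}) is not None

database = [
    "(equiv ?ph ?ps)",
    "(not (in (appl (sqrt) (2)) (Q)))",
    "(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))",
]
hits = [e for e in database if matches("(equiv (eq A (emptyset)) ?ph)", e)]
```

Under these assumptions `hits` contains the first and last database entries. An index would only need to narrow down the candidates; a check like `matches` can then confirm each one.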

So again: how would I implement fast lookups in a large set of (expression) trees with placeholders? What kind of index data structure could I use?

  • I couldn't find a tag like 'reference-request', but of course I also appreciate pointers to literature. (Like most things in CS, this probably was already done in the 70s. :-) – MarnixKlooster ReinstateMonica Aug 11 '17 at 05:07
  • Could you explain a little more why the first and third expressions match? Thanks. – Primusa Feb 15 '19 at 00:18
  • @Primusa In the first, replace `?ph` by `(eq A (emptyset))` and `?ps` by `(forall ?x (equiv (in ?x A) (in ?x (emptyset))))`. In the third, (consistently) replace `?A` by `A` and `?B` by `(emptyset)` (and `?x` by itself). This makes both equal to `(equiv (eq A (emptyset)) (forall ?x (equiv (in ?x A) (in ?x (emptyset)))))`. – MarnixKlooster ReinstateMonica Feb 15 '19 at 06:02
  • @Primusa I'm sorry, that previous comment explained how the first and third match _each other_. – MarnixKlooster ReinstateMonica Feb 15 '19 at 06:11
  • @Primusa The first matches the query by (e.g.) replacing `?ph` in that first expression by `(eq A (emptyset))` (and `?ps` by itself) and replacing `?ph` in the query by `?ps`, making both equal to `(equiv (eq A (emptyset)) ?ps)`. The third matches the query by replacing `?A` by `A` and `?B` by `(emptyset)` (and `?x` by itself) in that expression, and the query's `?ph` by `(forall ?x (equiv (in ?x A) (in ?x (emptyset))))`, which makes both equal to `(equiv (eq A (emptyset)) (forall ?x (equiv (in ?x A) (in ?x (emptyset)))))`. – MarnixKlooster ReinstateMonica Feb 15 '19 at 06:11
  • Is there any confusion regarding my answer? You posted a bounty because this question wasn't receiving enough attention, but I haven't received any feedback on my answer in comments, votes, the bounty itself, or otherwise. All I can assume is that it was not helpful. – Dillon Davis Feb 22 '19 at 08:47
  • @DillonDavis My sincere apologies. For some reason I missed this comment, and the bounty. Tomorrow it is coming your way. – MarnixKlooster ReinstateMonica Nov 18 '20 at 13:06
  • @MarnixKloosterReinstateMonica no worries- it happens to the best of us – Dillon Davis Nov 21 '20 at 18:03

1 Answer

I would implement the lookup with a trie. Each key would consist of one of the following:

  • ConstName Identifier
  • Variable w/ context info
  • ConstValue
  • Placeholder

These should be ordered in some fashion: possibly Placeholder first, then all ConstNames (alphabetical), then variables (scope ordering, then argument order), then ConstValues (numerical order). As long as there's a concrete ordering for use in the trie, you're fine.

Traverse the expression's tree, injecting the appropriate keys into the trie as they are encountered. Do this for all the expressions you want to insert into your data structure. When it comes time to query it, you can traverse the trie in a similar fashion, but with a few new rules.

  • Everything matches a placeholder node. If it matches some other key as well, then you'll need to explore both branches (easily done via a recursive DFS-like approach).
  • A placeholder matches everything. This is not equivalent to the previous point: here we are talking about placeholders in the query, whereas the previous bullet concerns placeholders as trie keys.

Now, this does mean that the search space can "explode" somewhat as you encounter placeholders, but there is one thing you can do to mitigate this in practice: traverse the expression's tree breadth-first, both when constructing the trie and when querying it. Then, if one of the arguments is a placeholder, you don't have to search the full depth of every subtree that matches the expression so far. Instead you jump ahead to the next argument, which may not be a placeholder and will thus greatly prune the search space (compared to matching "everything").

For completeness' sake, let's take one of your examples

(not (in (appl (sqrt) (2)) (Q)))

and make a trie entry from that:

not -> in -> appl -> "Q" -> sqrt -> 2

Adding (not (in ?ph E)) to this would result in:

not -> in -> appl -> "Q" -> sqrt -> 2
         \-> ?ph  -> "E"

Continue injecting expressions into the trie in this fashion. To query, traverse the trie in the same order until each search path ends, and return the entries that matched.
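
The key order in the example can be reproduced with a short Python sketch. The helper names (`parse`, `bfs_labels`, `dfs_labels`) are hypothetical, and the depth-first variant is included only for contrast.

```python
import re
from collections import deque

def parse(text):
    """Parse one S-expression into (label, children); atoms and placeholders have no children."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0
    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return (tok, [])
        head = tokens[pos]
        pos += 1
        kids = []
        while tokens[pos] != ")":
            kids.append(read())
        pos += 1  # skip the closing ")"
        return (head, kids)
    return read()

def bfs_labels(term):
    # Breadth-first: all arguments of a node are emitted before any of their own arguments.
    out, queue = [], deque([term])
    while queue:
        label, kids = queue.popleft()
        out.append(label)
        queue.extend(kids)
    return out

def dfs_labels(term):
    # Depth-first (pre-order): each argument's whole subtree is emitted before the next argument.
    label, kids = term
    return [label] + [x for kid in kids for x in dfs_labels(kid)]

first = parse("(not (in (appl (sqrt) (2)) (Q)))")
second = parse("(not (in ?ph E))")

bfs_first = bfs_labels(first)    # not, in, appl, Q, sqrt, 2
bfs_second = bfs_labels(second)  # not, in, ?ph, E
dfs_first = dfs_labels(first)    # not, in, appl, sqrt, 2, Q
```

In breadth-first order `Q` directly follows `appl`, so a placeholder in the first argument position is followed immediately by the next argument rather than by the contents of a whole subtree. The two entries share the prefix `not -> in` and branch after it.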

Note: the uniqueness of these entries relies on the assumption that you do not have to support variadic functions. If you do, attach some context info to each key (the next paragraphs describe how) to distinguish which arguments belong to which functions.

There is one detail I glossed over: variables. If you only want a match when the variable names are exactly the same, then no extra work is necessary. But this likely isn't what you want; you probably want generic variables to match as long as they are "consistent" with each other. The way to do this is to assign each variable an identifier that represents the scope in which it was first defined.

The easiest way to do this is to compose an identifier from the concatenation of the argument positions of its ancestors. That is, if a variable is first defined as the second argument to a function which is the fifth argument to the root function, then we might label it as (5, 2) or (2, 5), whichever makes more sense intuitively. Either way, this ensures the variable is given a consistent identifier regardless of other variables / functions elsewhere. Then proceed as normal with this new variable name.
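
A small Python sketch of this labelling, under some assumptions: positions are 1-based and listed root first (so the example above would be `(5, 2)`), "first defined" is taken to mean the first occurrence in a left-to-right, depth-first walk, and the names `parse` and `label_variables` are invented for the example.

```python
import re

def parse(text):
    """Parse one S-expression into ('?', id), ('var', id) or ('const', id, children)."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0
    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            head = tokens[pos]
            pos += 1
            args = []
            while tokens[pos] != ")":
                args.append(read())
            pos += 1  # skip the closing ")"
            return ("const", head, tuple(args))
        return ("?", tok[1:]) if tok.startswith("?") else ("var", tok)
    return read()

def label_variables(term):
    """Replace each variable name by the argument path of its first occurrence."""
    first_seen = {}  # variable name -> path of argument positions from the root
    def go(t, path):
        if t[0] == "var":
            # setdefault keeps the path of the first occurrence for later occurrences.
            return ("var", first_seen.setdefault(t[1], path))
        if t[0] == "const":
            return ("const", t[1],
                    tuple(go(kid, path + (i,)) for i, kid in enumerate(t[2], start=1)))
        return t  # placeholders are left untouched
    return go(term, ())

a = label_variables(parse("(forall x (in x A))"))
b = label_variables(parse("(forall y (in y B))"))
c = label_variables(parse("(forall x (in y A))"))
```

Here `a` and `b` come out identical, since both use their variables in the same pattern, while `c` differs because the variable inside `in` is not the bound one. The relabelled trees can then be inserted into the trie like any other expression.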

marc_s
Dillon Davis