23

I have a large (>1000) set of directed acyclic graphs, each with a large (>1000) set of vertices. The vertices are labeled and the labels' cardinality is small (< 30).

I want to identify (mine) substructures that appear frequently over the whole set of graphs.

  1. A substructure is a graph of at least two directly connected vertices with specific labels. Such a substructure may appear once or more in one or more of the given input graphs. For example "a [vertex labeled A with two directly connected children labeled B] appears twice in graph U and once in graph V".
  2. A substructure we are looking for must obey a set of pre-given rules which filter on the vertices' labels. As an example: A substructure that contains a vertex labeled A is interesting if the sub-graph is "a vertex labeled A that has at least one directly connected child labeled B and is not a directly connected sibling of a vertex labeled U or V". Substructures that do not conform to these rules may appear in the input graphs but are not of interest for the search.

The output we are looking for is a list of substructures and their (number of) appearances in the given graphs.
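
For illustration, one way a rule like the one in (2) could be encoded is as a predicate over a vertex and its direct neighbourhood; the accessors `labels`, `children` and `siblings` below are made up for the sketch:

# Hypothetical encoding of the example rule: "a vertex labeled A with at least
# one directly connected child labeled B that is not a directly connected
# sibling of a vertex labeled U or V". The data structures are assumptions.
def rule_a(vertex, labels, children, siblings):
    return (labels[vertex] == 'A'
            and any(labels[c] == 'B' for c in children[vertex])
            and not any(labels[s] in ('U', 'V') for s in siblings[vertex]))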

I have tried to look into things and (as it seems to always happen with me) the problem is NP-complete. As far as I can see, gSpan is the most common algorithm to solve this problem. However, as stated above, I'm not looking for any common substructure in the graphs but only for those that obey certain rules. One should be able to use that to reduce the search space.

Any insight on how to approach this problem?

Update: I should probably add that the aforementioned rules can be recursive up to a certain degree. For example "a vertex labeled A with at least two children labeled B, each having at least one child labeled A". The maximum recursion depth is somewhere between 1 and 10.

Update II: Pointing out that we are not searching for known or preferred substructures but mining them. There is no ~~spoon~~ needle.

user2722968
  • Is there any limit on the number of possible labels? Is there a specific minimum occurrence count for a "frequent" substructure? Also, if you're looking to solve this problem for an arbitrary set of pre-given rules (as seems to be the case), then you can't really rely on them to substantially reduce the search space, at least asymptotically – etov Dec 27 '16 at 08:33
  • There are less than 30 labels; a "frequent" substructure appears at least twice. – user2722968 Dec 27 '16 at 09:49
  • Can a vertex have more than one edge to like-labelled vertices? – greybeard Dec 27 '16 at 10:30
  • [Context Sensitive Language/Grammar](http://cs.stackexchange.com/questions/tagged/context-sensitive)? – greybeard Dec 27 '16 at 11:02
  • A vertex can have more than one edge to like-labelled vertices. Short gSpan reference: https://www.cs.ucsb.edu/~xyan/papers/gSpan-short.pdf – user2722968 Dec 27 '16 at 11:37
  • Bartsch-Spörl, Brigitte: "Grammatical inference of graph grammars for syntactic pattern recognition", Lecture Notes in Computer Science , 153: 1-7, 1983 – greybeard Dec 28 '16 at 09:25
  • @greybeard yes, they are connected if to be considered – user2722968 Dec 28 '16 at 23:12
  • Is this the only constraint on substructures? The example substructure is much more specific. Does child refer to a direct child in the example? – Stefan Haustein Dec 28 '16 at 23:54
  • As far as I can comprehend the problem statement, each DAG of the input can be processed independently (each connected component, even). > 1k vertices, < 30 labels averages >30 vertices/label. Analyse _filter_, collect label histogram, process _most constraining first_ - handwaving. I should better read up on gSpan. – greybeard Dec 29 '16 at 01:18
  • @StefanHaustein We only need to consider directly connected vertices. That is, a rule "Vertex labeled A with child labeled B" refers to "A-B" as in "F-O-O-A-B-A-R" but not to "A-...-B" as in "F-O-O-A-R-A-B" – user2722968 Dec 29 '16 at 08:17
  • One issue I can see is that your 'filter' constraints can be empty, in which case you'd still have to solve the NP-hard problem. – softwarenewbie7331 Jan 03 '17 at 05:42
  • Is there an upper bound on the number of vertices? Is there an upper bound on the number of graphs? – Richard Feb 07 '17 at 21:56
  • consider as an upper bound 10^4 for both – user2722968 Feb 07 '17 at 22:56
  • @user2722968 but "F-O-O-A-B-A-R" would be cyclic (as A-B-A) and you stated the graphs are acyclic, no? Also, do you only have exclusion conditions on the parent of a node, or could exclusion also concern a child node? So, is "a vertex A with child B but not child V" a possible condition? – j-i-l Feb 11 '17 at 00:18
  • While the DAG itself is always acyclic and all nodes are unique, labels can appear multiple times (so A-B-A is possible). Exclusions do concern the entire neighbourhood of a node, that is parents, siblings and children and their respective neighbourhood. – user2722968 Feb 11 '17 at 10:54
  • @user2722968 Did you understand my answer? I'm trying to determine if something is missing. My algorithm runs without the need for a filter of any kind. One simply indexes all the permutations that one is mining, and, as described, you may want to recursively find more complex structures to reduce the number you index at each recursive level. – Mouna Apperson Feb 12 '17 at 00:56

4 Answers

7

I'm assuming in my answer that you are trying to minimize running time without spending an excessive amount of time writing the code to do it. One thing I struggled with at first when learning to write highly efficient algorithms is that multiple passes can sometimes be far more efficient than a single pass. In this case, I would say you fundamentally want two passes:

First, create a filter that allows you to ignore most (if not all) non-duplicated patterns. In order to do this:

  1. Allocate two bit arrays (and consider cache sizes when doing this). The first will be a simple bloom filter. The second will be a duplicate bloom filter.
  2. As you traverse the structure on a first pass, for each indexable structure, compute a hash value. Select the appropriate bit in your first bloom filter and set it. If that bit was already set, also set the corresponding bit in the duplicate bloom filter.
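
As a minimal sketch of this first pass in Python, assuming the candidate substructures have already been enumerated and encoded as hashable canonical signatures (the enumeration and encoding are left out, and the names below are made up):

import hashlib

def bit_index(signature, num_bits):
    # Map a hashable substructure signature to a bit position.
    digest = hashlib.blake2b(repr(signature).encode(), digest_size=8).digest()
    return int.from_bytes(digest, 'big') % num_bits

def first_pass(signatures, num_bits=1 << 19):
    # Build the 'present' and 'duplicate' bit arrays over a signature stream.
    present = bytearray(num_bits // 8)
    duplicate = bytearray(num_bits // 8)
    for sig in signatures:
        byte, bit = divmod(bit_index(sig, num_bits), 8)
        if (present[byte] >> bit) & 1:        # something hashed here before
            duplicate[byte] |= 1 << bit       # flag as a potential duplicate
        present[byte] |= 1 << bit
    return present, duplicate

# Hypothetical usage: ('A', ('B', 'B')) stands for "an A with two children labeled B".
present, duplicate = first_pass([('A', ('B', 'B')), ('A', ('B',)), ('A', ('B', 'B'))])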

On your second pass, you will need to do the "heavier" process of actually confirming matches. In order to accomplish this:

  1. Scan over the graph again and record any structures that match the duplicate bloom filter generated in the first pass.
  2. Place those that match in a hash table (ideally using different bits of the computed hash)
  3. When a duplicate is detected, store that information wherever you'd like to collect it (a sketch of this pass follows below).
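
A sketch of that second pass, reusing the same hashing as above and assuming the graphs can be re-enumerated as (signature, graph id) pairs (all names here are made up):

import hashlib
from collections import defaultdict

def bit_index(signature, num_bits):
    # Same hashing as in the first-pass sketch.
    digest = hashlib.blake2b(repr(signature).encode(), digest_size=8).digest()
    return int.from_bytes(digest, 'big') % num_bits

def second_pass(occurrences, duplicate, num_bits=1 << 19):
    # 'occurrences' yields (signature, graph_id) pairs; 'duplicate' is the bit
    # array produced by the first pass.
    counts = defaultdict(lambda: defaultdict(int))   # signature -> graph -> count
    for sig, graph_id in occurrences:
        byte, bit = divmod(bit_index(sig, num_bits), 8)
        if (duplicate[byte] >> bit) & 1:             # survives the Bloom filter
            counts[sig][graph_id] += 1               # exact confirmation in a hash table
    # Keep only substructures that really appear at least twice overall.
    return {sig: dict(per_graph) for sig, per_graph in counts.items()
            if sum(per_graph.values()) >= 2}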

This algorithm will run very quickly on large datasets because it will significantly reduce the pressure on the appropriate cache level. There are also several enhancements that you can make in order to make it perform better in different circumstances.

  1. In order to improve performance on a multithreaded system, it is actually safe to parallelize the first step. To do this, give each thread (or computer in a cluster) a piece of the graph. Each should compute its own copy of the two blooms, and the blooms may then be combined into a final bloom. The reduction function is just (present, duplicate) = (present1 OR present2, duplicate1 OR duplicate2 OR (present1 AND present2)); a sketch of this reduction follows the list. This step is VERY fast.
  2. It is also completely safe to parallelize the second step, but it must be modified slightly. You again take the duplicate bloom filter from the first step and use it as a filter in the second step, the same as before, but you can't complete the final comparison as easily. You must instead place the potential duplicates in hash buckets. Then, after each shard of data has been written into its own hash table of potential duplicates, divide the data up by hash bucket and find the duplicates in a third step. Each hash bucket (from any output of the second step) must be processed by the same worker.
  3. In cases where you are indexing a large number of structures, you may improve performance by applying the above algorithm recursively: use each set of matches output by the algorithm as the input to the recursive pass. For example, if you index only structures with up to 5 items in the first run, you can, when you recurse, take each set of duplicated subgraphs and run the algorithm on only those sub-graphs. This would only be necessary with very large sets of data, obviously.
  4. Another adjustment you may consider, in order to increase the effectiveness of your bloom filters when the graph is very large, is to iterate on the algorithm. In the first pass, for example, you might only consider sub-graphs that have the first label as the base of the sub-graph. This would decrease the required size of your bloom filters and/or allow you to filter out more sub-graphs on the first pass.
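
The reduction in item 1 could look like this, as a sketch over the bytearray representation used in the earlier sketches:

def merge_blooms(present1, duplicate1, present2, duplicate2):
    # Combine two workers' filters: a bit is a duplicate if either worker saw a
    # duplicate there, or both workers saw that bit as present.
    present = bytearray(p1 | p2 for p1, p2 in zip(present1, present2))
    duplicate = bytearray(d1 | d2 | (p1 & p2)
                          for p1, d1, p2, d2 in zip(present1, duplicate1,
                                                    present2, duplicate2))
    return present, duplicate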

A couple notes for tweaking the above:

  1. Consider cache sizes. For example, on an Intel Haswell chip, each core has 32K of L1 cache and 256K of L2 cache. Each cache line contains 512 bits, so if you fill up 1% of your bloom filter, most of the cache lines will be touched. Depending on how fast the other parts of the algorithm are, and given that other things use these caches, you can safely create a bloom filter of up to around 512 * 1024 bits (64K per filter, 128K for both filters, which is roughly the share of L2 you get on a hyperthreaded system) and still keep most of the data set in L2 cache and the really active stuff in L1. For smaller datasets, keep this number down because there is no point in making it large. If features are only falsely flagged as potential duplicates less than 1% of the time, that's totally fine.
  2. Parallelizing this is, again, only really useful in cases where you have tons of data, and I'm assuming that you might. If you do parallelize, you should consider the geometry: placing partial sets of data on each computer works with this algorithm, and you can then run each iteration (in variation #4) on each computer. With huge datasets, that avoids having to transfer all the data to all the computers.

Anyway, to sum up with a statement on run-time complexity, I will say that it really depends. Many people ignore the fact that increasing the working set of data causes memory accesses to no longer be equal in cost. Fundamentally, the above algorithm, while highly performant if tuned appropriately, will run very fast on a small dataset, but it really shines with much larger datasets because it allows high-efficiency ways of keeping the working set of data in whatever cache level is appropriate (L1, L2, L3, RAM, local disk, local network, etc.). The complexity of the algorithm will depend on the data, but I do not believe an algorithm much faster can be created. I did leave out how you represent the subgraphs, and there is work to be done there to achieve the optimal algorithm, but without knowing more, I can't determine the best way to store that information.

The reason an algorithm can't run much faster than the one I've presented is that the first pass requires much less work than the second, because it doesn't require branching and bitwise operations are cheap; we can therefore say it adds little to the overall work we're doing. The second stage is also about as efficient as possible. You must (barring a way to perfectly describe each possibility with a finite set of bits, which I'll explain in a second) actually compare each graph feature and write the information somewhere. The only variable is how much work it is to check whether you need to do this, and checking a bit whose error rate you can arbitrarily scale towards 0% is as good as you can get.

For smallish datasets, the reason that two passes benefit you is that you can fit a much larger bloom cardinality into a smaller amount of memory. For really small sets of data, you might as well just use the second step and ignore the first, but then, at a minimum, you'll need to store a pointer for each hash target. This means that you will need to write 32 or 64 times as much data for the same level of filtering. For small enough datasets this doesn't matter, because a read is a read and a write is a write, but for larger datasets this approach lets you accomplish the same level of filtering while staying in a given level of cache. In cases where you must work across multiple computers or threads, the mechanism provided in this algorithm will be WAY more efficient, as the data can be combined much faster and much more information about potential matches can be exchanged.

Now, lastly, as I alluded to, you may be able to do slightly better if the number of features you check for on each iteration is reduced further. For example, if you only check for 32 possible labels and the number of children with a particular label in each pass (and this is bounded to 1024), you could represent this perfectly with 15 bits. If you limited the count to 255, you could store this information perfectly with 32K. To pull this off in your case, you'd need to use the iteration, recursion and sharding strategies that I mentioned above, and you'd then also need to track the source graph and some other information. I honestly doubt this would work well except in very limited situations, but I'm including it for completeness.
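
As a small sketch of the bit-packing idea in the example above (5 bits for one of 32 labels, 10 bits for a count below 1024; the function names are made up):

def pack(label_index, count):
    # 5 bits of label (0..31) followed by 10 bits of count (0..1023).
    assert 0 <= label_index < 32 and 0 <= count < 1024
    return (label_index << 10) | count

def unpack(key):
    return key >> 10, key & 0x3FF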

Anyway, this is my first answer on Stack Overflow, so don't be too hard on me. I hope this was helpful!

Mouna Apperson
2

The way I read your question, you may want something like the code below. It finds all matching subgraphs in a DAG in linear time. It doesn't support filters, but you can check the results after they are found and filter them manually. It may also find graphs with some parts collapsed: say you are looking for a tree a((b|c)|(c|d)), then it might find a subgraph where the c node is shared between the two subtrees. Again, you can inspect the output and filter out results like that. Doing such an inspection is of course only possible if the output size is not too large; for that you will have to do some experiments on your own graphs.

import random
from collections import namedtuple, defaultdict
Node = namedtuple('Node', ['label', 'children', 'id'])

# Simple tree pattern A(B|AB)
pattern = Node('A', (Node('B', (), 1),
                     Node('A', (Node('B', (), 3),), 2)), 0)

# Generate a random DAG
labels = 'ABCDE'
dag = []
for _ in range(1000):
    label = random.choice(labels)
    children = tuple(random.sample(dag, min(len(dag)//2, 10)))
    dag.append(Node(label, children, len(dag)))

# Helper
def subtrees(pattern):
    yield pattern
    for sub in pattern.children:
        yield from subtrees(sub)

colors = defaultdict(list)
# Iterate the nodes in topologically sorted order
for node in dag:
    # Check if the node is the head of some sub-pattern
    for sub in subtrees(pattern):
        if node.label == sub.label \
                and all(any(sc.id in colors[nc.id]
                    for nc in node.children) for sc in sub.children):
            # If so, color the node with that sub-pattern's color
            colors[node.id].append(sub.id)

matches = [node for node in dag if pattern.id in colors[node.id]]
print('Found {} matches.'.format(len(matches)))

I believe this is very similar to the approach Stefan Haustein had in mind.

Thomas Ahle
0

Edit: Here is what I'd start from:

  1. Build an index from each of the 30x30 possible parent/child label combinations to the corresponding nodes
  2. Intersect the matches for a given substructure (see the sketch below)
  3. Check further conditions manually
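
A minimal sketch of steps 1 and 2, assuming nodes are given as (id, label, child ids) triples (the representation and names are made up):

from collections import defaultdict

def build_pair_index(nodes):
    # Map each (parent label, child label) pair to the parent nodes where it occurs.
    nodes = list(nodes)
    label_of = {nid: label for nid, label, _ in nodes}
    index = defaultdict(set)
    for nid, label, children in nodes:
        for cid in children:
            index[(label, label_of[cid])].add(nid)
    return index

def candidates(index, parent_label, child_labels):
    # "A with children B and C" is the intersection of the ('A','B') and ('A','C')
    # matches; remaining rule conditions are then checked manually on these candidates.
    sets = [index[(parent_label, c)] for c in child_labels]
    return set.intersection(*sets) if sets else set()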

(Original post):

  1. Find a way to build hash keys for substructures
  2. Build a hash map from substructures to the corresponding nodes
  3. Find candidates using the hash map, check the detailed conditions manually
Stefan Haustein
  • (1. looks _hard_, 2. memory intensive.) – greybeard Dec 28 '16 at 18:48
  • @greybeard For the example, 1. could just be a string hash for "abb". But what I have overlooked is that you'd also need to insert "a", "ab" and all other combinations. So unless the edge count is limited for each node, this might explode. Re 2) A hash map with a small multiple of about one million entries seems manageable, though? – Stefan Haustein Dec 28 '16 at 23:46
0

Your question:

You have - a set of graphs and a set of rules (let's call each rule a substructure pattern).

You want - a count of the occurrences of each substructure in the set of graphs.


Since the graphs are DAGs, the substructure search will not get caught in a cycle.

The simple solution pseudocode is:

for each graph G {                           //Sub-problem 4
    for each node N {                        //Sub-problem 3
        for each substructure pattern P {    //Sub-problem 2
            if N's structure is like P {     //Sub-problem 1
                PatternCountMap.Get(G).Get(P)++;
            }
        }
    }
}

At each place I have marked the sub-problem that needs to be handled.

If you don't know Map-Reduce, my solution won't be entirely clear to you; let me know if that's the case. In general, Map-Reduce code can always be run as an ordinary sequential program, except that it will take longer on large data.


Sub-Problem 1

This problem can be simply written as:

Given a 'Root' node and a pattern P, does the subgraph rooted at this node follow the given pattern?

This problem is solvable. Simply travel down the graph starting from the 'root' and see if the pattern is being followed. If it is, increase its count in the PatternCountMap; otherwise move on to the next pattern and see if the 'root' follows that one.
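
A naive sketch of that check in Python, assuming a pattern is a nested tuple (label, (sub-pattern, ...)) and the graph is given as label and children lookups (all names here are made up; recursion depth equals the pattern depth, which the question bounds at about 10):

def follows(node, pattern, label_of, children_of):
    # Does the subgraph rooted at 'node' follow 'pattern'? A pattern child is
    # satisfied if some child of the node satisfies it (children may be shared
    # between pattern branches, as in the question's examples).
    want_label, want_children = pattern
    if label_of[node] != want_label:
        return False
    return all(any(follows(child, sub, label_of, children_of)
                   for child in children_of[node])
               for sub in want_children)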

The PatternCountMap is a HashMap<Graph, HashMap<Pattern, Integer>>, which maps each graph to another HashMap that maps patterns to their frequency. So, if P is found in graphs G1 and G2, 12 and 17 times respectively, then PatternCountMap.Get(G1).Get(P) will be 12 and PatternCountMap.Get(G2).Get(P) will be 17 at the end of this algorithm's run.

Useful Hint: Since you do not want to recurse too deep, use iterative solutions. If you have to perform DFS, perform iterative DFS using a stack. The iterative DFS algorithm is pretty easy.


Sub-problem 2

Here we are just looping over each pattern (or rule). No magic here. For each rule we see if node N of graph G follows the rule.

Useful Hint: Preprocess the rules. For example, if one rule is followed, see what other rules can definitely not be followed to skip them. Or, if following one pattern means that another one can be followed too, see if the second rule can be shrunk because of the checking already done as part of the first one.


Sub-problem 3 & 4

These two are simple loops again, like Sub-problem 2. But there is one idea that can be applied here, and that is Map-Reduce (though Map-Reduce[1] does not 100% qualify for these problems).

You have numerous nodes from numerous different graphs. As long as you can identify the graph to which the node belongs, if a particular node follows a certain pattern, you can emit <N_G, P>, which means that Node N in Graph G follows the pattern (aka rule) P.

The map output can be collected in the reducers which can populate the PatternCountMap with the values. Much of that is handled by the Map-Reduce framework itself so a lot of things will be taken care of automatically for you.
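
A plain-Python stand-in for these map and reduce steps might look like the following sketch (no real Map-Reduce framework is used; 'matches' is the Sub-problem 1 check and is assumed to exist, and patterns are assumed to be hashable):

from collections import defaultdict

def map_phase(graphs, patterns, matches):
    # Emit (graph_id, pattern) for every node that follows a pattern.
    for graph_id, nodes in graphs.items():
        for node in nodes:
            for pattern in patterns:
                if matches(node, pattern):
                    yield graph_id, pattern

def reduce_phase(emitted):
    # Collect the emitted pairs into PatternCountMap: graph_id -> pattern -> count.
    pattern_count_map = defaultdict(lambda: defaultdict(int))
    for graph_id, pattern in emitted:
        pattern_count_map[graph_id][pattern] += 1
    return pattern_count_map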


After you have the PatternCountMap created, you have the count of each useful pattern in each graph and that is what you wanted.


[1] Map-Reduce is meant for problems that can be solved on commodity hardware. If the rules you are mining are complex, then commodity hardware may not be what you want to run your algorithm on.

displayName