I have a large (>1000) set of directed acyclic graphs, each with a large (>1000) set of vertices. The vertices are labeled, and the set of distinct labels is small (< 30).
I want to identify (mine) substructures that appear frequently over the whole set of graphs.
- A substructure is a graph of at least two directly connected vertices with specific labels. Such a substructure may appear once or more in one or more of the given input graphs. For example "a [vertex labeled A with two directly connected children labeled B] appears twice in graph U and once in graph V".
- A substructure we are looking for must obey a set of pre-given rules which filter on the vertices' labels. As an example: A substructure that contains a vertex labeled A is interesting if the sub-graph is "a vertex labeled A that has at least one directly connected child labeled B and is not a directly connected sibling of a vertex labeled U or V". Substructures that do not conform to these rules may appear in the input graphs but are not of interest for the search.
The output we are looking for is a list of substructures and their (number of) appearances in the given graphs.
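To make the expected output concrete, here is a minimal Python sketch. The graph representation (a dict from vertex id to a label and a list of child ids), the names `graphs`, `star_patterns` and `mine`, and the toy data are all made up for illustration. It only counts the simplest kind of substructure, a parent together with all of its direct children, so it shows the shape of the result rather than a real mining algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical representation: vertex id -> (label, list of child vertex ids).
graphs = {
    "U": {
        1: ("A", [2, 3]),
        2: ("B", []),
        3: ("B", [4]),
        4: ("A", [5, 6]),
        5: ("B", []),
        6: ("B", []),
    },
    "V": {
        1: ("A", [2, 3]),
        2: ("B", []),
        3: ("B", []),
    },
}

def star_patterns(graph):
    """Count depth-1 patterns: a parent label plus the sorted labels of all its children."""
    counts = Counter()
    for label, child_ids in graph.values():
        if child_ids:  # a substructure needs at least two connected vertices
            child_labels = tuple(sorted(graph[c][0] for c in child_ids))
            counts[(label, child_labels)] += 1
    return counts

def mine(graphs):
    """Map each pattern to its number of appearances per input graph."""
    result = defaultdict(dict)
    for name, graph in graphs.items():
        for pattern, n in star_patterns(graph).items():
            result[pattern][name] = n
    return dict(result)

report = mine(graphs)
for pattern, per_graph in report.items():
    print(pattern, per_graph)
```

For the two toy graphs, the pattern "A with two children labeled B" is reported twice for U and once for V, which matches the example in the first bullet above.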
I have looked into this and (as always seems to happen with me) the problem is NP-complete. As far as I can see, gSpan is the most common algorithm for this kind of problem. However, as stated above, I'm not looking for just any common substructure in the graphs, only for those that obey certain rules. One should be able to use that to reduce the search space.
Any insight on how to approach this problem?
Update: I should probably add that the aforementioned rules can be recursive up to a certain depth. For example "a vertex labeled A with at least two children labeled B, each having at least one child labeled A". The maximum recursion depth is somewhere between 1 and 10.
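A recursive rule like the one above could be written down roughly as follows. This is only a sketch under my own assumptions: the `Rule` class, the `satisfies` function and the graph layout (vertex id mapped to a label and a list of child ids) are hypothetical, and I read "each having" as "at least two B children that themselves have an A child", not as a condition on every B child.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    label: str                  # label the vertex must carry
    min_count: int = 1          # how many children of the parent must satisfy this rule
    children: list = field(default_factory=list)  # nested rules for the vertex's children

def satisfies(graph, vertex, rule):
    """Check whether `vertex` matches `rule`, descending into children as needed."""
    label, child_ids = graph[vertex]
    if label != rule.label:
        return False
    for child_rule in rule.children:
        hits = sum(1 for c in child_ids if satisfies(graph, c, child_rule))
        if hits < child_rule.min_count:
            return False
    return True

# "A with at least two children labeled B, each having at least one child labeled A"
rule = Rule("A", children=[Rule("B", min_count=2, children=[Rule("A", min_count=1)])])

# Hypothetical toy graphs: vertex id -> (label, list of child vertex ids).
matching = {1: ("A", [2, 3]), 2: ("B", [4]), 3: ("B", [5]), 4: ("A", []), 5: ("A", [])}
not_matching = {1: ("A", [2, 3]), 2: ("B", [4]), 3: ("B", []), 4: ("A", [])}

print(satisfies(matching, 1, rule))
print(satisfies(not_matching, 1, rule))
```

The recursion depth is bounded by the nesting depth of the rule, so it stays within the 1 to 10 range mentioned above regardless of how deep the graph is.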
Update II: Pointing out that we are not searching for known or preferred substructures but mining them. There is no ~~spoon~~ needle.