12

I have a list of rules in the form

L1 -> (A, B, C)

L2 -> (D, E),

L3 -> (F, G, A),

L4 -> (C, A)

.....

This list contains ~30k such rules.

I have an input in the form (X, Y, Z)

I need a method

List<Rule> matchRules(input)

which belongs to a class RuleMatcher and returns every rule whose right-hand side matches the input.

I started with a simple, naive solution in order to get the framework down and get something working.

public RuleMatcher(Collection<Rule> rules) {
   this.rules = rules;
}

public Collection<Rule> matchRules(List<Token> input) {
   List<Rule> matchingRules = new ArrayList<>();
   for(Rule r: this.rules) {
        if(r.matches(input)) {
            matchingRules.add(r);
        }
   }
   return matchingRules; 
}

Here matches is a very simple function that checks that the lengths are the same and then compares the tokens one by one in a for loop.

This matchRules function is called on the order of billions of times.


Obviously this is a very poor implementation. According to my profiler, at least half of the execution time is spent in this matches function.

I was thinking of a few possible solutions:

A. Some sort of Trie data structure holding the chains of rules which could be matched.

B. Some sort of hash function. Each symbol is given a unique identifier. Unfortunately, there are about 8 thousand unique symbols, so this might be difficult.

C. Make a hashmap conditioning on the size of the right hand side, the number of tokens in the rule. Unfortunately, the majority of the rules are about the same size, so this may not even be worthwhile.

D. Some awesome solution one of you comes up with.
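As a rough illustration of option A, a trie over token sequences might look like the sketch below. The names `RuleTrie`, `add`, and `match` are hypothetical, and the sketch assumes the token type implements `equals` and `hashCode` consistently. Each lookup walks one node per input token instead of scanning all ~30k rules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trie keyed on token sequences; T is the token type, R the rule type.
// Assumes T implements equals/hashCode consistently.
class RuleTrie<T, R> {
    private static final class Node<T, R> {
        final Map<T, Node<T, R>> children = new HashMap<>();
        final List<R> rules = new ArrayList<>(); // rules whose right-hand side ends at this node
    }

    private final Node<T, R> root = new Node<>();

    // Insert one rule under its right-hand side token sequence.
    void add(List<T> rightSide, R rule) {
        Node<T, R> node = root;
        for (T token : rightSide) {
            node = node.children.computeIfAbsent(token, k -> new Node<>());
        }
        node.rules.add(rule);
    }

    // Return all rules whose right-hand side equals the input exactly.
    List<R> match(List<T> input) {
        Node<T, R> node = root;
        for (T token : input) {
            node = node.children.get(token);
            if (node == null) {
                return Collections.emptyList(); // no rule starts with this prefix
            }
        }
        return Collections.unmodifiableList(node.rules);
    }
}
```

The trie would be built once from the rule list, after which `match` costs time proportional to the input length rather than to the number of rules.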

I hope somebody can shed some light on this problem.


Edit: A token is just an object with a unique number. For example "NN" is a token. Each instance of "NN" is exactly the same.

Matches code:

public boolean rhsMatches(List<Token> tokens) {
   if(tokens.size()!=rhsSize()) return false;
   for(int i = 0;i<rhsSize();i++) {
      if(!rightSide.get(i).equals(tokens.get(i))) {
        return false;
      }
   }
   return true;
}

It's not very pretty, but it's simple.

user498001
  • 2
    Could you give us the definition of the tokens. Without knowing what is being matched and how the matching is done it will be difficult to propose an optimization. – Leonard Brünings Jan 16 '14 at 16:58
  • 1
    So you have 30,000 rules (L1, L2, ...) containing sets of 8,000 unique tokens (A, B, ...) correct? Have you considered creating a "reverse lookup table" (can't remember the actual name) where you index which rules the tokens are in? This may take a lot of memory, but speed should increase greatly. – Uxonith Jan 16 '14 at 17:00
  • You can use some other hash (like a checksum) for the keys, not only the length. And, yes, the `matches` code would be helpful. – khachik Jan 16 '14 at 17:01
  • 1
    I'd say your idea of a `TrieSet` would be your best first-hit. Essentially - you need to build a grammar. – OldCurmudgeon Jan 16 '14 at 17:06
  • Well, the most basic optimization would be to skip the length check by pre-sorting the rules, so that you have a separate rule list for each length. – Leonard Brünings Jan 16 '14 at 17:06
  • So that "reverse lookup table" I mentioned earlier is actually called an [inverted index](http://en.wikipedia.org/wiki/Inverted_index). – Uxonith Jan 16 '14 at 17:11
  • @Op Since you are returning a collection of rules: can there be more than one matching rule? If so, isn't it just a duplicate? – Leonard Brünings Jan 16 '14 at 17:18
  • @LeonardBrünings I added the updates as requested – user498001 Jan 16 '14 at 17:33
  • @khachik I added the match code – user498001 Jan 16 '14 at 17:33
  • 1
    I think that `HashMap<List<Token>, Rule>` should work, if you override the hashCode and equals methods on the Token class. A Trie would be far more efficient when it comes to memory usage. – Sami Korhonen Jan 16 '14 at 18:01
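
The hash map idea from the comments can be sketched as follows. This is only an illustration: `HashedRuleIndex` and its methods are hypothetical names, and it assumes the token type overrides `equals` and `hashCode`, as the comment above points out. The map value is a list rather than a single rule, on the assumption that several rules may share the same right-hand side.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index from a right-hand side (token sequence) to the rules that have it.
// Assumes T overrides equals/hashCode, since List.hashCode is computed from its elements.
class HashedRuleIndex<T, R> {
    private final Map<List<T>, List<R>> byRightSide = new HashMap<>();

    // Register a rule under its right-hand side.
    void add(List<T> rightSide, R rule) {
        // Copy the key so later changes to the caller's list cannot corrupt the map.
        byRightSide.computeIfAbsent(new ArrayList<>(rightSide), k -> new ArrayList<>()).add(rule);
    }

    // Return every rule whose right-hand side equals the input; empty if none.
    List<R> match(List<T> input) {
        return byRightSide.getOrDefault(input, Collections.emptyList());
    }
}
```

A lookup hashes the input once and compares it against at most a few candidate keys, so its cost depends on the input length rather than on the number of rules.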

2 Answers

1

Why not sort your rule list to begin with? Then you can binary search for the matching rule.
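
A minimal sketch of that idea, under some assumptions: tokens are `Comparable` (for example by their unique number), right-hand sides are ordered by length first and then token by token, and `SortedRuleIndex` is a hypothetical name. Because several rules may share a right-hand side, the search widens around the first hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sorted index; assumes tokens are Comparable (e.g. by their unique number).
class SortedRuleIndex<T extends Comparable<T>, R> {
    // Orders token sequences by length first, then token by token.
    private final Comparator<List<T>> order = (a, b) -> {
        if (a.size() != b.size()) return Integer.compare(a.size(), b.size());
        for (int i = 0; i < a.size(); i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) return c;
        }
        return 0;
    };
    private final List<List<T>> keys = new ArrayList<>(); // sorted right-hand sides
    private final List<R> rules = new ArrayList<>();      // rules.get(i) belongs to keys.get(i)

    // Build once from a map of rule -> right-hand side.
    SortedRuleIndex(Map<R, List<T>> rightSides) {
        List<Map.Entry<R, List<T>>> entries = new ArrayList<>(rightSides.entrySet());
        entries.sort((x, y) -> order.compare(x.getValue(), y.getValue()));
        for (Map.Entry<R, List<T>> e : entries) {
            keys.add(e.getValue());
            rules.add(e.getKey());
        }
    }

    // Binary search for one equal key, then widen to cover rules sharing that key.
    List<R> match(List<T> input) {
        int hit = Collections.binarySearch(keys, input, order);
        if (hit < 0) return Collections.emptyList();
        int lo = hit, hi = hit;
        while (lo > 0 && order.compare(keys.get(lo - 1), input) == 0) lo--;
        while (hi + 1 < keys.size() && order.compare(keys.get(hi + 1), input) == 0) hi++;
        return new ArrayList<>(rules.subList(lo, hi + 1));
    }
}
```

With ~30k rules, each lookup needs roughly 15 sequence comparisons instead of a full scan.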

ElKamina
  • This would require the `Rule` to implement `Comparable`. That sounds like the easiest and fastest solution to me. Depending on how often items are added to the list, the list should be a sorted collection such as a `SortedMap` (for example `TreeMap`) or a `TreeSet`. – Uxonith Jan 16 '14 at 18:01
0

To me this looks like a perfect scenario for engaging some worker threads. The matching tasks seem independent of each other, so you could divide the list of rules and delegate the matching to workers, if that's possible in your situation.
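
One way this might look, using a parallel stream as the worker pool. This is a sketch only: `ParallelRuleMatcher` is a hypothetical name, and the `matches` predicate stands in for the question's `rhsMatches` check. Since `matchRules` is called billions of times, the per-call overhead of splitting a 30k-element list may outweigh the gain, so it is worth measuring against parallelizing across inputs instead.

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

// Hypothetical helper that splits the rule list across the common fork-join pool.
class ParallelRuleMatcher {
    // 'matches' is a stand-in for the per-rule check (e.g. rule.rhsMatches(input)).
    static <T, R> List<R> matchRules(List<R> rules, List<T> input,
                                     BiPredicate<R, List<T>> matches) {
        return rules.parallelStream()
                .filter(rule -> matches.test(rule, input)) // each rule is checked independently
                .collect(Collectors.toList());             // keeps the original rule order
    }
}
```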

janek