7

I'm looking for a good data structure to represent strings of the form:

Domain:Key1=Value1,Key2=Value2...
  • Each "Domain" can contain the following pattern characters - *, ? (* - 0 or more characters, ? - 0 or 1 character)

  • Each "Key" can contain the following pattern characters - *, ? (* - 0 or more characters, ? - 0 or 1 character)

  • Each "Value" can contain the following pattern characters - *, ? (* - 0 or more characters, ? - 0 or 1 character)

Examples:

JBoss:*
*:*
JBoss:type=ThreadPool,*
JBoss:type=Thread*,*
JB*:name=http1,type=ConnectionPool

If you are familiar with JMX ObjectNames, this is essentially the ObjectName pattern.

I'm looking for ways to easily store a rule corresponding to each pattern and to quickly delete, update, and add rules.

I started out by using a Prefix Trie, but got stuck with the pattern characters *, ?.

Brian Tompsett - 汤莱恩
Raj
  • How do you resolve conflicts on matching rules, is it like a longest prefix match? For instance, if you had the two rules, JBoss:* and * : *, JBoss:key=value would match both. Do you pick the longest match? – kyun Jun 17 '11 at 00:34
  • Hi Kyun, you are right, I would pick the longest match. – Raj Jun 17 '11 at 17:32

3 Answers

1

I think the easiest way to do it would be to build an NFA-like trie, one which allows transitions to multiple states. This, of course, adds the complexity of another data structure that maps a set of characters to match onto multiple states. For instance, with your example:

JBoss:*
*:*
JBoss:type=ThreadPool,*
JBoss:type=Thread*,*
JB*:name=http1,type=ConnectionPool

Let's say you try to match JBoxx:name=*

When you match the substring JBo, you would need a data structure to hold the states JBo, JB* and *, since you have three branches at this point. When the x comes in, you can discard the JBo route since it's a dead end and continue with JB* and *. A simple implementation keeps a set of possible match states at any given time and tries the next character on each of them. You would also need a way to resolve multiple matches (as in this case), perhaps something as simple as a longest match?

It all seems to make sense when you think of the trie as an NFA instead of the well-accepted DFA format. Hope that helps.
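As a rough illustration of the "set of active states" idea, here is a minimal Python sketch. The names `Node`, `PatternTrie`, `add`, `remove`, `match` and `best` are made up for this example, and it makes a few assumptions: the whole pattern is treated as one flat character string (it does not reorder `Key=Value` pairs the way JMX does), `?` means 0 or 1 character as defined in the question, and "longest match" is interpreted as "the pattern with the most non-wildcard characters".

```python
class Node:
    def __init__(self, is_star=False):
        self.children = {}      # literal character -> Node
        self.star = None        # Node reached through a '*' edge
        self.optional = None    # Node reached through a '?' edge
        self.is_star = is_star  # True if this node was entered via '*'
        self.pattern = None     # set when a stored pattern ends here
        self.rule = None        # rule attached to that pattern


class PatternTrie:
    def __init__(self):
        self.root = Node()

    def _walk(self, pattern, create):
        """Follow (and optionally create) the path for a pattern."""
        node = self.root
        for ch in pattern:
            if ch == "*":
                if node.star is None and create:
                    node.star = Node(is_star=True)
                node = node.star
            elif ch == "?":
                if node.optional is None and create:
                    node.optional = Node()
                node = node.optional
            else:
                if ch not in node.children and create:
                    node.children[ch] = Node()
                node = node.children.get(ch)
            if node is None:
                return None
        return node

    def add(self, pattern, rule):
        """Add a rule, or overwrite the rule of an existing pattern."""
        node = self._walk(pattern, create=True)
        node.pattern, node.rule = pattern, rule

    def remove(self, pattern):
        """Detach the rule; empty nodes are left in place for simplicity."""
        node = self._walk(pattern, create=False)
        if node is not None:
            node.pattern = node.rule = None

    @staticmethod
    def _closure(nodes):
        """Add states reachable without consuming input ('*' or '?' = 0 chars)."""
        seen, stack = set(), list(nodes)
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(n for n in (node.star, node.optional) if n is not None)
        return seen

    def match(self, text):
        """Return {pattern: rule} for every stored pattern matching text."""
        active = self._closure([self.root])
        for ch in text:
            nxt = set()
            for node in active:
                if node.is_star:               # '*' absorbs one more character
                    nxt.add(node)
                if ch in node.children:        # literal transition
                    nxt.add(node.children[ch])
                if node.optional is not None:  # '?' consumes exactly this character
                    nxt.add(node.optional)
            active = self._closure(nxt)
            if not active:                     # every branch is a dead end
                break
        return {n.pattern: n.rule for n in active if n.pattern is not None}

    def best(self, text):
        """Assumed tie-break: the pattern with the most literal characters wins."""
        hits = self.match(text)
        if not hits:
            return None
        return hits[max(hits, key=lambda p: sum(c not in "*?" for c in p))]
```

Adding, updating and removing a rule each only walk one path of the trie, so updates stay cheap; the cost moves to lookup, which may have to track several active states at once.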

kyun
  • Kyun, thanks for the suggestions. I believe a similar problem exists in topic routing in messaging frameworks, where messages have to be routed based on similar patterns. Here is a link: http://www.rabbitmq.com/blog/tag/dfa/ – Raj Jun 17 '11 at 18:19
  • Looks very similar! I was about to suggest converting your rules to a DFA once you have the NFA but that'll likely make your rule updates slower? – kyun Jun 17 '11 at 19:29
0

I believe you want to use a rope

Woot4Moo
0

You can take a look at this other thread: Efficient data structure for word lookup with wildcards

Or this site: Wildcard queries

The second site ends with "We can thus handle wildcard queries that contain a single * symbol using two B-trees, the normal B-tree and a reverse B-tree.".
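To give a feel for the two-index idea in that quote, here is a small Python sketch. It uses two sorted lists with `bisect` as a stand-in for the normal and reverse B-trees, and the class name `WildcardIndex` and its `query` method are invented for this illustration. Note that it covers the lookup direction described there (plain words are stored, the query contains a single `*`), which is the reverse of storing patterns as in the question.

```python
import bisect


class WildcardIndex:
    def __init__(self, words):
        self.forward = sorted(words)                    # stands in for the normal B-tree
        self.backward = sorted(w[::-1] for w in words)  # stands in for the reverse B-tree

    @staticmethod
    def _with_prefix(sorted_words, prefix):
        """Collect all entries that start with prefix via a range scan."""
        i = bisect.bisect_left(sorted_words, prefix)
        out = []
        while i < len(sorted_words) and sorted_words[i].startswith(prefix):
            out.append(sorted_words[i])
            i += 1
        return out

    def query(self, pattern):
        """Answer a query containing exactly one '*' (0 or more characters)."""
        if pattern.count("*") != 1:
            raise ValueError("this sketch handles exactly one '*'")
        prefix, _, suffix = pattern.partition("*")
        by_prefix = set(self._with_prefix(self.forward, prefix))
        by_suffix = {w[::-1] for w in self._with_prefix(self.backward, suffix[::-1])}
        # The length check rejects words where prefix and suffix would overlap.
        return sorted(w for w in by_prefix & by_suffix
                      if len(w) >= len(prefix) + len(suffix))
```

For example, `WildcardIndex(["JBoss", "Jetty"]).query("J*y")` narrows by the prefix `J` in the forward list and by the suffix `y` in the reversed list, then intersects the two candidate sets.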

This may be over the top for you, but it may be worth the read.

Good luck

alexbt