
This question is almost the opposite of Efficient data structure for word lookup with wildcards.

Suppose we have a database of URLs:

http://aaa.com/
http://bbb.com/
http://ccc.com/
....

To find whether a URL is on the list, I can do a binary search and get the result in O(log n) time, where n is the size of the list.
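
For the exact-match case, a sorted list plus the standard-library bisect module is all that's needed; a minimal sketch, using the example URLs above:

import bisect

# Sorted list of exact URLs; binary search requires the list to stay in order.
urls = ["http://aaa.com/", "http://bbb.com/", "http://ccc.com/"]

def contains(url):
    # bisect_left finds the insertion point in O(log n) comparisons.
    i = bisect.bisect_left(urls, url)
    return i < len(urls) and urls[i] == url

print(contains("http://bbb.com/"))  # True
print(contains("http://ddd.com/"))  # False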

This structure has served well for many years, but now I'd like to support wildcards in the database entries, like:

http://*aaa.com/*
http://*bbb.com/*
http://*ccc.com/
....

A naive search would then require a full scan, costing O(n) time per lookup.
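
For illustration, that naive scan could be written with the standard-library fnmatch, whose * wildcard matches any sequence of characters; this is the O(n) baseline I'd like to avoid:

from fnmatch import fnmatch

patterns = ["http://*aaa.com/*", "http://*bbb.com/*", "http://*ccc.com/"]

def naive_contains(url):
    # Tries every pattern in turn: O(n) pattern matches per lookup.
    return any(fnmatch(url, p) for p in patterns)

print(naive_contains("http://testaaa.com/dir"))  # True
print(naive_contains("http://ddd.com/"))         # False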

Which data structure could support lookups in less than O(n) time?

– ppaulojr

1 Answer


If all the URLs are known beforehand, then you could just build a finite automaton, which answers queries in O(url length) time.

This finite automaton can be built as a regexp:

http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$

Here's some Python code. After re.compile(), each query is very fast.

import re

# Compile the whole pattern set into one regex: the cost of re.compile()
# is paid once, and each match then runs in time proportional to the URL.
urls = re.compile(r"http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$")

print(urls.match("http://testaaa.com/") is not None)          # True
print(urls.match("http://somethingbbb.com/dir") is not None)  # True
print(urls.match("http://ccc.com/") is not None)              # True
print(urls.match("http://testccc.com/") is not None)          # True
print(urls.match("http://testccc.com/ddd") is not None)       # False
print(urls.match("http://ddd.com/") is not None)              # False
– Ricbit
  • I guess you cannot `re.compile` a very large string :) – ppaulojr Dec 23 '14 at 19:04
  • If the regexp implementation is not up to the task, you can always build the automaton yourself. This will give you better control over how much memory is used (see the sketch below). – Ricbit Dec 23 '14 at 19:12
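
A minimal sketch of that hand-built approach, assuming the patterns keep the shapes from the question (a host suffix like *aaa.com followed by either any path, /*, or exactly /). The SuffixTrie class and its method names are illustrative, not a library API:

class SuffixTrie:
    """Stores host suffixes reversed, so a lookup walks O(len(url)) nodes."""

    def __init__(self):
        self.root = {}

    def add(self, host_suffix, any_path):
        # Insert the suffix back-to-front so shared endings share nodes;
        # any_path records whether the pattern ended in "/*" or just "/".
        node = self.root
        for ch in reversed(host_suffix):
            node = node.setdefault(ch, {})
        node["$"] = any_path

    def match(self, url):
        prefix = "http://"
        if not url.startswith(prefix):
            return False
        host, sep, path = url[len(prefix):].partition("/")
        if not sep:  # all the patterns require a "/" after the host
            return False
        node = self.root
        # Walk the host back-to-front; each terminal marker reached means
        # some pattern's host suffix matches, so test its path rule.
        for ch in reversed(host):
            if ch not in node:
                return False
            node = node[ch]
            if "$" in node and (node["$"] or path == ""):
                return True
        return False

trie = SuffixTrie()
trie.add("aaa.com", True)   # http://*aaa.com/*
trie.add("bbb.com", True)   # http://*bbb.com/*
trie.add("ccc.com", False)  # http://*ccc.com/

print(trie.match("http://testaaa.com/"))     # True
print(trie.match("http://testccc.com/ddd"))  # False
print(trie.match("http://ddd.com/"))         # False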