
This question is almost the opposite of Efficient data structure for word lookup with wildcards.

Suppose we have a database of URLs:

http://aaa.com/
http://bbb.com/
http://ccc.com/
....

To find whether a URL is on the list, I can do a binary search and get the result in O(log n) time, where n is the size of the list.
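
For the exact-match case, a sorted list plus the standard-library bisect module is all that's needed; a minimal sketch, using the example URLs above:

import bisect

# Sorted list of exact URLs; binary search requires the list to stay in order.
urls = ["http://aaa.com/", "http://bbb.com/", "http://ccc.com/"]

def contains(url):
    # bisect_left finds the insertion point in O(log n) comparisons.
    i = bisect.bisect_left(urls, url)
    return i < len(urls) and urls[i] == url

print(contains("http://bbb.com/"))  # True
print(contains("http://ddd.com/"))  # False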

This structure has served well for many years, but now I'd like to support wildcards in the database entries, like:

http://*aaa.com/*
http://*bbb.com/*
http://*ccc.com/
....

A naive search would then require a full scan, costing O(n) time per lookup.
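
For illustration, that naive scan could be written with the standard-library fnmatch, whose * wildcard matches any sequence of characters; this is the O(n) baseline I'd like to avoid:

from fnmatch import fnmatch

patterns = ["http://*aaa.com/*", "http://*bbb.com/*", "http://*ccc.com/"]

def naive_contains(url):
    # Tries every pattern in turn: O(n) pattern matches per lookup.
    return any(fnmatch(url, p) for p in patterns)

print(naive_contains("http://testaaa.com/dir"))  # True
print(naive_contains("http://ddd.com/"))         # False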

Which data structure could support lookups in less than O(n) time?

– ppaulojr

1 Answer


If all the URLs are known beforehand, then you could just build a finite automaton, which answers queries in O(url length) time.

This finite automaton can be built as a regexp:

http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$

Here's some Python code. After re.compile(), each query is very fast.

import re

# Compile the whole pattern set into one regex: the cost of re.compile()
# is paid once, and each match then runs in time proportional to the URL.
urls = re.compile(r"http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$")

print(urls.match("http://testaaa.com/") is not None)          # True
print(urls.match("http://somethingbbb.com/dir") is not None)  # True
print(urls.match("http://ccc.com/") is not None)              # True
print(urls.match("http://testccc.com/") is not None)          # True
print(urls.match("http://testccc.com/ddd") is not None)       # False
print(urls.match("http://ddd.com/") is not None)              # False
– Ricbit
  • I guess you cannot `re.compile` a very large string :) – ppaulojr Dec 23 '14 at 19:04
  • If the regexp implementation is not up to the task, you can always build the automaton yourself. This will give you better control over how much memory is used (see the sketch below). – Ricbit Dec 23 '14 at 19:12
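
A minimal sketch of that hand-built approach, assuming the patterns keep the shapes from the question (a host suffix like *aaa.com followed by either any path, /*, or exactly /). The SuffixTrie class and its method names are illustrative, not a library API:

class SuffixTrie:
    """Stores host suffixes reversed, so a lookup walks O(len(url)) nodes."""

    def __init__(self):
        self.root = {}

    def add(self, host_suffix, any_path):
        # Insert the suffix back-to-front so shared endings share nodes;
        # any_path records whether the pattern ended in "/*" or just "/".
        node = self.root
        for ch in reversed(host_suffix):
            node = node.setdefault(ch, {})
        node["$"] = any_path

    def match(self, url):
        prefix = "http://"
        if not url.startswith(prefix):
            return False
        host, sep, path = url[len(prefix):].partition("/")
        if not sep:  # all the patterns require a "/" after the host
            return False
        node = self.root
        # Walk the host back-to-front; each terminal marker reached means
        # some pattern's host suffix matches, so test its path rule.
        for ch in reversed(host):
            if ch not in node:
                return False
            node = node[ch]
            if "$" in node and (node["$"] or path == ""):
                return True
        return False

trie = SuffixTrie()
trie.add("aaa.com", True)   # http://*aaa.com/*
trie.add("bbb.com", True)   # http://*bbb.com/*
trie.add("ccc.com", False)  # http://*ccc.com/

print(trie.match("http://testaaa.com/"))     # True
print(trie.match("http://testccc.com/ddd"))  # False
print(trie.match("http://ddd.com/"))         # False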