10

I'm developing an application where users enter a regular expression as a filter criterion. However, I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is that if I just use if (expression == ".*"), this can be easily sidestepped by entering something such as .*.*.

Does anyone know of a test that could take a piece of regex and see if it is essentially .* but in a slightly more elaborate form?

My thoughts are:

  1. I could see if the expression is one or more repetitions of .* (i.e. if it matches (\.\*)+; quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other ways of writing a global match (e.g. with $ and ^) that are too numerous to even think of upfront, let alone test.

  2. I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
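To illustrate why idea 1 falls short, here is a minimal sketch in Java (the class and method names are made up for illustration). It applies the `(\.\*)+` check to the raw text of the user's expression and shows that it flags `.*` and `.*.*` but lets other match-anything spellings through.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a purely textual check on the user's expression.
final class NaiveDotStarCheck {
    // One or more literal ".*" tokens, e.g. ".*" or ".*.*".
    private static final Pattern REPEATED_DOT_STAR = Pattern.compile("(\\.\\*)+");

    static boolean isRepeatedDotStar(String userRegex) {
        return REPEATED_DOT_STAR.matcher(userRegex).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {".*", ".*.*", "^.*$", "[\\s\\S]*", "foo.*"};
        for (String candidate : candidates) {
            // Only the first two are flagged; "^.*$" and "[\s\S]*" slip past.
            System.out.println(candidate + " -> " + isRepeatedDotStar(candidate));
        }
    }
}
```

Every new spelling would need another special case, which is the weakness described above.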

Thoughts, anyone?

(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)

jscs
user1056788
  • OK, I think some of the asterisk characters I put in may have been stripped out. The equality test in the first para needs to have one in, as does the alternative text that a sneaky person might use. In any case, I'm sure you get the point... – user1056788 Nov 20 '11 at 20:57
  • Wow, you need a regular expression to test for certain regular expressions, how meta. Be interesting to see answers to this one. See the [quote at the top of that post](http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html): you now have 3 problems! – Jeroen Nov 20 '11 at 21:00
  • Similar to http://stackoverflow.com/questions/2131239/distance-between-regular-expression, but not a dupe I think. – dsolimano Nov 20 '11 at 21:06
  • If you want to do this *properly* and don't mind both (1) some complex, hard-to-grasp (for the faint of heart, anyway) algorithms and (2) some restrictions on advanced features (ones that exist in many modern regex libraries but not in automata theory, such as unrestricted back-references), you can construct a DFA from the regex and minimize that. There are well-known algorithms with reasonable complexity, and they're *correct*. Not some easily-subverted . The only thing they won't catch for you is stuff like `.*|very unlikely string`, though it makes further blacklisting easier. –  Nov 20 '11 at 21:20
  • 2nd seems to be OK, if it matches 30 randomly generated strings (short and long, special chars etc..) then it's a useless regexp. – Karoly Horvath Nov 20 '11 at 21:33
  • @user1056788 Please mark the accepted answer when you get a chance. – Raymond Hettinger Nov 28 '11 at 23:16

3 Answers

8

Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions

You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html

If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
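A rough sketch of that probabilistic test in Java might look like the following. The class name, the sample count, the 40-character length cap and the printable-ASCII alphabet are all arbitrary assumptions, not anything prescribed above. It also assumes the filter uses full-match semantics (`Matcher.matches()`); if the application uses `find()` instead, swap that call, and expect many more patterns to count as broad.

```java
import java.util.Random;
import java.util.regex.Pattern;

// Hypothetical helper: flags a regex as "too broad" if it accepts every random sample.
final class BroadRegexProbe {
    // Builds a random string of printable ASCII (0x20..0x7E), length 0..maxLength.
    static String randomString(Random rng, int maxLength) {
        int length = rng.nextInt(maxLength + 1);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) (0x20 + rng.nextInt(0x5F)));
        }
        return sb.toString();
    }

    // Returns true only if all samples match; a single miss means "probably fine".
    static boolean looksTooBroad(String userRegex, int samples, long seed) {
        Pattern pattern = Pattern.compile(userRegex);
        Random rng = new Random(seed);
        for (int i = 0; i < samples; i++) {
            if (!pattern.matcher(randomString(rng, 40)).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `BroadRegexProbe.looksTooBroad(userInput, 30, System.nanoTime())` could gate whether a filter is accepted or sent for manual review. A pattern such as `.*|very unlikely string` is also flagged, which is the desired outcome since it is just as broad. Because the samples contain no line terminators, `.*` without DOTALL passes every one of them.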

Raymond Hettinger
1

There are many, many ways to write something equivalent to .*. For example, put any class of characters and its complement into a character class or an alternation and it will match anything.
So I think it's not possible to use a regular expression to test whether another regular expression is equivalent to .*.

These are some examples that match the same as .* (they additionally match newline characters):

/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
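Since the application is in Java, the claim can be checked with a short sketch like this one (the class and method names are hypothetical). It runs each pattern as a full match against a few inputs, including one that contains a newline, which is where plain `.*` behaves differently unless DOTALL is enabled.

```java
import java.util.regex.Pattern;

// Hypothetical helper: does the regex fully match every given input?
final class DotStarEquivalents {
    static boolean matchesAll(String regex, String... inputs) {
        Pattern pattern = Pattern.compile(regex);
        for (String input : inputs) {
            if (!pattern.matcher(input).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] inputs = {"", "abc", "a\nb"};
        String[] regexes = {"[\\s\\S]*", "(\\w|\\W)*", "(a|[^a])*", "(a|b|[^ab])*", ".*", "(?s).*"};
        for (String regex : regexes) {
            // All print true except plain ".*", which fails on the input containing "\n".
            System.out.println(regex + " -> " + matchesAll(regex, inputs));
        }
    }
}
```

The alternation forms recurse once per input character in `java.util.regex`, so on very long inputs they may be slow or overflow the stack. That is a further reason to reject them.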

So I assume your idea 2 would be a lot easier to achieve.

stema
0

Thanks everyone,

I did miss the "deciding equivalence" entry on Wikipedia, which was interesting.

My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!

I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.

Now to decide what type of strings to generate in order to run the tests....

Kind regards, Russ.

user1056788
  • Rather than answering it yourself, you should choose the most appropriate answer given. This allows the person answering your question to get credit (i.e. reputation) for it and makes it easier for others to find the solution in the future. – Chris E Nov 21 '11 at 16:58