Java regex very slow (translate nested quantifiers to possessive quantifiers)

Question

I found this regular expression for matching URLs (originally written for JavaScript by Daring Fireball). It works in Java, but in some cases it is extremely slow:

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

According to the topic "Java Regular Expression running very slow", the problem is in this block of code:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

It seems that to solve the problem I need to make these inner quantifiers (which are actually nested) possessive, but I don't know how to do that. Thanks in advance, and sorry for my bad English!

Did you also try initializing a Pattern instead of a String? That will not solve your problem completely, though. – fyr, Jul 16 '11 at 08:51

Stephen C · Answer 1 · 2011-07-16T14:24:34.153

3

You can avoid all of this by using java.net.URL or java.net.URI to parse the urls.

java.net.URI does a better job of parsing than java.net.URL. Try that one.
Once you've parsed the url, you can check each of the components; e.g. check that the hostname can be resolved.
If you want urls that will resolve, you need to distinguish between absolute and non-absolute urls, and check that the "scheme" is one that you can cope with.
You cannot check that a url works (i.e. that it corresponds to a retrievable resource) without actually attempting to open the resource. And even that isn't a definitive test, for a number of possible reasons.
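The checks listed above could be sketched roughly as follows. This is only an illustration, not a complete validator: the class name `UrlCheck`, the method name `isUsableHttpUrl`, and the decision to accept only `http` and `https` with a non-empty host are assumptions made for the example. It does not open any connection.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: the class and method names are illustrative only.
final class UrlCheck {

    // Returns true if the text parses as an absolute http/https URI with a host.
    // The accepted schemes and the host requirement are application-level
    // choices assumed for this sketch; adjust them to your own rules.
    static boolean isUsableHttpUrl(String text) {
        final URI uri;
        try {
            uri = new URI(text);          // syntax check only, no network access
        } catch (URISyntaxException e) {
            return false;                 // not well-formed
        }
        String scheme = uri.getScheme();
        if (scheme == null) {
            return false;                 // relative reference, not an absolute URL
        }
        if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
            return false;                 // a scheme this sketch does not handle
        }
        String host = uri.getHost();
        return host != null && !host.isEmpty();
    }
}
```

With this sketch, a string such as "http://" is rejected without any network access, because no host can be extracted from it.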

edited Jul 16 '11 at 14:24

answered Jul 16 '11 at 09:35

Stephen C


I already tried that, but the only check seems to happen when calling the constructor, and it doesn't throw MalformedURLException even for a simple (in my opinion malformed) URL like "http://", which means it considers that URL valid. Of course, something like "asdfasdf" throws an exception. I didn't find any other possible check in java.net.URL or java.net.URI; am I wrong? And I don't want to open connections to check that the URL is working (what if I have no Internet connection?). – UnableToLoad Jul 16 '11 at 13:40
@Unable: You're mistaking whether things are well-formed with whether they are valid. The URI `http://` is well-formed, but is almost always not what you want — it will fail a validity check, but those are things that _you_ have to create. That's because validity is a much higher level concept: it's determined at the application level, not the spec level. It's _your_ app that says what schemes are required, that a non-empty hostname part is needed, etc. General URIs are much more flexible than that though (they can even lack a hostname completely; useful in some circumstances). – Donal Fellows Jul 16 '11 at 14:30

score 0 · Answer 2 · answered Jul 16 '11 at 08:57

0

You might have a case of catastrophic backtracking: Check that your regex doesn't match the same characters in multiple groups, causing a runaway number of combinations that must be checked.

See this article for an explanation.
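For the block singled out in the question, one possible way to remove the ambiguity is sketched below. It is an approximation of the original pattern, not a drop-in replacement. The runs and the repeated groups use possessive quantifiers (`++`, `*+`), so the engine never re-splits a run of characters in different ways. Because a possessive run cannot give back a trailing character, the "end with" branch of the original is replaced here by trimming trailing punctuation in plain Java after the match. The class name `PossessiveUrlFinder`, the method `findUrls`, and the `CASE_INSENSITIVE` flag are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and the post-match trimming are illustrative only.
final class PossessiveUrlFinder {

    // Possessive run of non-space, non-()<> characters.
    private static final String RUN = "[^\\s()<>]++";

    // Balanced parens, up to 2 levels, with possessive inner quantifiers.
    private static final String PARENS =
        "\\((?:" + RUN + "|\\(" + RUN + "\\))*+\\)";

    // Same three prefix alternatives as in the question.
    private static final String PREFIX =
        "[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])"
        + "|www\\d{0,3}[.]"
        + "|[a-z0-9.\\-]+[.][a-z]{2,4}/";

    // Group 1: prefix. Group 2: everything after it, consumed possessively.
    private static final Pattern URL = Pattern.compile(
        "\\b(" + PREFIX + ")((?:" + RUN + "|" + PARENS + ")++)",
        Pattern.CASE_INSENSITIVE);   // flag assumed, as in the JavaScript original

    // Punctuation that should not end a URL (same set as the question's last
    // character class, written with Unicode escapes for the quote characters).
    private static final String TRAILING_PUNCT =
        "`!-[]{};:'\".,?\u00AB\u00BB\u201C\u201D\u2018\u2019";

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String tail = m.group(2);
            int end = tail.length();
            // Drop trailing punctuation instead of relying on regex backtracking.
            while (end > 0 && TRAILING_PUNCT.indexOf(tail.charAt(end - 1)) >= 0) {
                end--;
            }
            if (end > 0) {
                urls.add(m.group(1) + tail.substring(0, end));
            }
        }
        return urls;
    }
}
```

The trade-off is that edge cases can differ from the original expression (for example, a candidate whose tail consists only of punctuation is discarded as a whole), so it is worth testing against your own inputs before adopting something like this.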


Bohemian


That article is very interesting, and yes, you're right, that's the problem. But the block of code is actually too hard for me to fix: I understand that the regex tries to match as many [^\\s()<>] as it can in too many combinations (backtracking exponentially), and that's fine, but how do I solve it? – UnableToLoad Jul 16 '11 at 09:11
