1

I've found this regular expression for matching URLs (originally written in JavaScript, from Daring Fireball). It works in Java, but in some cases it is extremely slow:

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash')
  ")" +
")";

From the question "Java Regular Expression running very slow" I learned that the problem is in this block of code:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

It seems that to solve the problem I need to make these inner (nested) quantifiers possessive, but I don't know how to do that. Thanks in advance, and sorry for my bad English!

UnableToLoad

2 Answers

3

You can avoid all of this by using java.net.URL or java.net.URI to parse the urls.


  1. java.net.URI does a better job of parsing than java.net.URL. Try that one.

  2. Once you've parsed the url, you can check each of the components; e.g. check that the hostname can be resolved.

  3. If you want urls that will resolve, you need to distinguish between absolute and non-absolute urls, and check that the "scheme" is one that you can cope with.

  4. You cannot check that a url works (i.e. that it corresponds to a retrievable resource) without actually attempting to open the resource. And even that isn't a definitive test, for a number of possible reasons.
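As a rough illustration of points 1 to 3, here is a minimal sketch that parses with java.net.URI and then applies an application-level rule. The class name `UrlCheckSketch`, the method `isAcceptable`, and the rule itself (absolute, http or https, non-empty host) are assumptions made up for this example; your application may need different rules. It does not try to resolve the hostname or open a connection.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlCheckSketch {
    /**
     * Hypothetical application-level rule (an assumption, not part of any spec):
     * accept only absolute http/https URIs that have a non-empty host.
     */
    static boolean isAcceptable(String text) {
        final URI uri;
        try {
            uri = new URI(text);              // syntax check only
        } catch (URISyntaxException e) {
            return false;                     // not even well-formed
        }
        if (!uri.isAbsolute()) {
            return false;                     // no scheme, e.g. "asdfasdf"
        }
        String scheme = uri.getScheme();
        if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
            return false;                     // a scheme this sketch does not handle
        }
        String host = uri.getHost();          // null when no host could be parsed
        return host != null && !host.isEmpty();
    }
}
```

With this rule, `isAcceptable("https://www.example.com/a?b=c")` is accepted, while `"http://"`, `"asdfasdf"` and `"mailto:someone@example.com"` are rejected, whether the parser itself throws on them or not.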

Stephen C
  • I already tried it, but the only check it seems to do happens when calling the constructor, and it doesn't throw MalformedURLException even for a simple (in my opinion malformed) URL like "http://" (meaning it considers the URL valid). Of course something like "asdfasdf" throws an exception... I didn't find any other possible check in java.net.URL or java.net.URI; am I wrong? And I don't want to open connections to check that the URL works (what if I have no internet connection?) – UnableToLoad Jul 16 '11 at 13:40
  • @Unable: You're mistaking whether things are well-formed with whether they are valid. The URI `http://` is well-formed, but is almost always not what you want — it will fail a validity check, but those are things that _you_ have to create. That's because validity is a much higher level concept: it's determined at the application level, not the spec level. It's _your_ app that says what schemes are required, that a non-empty hostname part is needed, etc. General URIs are much more flexible than that though (they can even lack a hostname completely; useful in some circumstances). – Donal Fellows Jul 16 '11 at 14:30
0

You might have a case of catastrophic backtracking: Check that your regex doesn't match the same characters in multiple groups, causing a runaway number of combinations that must be checked.

See this article for an explanation.
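One possible way to apply that advice to the pattern in the question is sketched below. It is an untested-in-production sketch, not a drop-in replacement, and the names `UrlPatternSketch`, `CH`, `PARENS`, `URL` and `firstUrl` are made up for the example. Two changes are assumed: inside the balanced parentheses the quantifiers become possessive (`++`, `*+`), which is safe there because a run must be followed by `(` or `)`, and neither can be part of a run; and the outer loop consumes one character per iteration instead of a whole run, so a run can be split in only one way. The outer run is deliberately not made possessive, because the final "end with" alternative needs to take back a character from it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlPatternSketch {
    // One character that is neither whitespace nor one of ( ) < >.
    private static final String CH = "[^\\s()<>]";

    // Balanced parens, up to 2 levels. Possessive quantifiers never give
    // characters back, so a missing ')' fails once instead of being retried
    // for every possible way of splitting the run.
    private static final String PARENS =
        "\\((?:" + CH + "++|\\(" + CH + "++\\))*+\\)";

    static final Pattern URL = Pattern.compile(
        "\\b(" +                                   // Capture 1: entire matched URL
          "(?:" +
            "[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])" +  // protocol, then slashes or one char
            "|www\\d{0,3}[.]" +                    // "www.", "www1." ... "www999."
            "|[a-z0-9.\\-]+[.][a-z]{2,4}/" +       // looks like a domain plus slash
          ")" +
          // One or more items, each a single character or a paren group.
          // No quantifier is nested directly inside this loop any more.
          "(?:" + CH + "|" + PARENS + ")+" +
          // End with a paren group or a character that is not punctuation
          // (the non-ASCII quotes are written as unicode escapes).
          "(?:" + PARENS +
            "|[^\\s`!\\-()\\[\\]{};:'\".,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019])" +
        ")");

    /** Returns the first URL-like match in text, or null if there is none. */
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

For example, `firstUrl("see http://example.com/foo_(bar). now")` returns `http://example.com/foo_(bar)`. Inputs such as `http://` followed by a long row of dots, or an unclosed `(` followed by a long run, fail or fall back to a shorter match after a roughly linear number of steps. One caveat: Java's engine may recurse once per iteration of a group that contains alternatives, so an extremely long run could still end in a StackOverflowError.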

Bohemian
  • That article is very interesting and yes, you're right, that's the problem. But the block of code is actually too hard for me to fix: I understand that the regex tries to match as many [^\\s()<>] as it can in too many combinations (backtracking exponentially), and that's fine, but how do I solve it? – UnableToLoad Jul 16 '11 at 09:11