
This question is a follow-up to the following post: Javascript regex: Find all URLs outside <a> tags - Nested Tags

I discovered that the code:

\b((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

is extremely inefficient compared to running it separately for the http and ftp parts, like this:

\b(https?:\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

and

\b(ftps?:\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

Here are examples at regex101.com:

However, on one of my HTML pages, these patterns compare as 85628 steps vs. 7258 + 795 steps, which is quite insane.

As far as I have seen, using an (x|y) alternation usually reduces the number of steps, but here, for some strange reason, it does the opposite.

Any help would be appreciated.

Klaidonis

2 Answers


It seems that you are a victim of catastrophic backtracking.

This regex does the trick in just 3492 steps:

\b(?>(https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

All I have done is make the first group an atomic group, which causes the engine to discard all backtracking options inside the group once it has matched.

That's correct in your case: you can think of it now as two parts, "find a URL" then "use the negative lookahead to decide if we want to keep it". Your original regex would, when the lookahead failed, proceed to backtrack into the URL-matching expression. The [^"<\s]+ block would give back some characters, then the engine would try the lookahead again, then give back some more characters, and try again, and so on...

The reason the addition of the https?|ftps? part made it so much worse was that this provides an extra source of backtracking (losing the optional s) in a way that allows all the later backtracking to happen all over again.
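
One caveat, since the question is about JavaScript: as far as I know, JavaScript's built-in RegExp does not accept the (?>...) atomic group syntax. A common workaround is to capture inside a lookahead and then consume the capture with a backreference, because the engine does not backtrack into a lookahead once it has succeeded. The following is only a minimal sketch of that idea; the helper name findBareUrls and the sample HTML are made up for illustration.

```javascript
// Emulating the atomic group (?>...) in JavaScript: a lookahead is not
// re-entered once it has succeeded, so we capture the URL inside the
// lookahead and then consume exactly that text with \1.
const urlOutsideTags =
  /\b(?=((?:https?|ftps?):\/\/[^"<\s]+))\1(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical helper (not from the original post): returns every URL that
// passes the negative lookahead from the question, i.e. is neither inside
// a tag nor inside the text of an <a> element.
function findBareUrls(html) {
  return Array.from(html.matchAll(urlOutsideTags), (m) => m[0]);
}

// Made-up sample input: one bare URL, one in an href, one as link text.
const sample =
  '<p>See http://example.com/a and ' +
  '<a href="http://example.com/b">http://example.com/c</a></p>';

console.log(findBareUrls(sample)); // [ 'http://example.com/a' ]
```

Only the first URL is reported: the one in the href is rejected because a > follows before any <, and the link text is rejected because </a follows it.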

You know that regex101.com has a "regex debugger" option on the toolbar on the left? If you use that, it explains how your regex matches, so you can (as I just did) figure out where the crazy backtracking is.

Bonus edit: A further improved one that only takes 3185 steps:

\b(?>ht|f)tps?:\/\/(?>[^"<\s]+)(?![^<>]+>|[^"]*?<\/a)
Chris Kitching
  • Thank you very much! I was aware of the catastrophic backtracking that I also found with the help of debugger but couldn't solve it. – Klaidonis Mar 12 '16 at 11:30
  • The trick is to figure out which parts of your regex can survive becoming atomic like this. Your example is an easy one, but many regexes require lots of backtracking to work right. Sometimes it's possible to rejig the regex a bit to make it more amenable to this trick. – Chris Kitching Mar 12 '16 at 11:31

If you are looking to find all links in the document, then this is the solution. It returns an array-like collection (an HTMLCollection):

document.anchors
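
Note that document.anchors only includes <a> elements that have a name attribute, whereas document.links covers <a> and <area> elements that have an href. If the goal is the URLs themselves, something like the sketch below may be closer to what is wanted. The helper name collectLinkUrls is made up, and it takes the document as a parameter only so it can be tried outside a browser.

```javascript
// Hypothetical helper: collects the href of every link in a Document-like
// object. document.links is an HTMLCollection, not a real array, so
// Array.from is used to convert it and map each element in one pass.
function collectLinkUrls(doc) {
  return Array.from(doc.links, (link) => link.href);
}
```

In a browser this would be called as collectLinkUrls(document).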
Naresh Teli