I am using a positive lookahead regular expression in Java to tokenize email addresses. I need to tokenize an email address (for example John.doe@abc.co.in) like this: .doe@abc.co.in, doe@abc.co.in, @abc.co.in, abc.co.in, .co.in, co.in, .in, in

I am using the following regex to tokenize the email address:

(?=([\@|\.|\!|\#|\$|\%|\&|\'|\*|\+|\-|\/|\=|\?|\^|\_|\`|\{|\||\}|\~](.+)))

This regex works perfectly and gives the expected result. Is there any possibility of catastrophic backtracking at some point when using this regex? If there is, what is an alternative way to tokenize email addresses?
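For reference, here is a minimal sketch of how the pattern above might be driven from Java to produce the tokens listed in the question. The class name `LookaheadTokenizer` and the method `tokenize` are hypothetical names chosen for illustration. The sketch assumes that both capture groups (separator plus rest, and rest alone) are collected at every match position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not from the question.
final class LookaheadTokenizer {
    // The pattern from the question, unchanged (only escaped for a Java string literal).
    private static final Pattern SEPARATOR_LOOKAHEAD = Pattern.compile(
        "(?=([\\@|\\.|\\!|\\#|\\$|\\%|\\&|\\'|\\*|\\+|\\-|\\/|\\=|\\?|\\^|\\_|\\`|\\{|\\||\\}|\\~](.+)))");

    static List<String> tokenize(String email) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SEPARATOR_LOOKAHEAD.matcher(email);
        // The lookahead matches an empty string, so find() moves on by one
        // character after each hit and visits every separator position.
        while (m.find()) {
            tokens.add(m.group(1)); // separator followed by the rest of the input
            tokens.add(m.group(2)); // the rest of the input without the separator
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [.doe@abc.co.in, doe@abc.co.in, @abc.co.in, abc.co.in, .co.in, co.in, .in, in]
        System.out.println(tokenize("John.doe@abc.co.in"));
    }
}
```

Each match rescans the remainder of the input with `.+`, so the total work grows roughly with the input length times the number of separators. That is polynomial growth, not the exponential blow-up usually meant by catastrophic backtracking.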

Yaqoob Bhatti
  • *"tokenize email address"*: can you provide examples (more than one, please) of input and expected output? – trincot Nov 26 '22 at 11:43
  • pop it in regex101.com, give it one sample on which it should FAIL and one with a very, very long domain part, and check whether a failure occurs – akash Nov 26 '22 at 14:25
  • As long as you don't use nested quantified groups like e.g. [`((a*b*)*c)*`](https://regex101.com/r/iV6ugK/1) or [`(?=((a*b*)*c))`](https://regex101.com/r/h3Z4MG/1) I don't think you will run into [any issues](https://www.rexegg.com/regex-explosive-quantifiers.html). Also have a look at [character classes](https://www.regular-expressions.info/charclass.html). You don't need to alternate between characters inside a class. It's a *defined set* of characters. Also no need for so much escaping: [`(?=([@.!x#$%&'*+=?\`{}_~\/\^\-](.+)))`](https://regex101.com/r/gCJLfj/1) will suffice. – bobble bubble Nov 26 '22 at 15:11
  • Regexes like these, which are triggered at any position in a string, are certainly costly and can still [soon time out](https://regex101.com/r/wrrIgM/1) if you use them on *too long input*, so it is best to precheck the length. AFAIK that's not what is meant by *catastrophic backtracking*. When you are indeed dealing with exponential growth, the best friends to prevent [runaway regexes](https://www.regular-expressions.info/toolong.html) are [possessive quantifiers](https://www.regular-expressions.info/possessive.html) and [atomic groups](https://www.regular-expressions.info/atomic.html). – bobble bubble Nov 26 '22 at 15:35
  • To make your pattern a bit more efficient, you can move the separator-part out of the lookahead, e.g. [`([@.!x#$%&'*+=?\`{}_~\/\^\-])(?=(.+))`](https://regex101.com/r/gCJLfj/2) – bobble bubble Nov 26 '22 at 15:53
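
The suggestions in the comments above (a plain character class, the separator moved out of the lookahead, and a length precheck) could be combined roughly as follows. This is only a sketch under stated assumptions: the class name `SeparatorTokenizer`, its method names, and the `MAX_LENGTH` cap of 254 are all hypothetical and chosen for illustration. The separator set is taken from the regex in the question, not from the comment's variant. A regex-free scan is included as one possible alternative; for input without line breaks it is intended to produce the same tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are assumptions for illustration.
final class SeparatorTokenizer {
    // Arbitrary cap for the length precheck (an assumption, adjust as needed).
    static final int MAX_LENGTH = 254;

    // Separator characters taken from the regex in the question.
    private static final String SEPARATORS = "@.!#$%&'*+-/=?^_`{|}~";

    // Same set as a character class: no alternation, only '-' escaped.
    // The separator is consumed; the lookahead only captures the remainder.
    private static final Pattern SEPARATOR_THEN_REST =
        Pattern.compile("([@.!#$%&'*+\\-/=?^_`{|}~])(?=(.+))");

    static List<String> tokenizeWithRegex(String email) {
        checkLength(email);
        List<String> tokens = new ArrayList<>();
        Matcher m = SEPARATOR_THEN_REST.matcher(email);
        while (m.find()) {
            tokens.add(m.group(1) + m.group(2)); // separator plus remainder
            tokens.add(m.group(2));              // remainder only
        }
        return tokens;
    }

    // Regex-free alternative: a single pass over the characters.
    static List<String> tokenizeWithScan(String email) {
        checkLength(email);
        List<String> tokens = new ArrayList<>();
        // Stop one short of the end so at least one character follows the separator.
        for (int i = 0; i < email.length() - 1; i++) {
            if (SEPARATORS.indexOf(email.charAt(i)) >= 0) {
                tokens.add(email.substring(i));
                tokens.add(email.substring(i + 1));
            }
        }
        return tokens;
    }

    private static void checkLength(String email) {
        if (email.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("input longer than " + MAX_LENGTH + " characters");
        }
    }

    public static void main(String[] args) {
        // Both calls print
        // [.doe@abc.co.in, doe@abc.co.in, @abc.co.in, abc.co.in, .co.in, co.in, .in, in]
        System.out.println(tokenizeWithRegex("John.doe@abc.co.in"));
        System.out.println(tokenizeWithScan("John.doe@abc.co.in"));
    }
}
```

Neither variant contains nested quantifiers, so there is nothing for the engine to backtrack over exponentially. The remaining cost is the copying of each suffix, which the length cap keeps bounded.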
