Why does matcher.find() not return any result? Why does it freeze?

Question

I am creating an email scraper. For one particular URL, matcher.find() never returns a boolean result; as far as I can see, it freezes. For other URLs the code works fine.

Here is my code:

private Matcher matcher;
private Pattern pattern = null;
private final String emailPattern = "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})";

public void scrape() {
   pattern = Pattern.compile(emailPattern);

   Document documentTwo = null;

   try {
      documentTwo = Jsoup.connect("https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
              .ignoreHttpErrors(true)
              .userAgent(RandomUserAgent.getRandomUserAgent())
              .header("Content-Language", "en-US")
              .get();
   } catch (IOException ex) {
      return; // nothing to scan if the page could not be fetched
   }

   String pageBody = documentTwo.toString();

   matcher = pattern.matcher(pageBody);

   while (matcher.find()) {
      // this will never execute for the above web address
   }
}

To check, I added System.out.println(matcher.find()); above the while loop, and it gets stuck there without printing any value. What am I doing wrong? I have tried many different email regex patterns, but the pattern above is the one that works.

Is it the same problem https://stackoverflow.com/questions/9687596/slow-regex-performance ? — Benjamin Eckardt, Jun 03 '20 at 15:48
@BenjaminEckardt But why does this happen for just one website? Currently the issue is only with this website. — H Athukorala, Jun 03 '20 at 15:56
It's quite some content you have there. Maybe you can reduce `pageBody`, and split it in order to find out if that gives you a clue about any extraordinary parts in the content. — Benjamin Eckardt, Jun 03 '20 at 16:02
`pageBody` is more than 245000 characters, and with the double match-many (`([\\.\\w])+[\\w]+`) there is a lot of backtracking going on. Removing that last `+` might help, e.g. use `"[\\w\\-][\\.\\w]+\\w@[\\w\\-\\.]+?\\.[A-Za-z]{2,4}"` — Andreas, Jun 03 '20 at 16:04
This is also a regexp which by far won't find all valid e-mail addresses (i.e. a trailing `-` in the account is allowed). Seeing `\\w` put into `[]` and escaping `.` in `[]` indicates you don't understand regexp. — AmigoJack, Jun 03 '20 at 16:13
You might be ReDoS-ing yourself here. I'd recommend using `[^\s@]+@[^\s@]+` first which should perform better, then possibly filter out the false positives with a more precise regex. Matching email addresses with a regex is a tough job anyway. For instance, the regex you're using won't match valid addresses like `foo+bar@example.org`. — sp00m, Jun 03 '20 at 16:20
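
The two-pass idea from the last comment could look roughly like the sketch below. The class name `TwoPassEmailFinder`, the method `findEmails`, and the `STRICT` pattern are assumptions made for illustration; `STRICT` is not a complete RFC 5321 check. Only the `CANDIDATE` pattern is taken from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassEmailFinder {
    // Pass 1: the cheap candidate pattern from the comment above.
    private static final Pattern CANDIDATE = Pattern.compile("[^\\s@]+@[^\\s@]+");

    // Pass 2: an illustrative stricter pattern (an assumption, not a full RFC check).
    private static final Pattern STRICT =
            Pattern.compile("[A-Za-z0-9._%+-]{1,64}@(?:[A-Za-z0-9-]+\\.)+[A-Za-z]{2,}");

    static List<String> findEmails(CharSequence text) {
        List<String> emails = new ArrayList<>();
        Matcher candidates = CANDIDATE.matcher(text);
        while (candidates.find()) {
            // The stricter pattern only ever sees one candidate at a time,
            // and find() also trims surrounding punctuation or markup.
            Matcher strict = STRICT.matcher(candidates.group());
            if (strict.find()) {
                emails.add(strict.group());
            }
        }
        return emails;
    }
}
```

Calling `TwoPassEmailFinder.findEmails(pageBody)` would replace the single `while (matcher.find())` loop. One caveat: in markup without whitespace, a candidate can swallow the beginning of a neighbouring address, so this probably works better on extracted text than on raw HTML.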

score 1 · Accepted Answer · answered Jun 03 '20 at 16:11

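The problem is in your regex: as the comments on the question point out, the adjacent repetitions `([\\.\\w])+[\\w]+` cause a lot of backtracking on a page this large. Given below is the code with a working regex: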

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        Document documentTwo = null;
        try {
            documentTwo = Jsoup
                    .connect(
                            "https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
                    .header("Content-Language", "en-US").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // no document to scan; avoids a NullPointerException below
        }

        String pageBody = documentTwo.toString();
        Pattern pattern = Pattern.compile(
                "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)");
        Matcher matcher = pattern.matcher(pageBody);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com

This regexp makes the same mistake of unnecessarily escaping characters in a class (`\.` inside `[]`), is redundant (two `+` inside the same `[]`), and is also technically wrong (the account part is 64 octets at most, see https://tools.ietf.org/html/rfc5321#section-4.5.3.1.1) — AmigoJack, Jun 03 '20 at 16:40
Thank you very much for the quick and clear answer. I really appreciate the time and effort you put into answering this question. — H Athukorala, Jun 09 '20 at 14:30
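
To make the backtracking cost from the question's comments concrete, here is a small sketch that counts how many characters the regex engine reads during one failing `find()`. The class `BacktrackingDemo`, the wrapper `CountingSequence`, and the method `readsForRun` are hypothetical names introduced for this illustration. The two patterns are the one from the question and the simplified one suggested in the comments. The input is a plain run of word characters with no `@`, which is an assumption about what might trigger the slowdown, not something verified against the page in question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class BacktrackingDemo {
    /** CharSequence wrapper that counts every character the regex engine reads. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;

        CountingSequence(String text) { this.text = text; }

        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public int length() { return text.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return text.subSequence(start, end);
        }
        @Override public String toString() { return text; }
    }

    // Pattern from the question.
    static final String ORIGINAL =
            "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})";
    // Pattern suggested in the comments on the question.
    static final String SIMPLIFIED =
            "[\\w\\-][\\.\\w]+\\w@[\\w\\-\\.]+?\\.[A-Za-z]{2,4}";

    /** Runs one find() over a run of n 'a' characters and returns the number of reads. */
    static long readsForRun(String regex, int n) {
        char[] run = new char[n];
        Arrays.fill(run, 'a');
        CountingSequence input = new CountingSequence(new String(run));
        Pattern.compile(regex).matcher(input).find(); // no '@' in the input, so this fails
        return input.reads;
    }
}
```

Comparing `readsForRun(ORIGINAL, n)` with `readsForRun(SIMPLIFIED, n)` for growing `n` should show the original pattern's work growing much faster, because both `([\\.\\w])+` and `[\\w]+` can claim the same characters and the engine tries every way of splitting a run between them. A page with long unbroken runs of word characters (minified scripts or encoded data, for example) would then be far slower to scan than ordinary prose, which may be why only some URLs appear to hang.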
