I am writing an email scraper. For one particular URL, matcher.find() never returns a result: the call appears to hang. For other URLs the same code works fine.

Here is my code:

private Matcher matcher;
private Pattern pattern = null;
private final String emailPattern = "([\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Za-z]{2,4})";

public void scrape() {
   pattern = Pattern.compile(emailPattern);

   Document documentTwo = null;

   try {
      documentTwo = Jsoup.connect("https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
              .ignoreHttpErrors(true)
              .userAgent(RandomUserAgent.getRandomUserAgent())
              .header("Content-Language", "en-US")
              .get();
   } catch (IOException ex) {
      return; // nothing to scan if the page could not be fetched
   }

   String pageBody = documentTwo.toString();

   matcher = pattern.matcher(pageBody);

   while (matcher.find()) {
      // this will never execute for the above web address
   }
}

To check, I added System.out.println(matcher.find()); above the while loop, and it hangs there without printing anything. What am I doing wrong? I have tried many different email regex patterns, but the pattern above is the one that works for the other URLs.

VLAZ
H Athukorala
  • Is this the same problem as https://stackoverflow.com/questions/9687596/slow-regex-performance ? – Benjamin Eckardt Jun 03 '20 at 15:48
  • @BenjaminEckardt But why does this happen for just one website? Currently the issue is only with this website. – H Athukorala Jun 03 '20 at 15:56
  • That's quite a lot of content you have there. Maybe you can reduce `pageBody` and split it up, to find out whether that gives you a clue about any unusual parts of the content. – Benjamin Eckardt Jun 03 '20 at 16:02
  • `pageBody` is more than 245000 characters, and with the double match-many (`([\\.\\w])+[\\w]+`) there is a lot of backtracking going on. Removing that last `+` might help, e.g. use `"[\\w\\-][\\.\\w]+\\w@[\\w\\-\\.]+?\\.[A-Za-z]{2,4}"` – Andreas Jun 03 '20 at 16:04
  • This is also a regexp which by far won't find all valid e-mail addresses (e.g. a trailing `-` in the account is allowed). Seeing `\\w` put into `[]` and `.` escaped inside `[]` indicates you don't understand regexp. – AmigoJack Jun 03 '20 at 16:13
  • You might be ReDoS-ing yourself here. I'd recommend using `[^\s@]+@[^\s@]+` first, which should perform better, then possibly filtering out the false positives with a more precise regex. Matching email addresses with a regex is a tough job anyway. For instance, the regex you're using won't match valid addresses like `foo+bar@example.org`. – sp00m Jun 03 '20 at 16:20
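
For illustration, the two-stage idea from the last comment (a cheap candidate pattern first, a stricter filter second) might look like the sketch below. The class name `TwoPassEmailScan`, the method `findEmails`, and the `STRICT` pattern are assumptions made for this example and do not come from the question; the strict pattern is far from a complete address validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassEmailScan {
    // Pass 1: the coarse pattern from the comment. Neither character class can
    // match '@', so the two repeated parts never compete for the same characters.
    private static final Pattern CANDIDATE = Pattern.compile("[^\\s@]+@[^\\s@]+");

    // Pass 2: a stricter check that only ever runs on short candidate strings.
    // This particular pattern is an assumption chosen for illustration.
    private static final Pattern STRICT =
            Pattern.compile("[A-Za-z0-9._%+-]{1,64}@(?:[A-Za-z0-9-]+\\.)+[A-Za-z]{2,}");

    static List<String> findEmails(String text) {
        List<String> result = new ArrayList<>();
        Matcher candidates = CANDIDATE.matcher(text);
        while (candidates.find()) {
            // A candidate may carry punctuation or markup around the address,
            // so search inside it instead of requiring a full match.
            Matcher strict = STRICT.matcher(candidates.group());
            if (strict.find()) {
                result.add(strict.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findEmails("Write to foo+bar@example.org, not to user@localhost"));
    }
}
```

One caveat of this sketch: candidates cannot overlap, so when two addresses sit in the same whitespace-free run (as in raw HTML like `a@x.org">b@y.org`), the first candidate can swallow the start of the second and the second address is missed. Scanning extracted text rather than raw markup should reduce that.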

1 Answer

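The problem is in your regex: `([\\.\\w])+[\\w]+` places two repeated parts next to each other that both match word characters, so on a long run of such characters with no `@` after it the matcher backtracks through every way of splitting the run before giving up. Given below is the code with a regex that works on this page: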

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        Document documentTwo = null;
        try {
            documentTwo = Jsoup
                    .connect(
                            "https://www.mercurynews.com/2020/03/21/how-can-i-get-tested-for-covid-19-in-the-bay-area/")
                    .header("Content-Language", "en-US").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // documentTwo is still null here; stop instead of hitting a NullPointerException
        }

        String pageBody = documentTwo.toString();
        Pattern pattern = Pattern.compile(
                "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)");
        Matcher matcher = pattern.matcher(pageBody);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
fkelliher@bayareanewsgroup.com
lkrieger@bayareanewsgroup.com
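
The same addresses appear several times in the output above because they occur several times in the page. If only distinct addresses are wanted, one option is to collect the matches in a `LinkedHashSet`, which drops repeats and keeps first-seen order. The following is a minimal sketch using the same pattern; the class name `UniqueEmails` and the method `extract` are hypothetical names for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UniqueEmails {
    // Same pattern as in the code above, copied unchanged.
    private static final Pattern EMAIL = Pattern.compile(
            "([a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+)");

    static Set<String> extract(String text) {
        // LinkedHashSet ignores repeated values and preserves insertion order.
        Set<String> unique = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            unique.add(matcher.group());
        }
        return unique;
    }

    public static void main(String[] args) {
        String sample = "lkrieger@bayareanewsgroup.com x fkelliher@bayareanewsgroup.com y lkrieger@bayareanewsgroup.com";
        for (String email : extract(sample)) {
            System.out.println(email);
        }
    }
}
```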
Arvind Kumar Avinash
  • This regexp makes the same mistake of unnecessarily escaping characters in a class (`\.` inside `[]`), is redundant (two `+` inside the same `[]`) and is also technically wrong (the account part is 64 octets at most, see https://tools.ietf.org/html/rfc5321#section-4.5.3.1.1) – AmigoJack Jun 03 '20 at 16:40
  • Thank you very much for the quick and clear answer. I really appreciate the time and effort you put into answering this question. – H Athukorala Jun 09 '20 at 14:30
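
As a rough sketch of what the first comment above asks for, the answer's pattern could be tidied as follows: no escapes inside the character classes, a single `+`, and a local part capped at 64 characters. The class name `TidyEmailPattern` and the method `findAll` are hypothetical, and the pattern is still only an approximation of what a valid address looks like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TidyEmailPattern {
    static final Pattern EMAIL = Pattern.compile(
            // Local part: '.', '+', '_' and '%' need no escape inside [],
            // and '-' is literal when placed last. Capped at 64 characters.
            "[a-zA-Z0-9+._%-]{1,64}"
            // First domain label: starts with a letter or digit.
            + "@[a-zA-Z0-9][a-zA-Z0-9-]{0,64}"
            // One or more further labels, each introduced by a literal dot.
            + "(?:\\.[a-zA-Z0-9][a-zA-Z0-9-]{0,25})+");

    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Mail foo+bar@example.org or lkrieger@bayareanewsgroup.com today."));
    }
}
```

Note that the 64-character cap is only strict with `matches()`. With `find()`, an overlong local part can still produce a match that starts partway through it, so a scraper that cares about the limit would need an extra check on what precedes each match.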