0

I have the following two patterns to match an HTML tag name that may be preceded by spaces. The first pattern, where [ ]* is inside the named group <doubletag>, doesn't match, but the second pattern, where [ ]* immediately follows the tag symbol "<" and sits outside the group, does. I don't understand why the first one fails.

 String s = "<      pre href = \"dajflka\" >ld fjalj09u293 ^% </pre>";
 // First pattern: [ ]* inside the named group (does not match)
 Pattern ptr = Pattern.compile("(<(?<doubletag>[ ]*[a-z]+)([ \\d\\s\\w\\W[^>]])*>)(.*)(</\\k<doubletag>[ ]*>)");
 // Second pattern: [ ]* before the named group (matches); swap it in to compare
 // Pattern ptr = Pattern.compile("(<[ ]*(?<doubletag>[a-z]+)([ \\d\\s\\w\\W[^>]])*>)(.*)(</\\k<doubletag>[ ]*>)");
 Matcher match = ptr.matcher(s);
 if (match.find()) {
     System.out.println("Found");
 }
xingbin
Abdelrahman
  • Possible duplicate: https://stackoverflow.com/questions/4731055/whitespace-matching-regex-java – lexicore Apr 14 '18 at 15:36
  • Parsing HTML with regular expressions is not that precise; you should use something like [jsoup](https://jsoup.org/) for this kind of thing. – Titus Apr 14 '18 at 15:37
  • @Titus I know. I was just solving a regex problem on HackerRank. Thanks! – Abdelrahman Apr 14 '18 at 16:15

2 Answers

0

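\s matches any whitespace character (spaces, tabs, newlines). If that is what you want, use [\s]* (written as "[\\s]*" in a Java string literal) instead of [ ]*.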
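
To illustrate the difference between the two character classes, here is a minimal sketch. The class and method names (`WhitespaceClassDemo`, `spacesOnly`, `anyWhitespace`) are made up for this example and do not come from the question.

```java
import java.util.regex.Pattern;

class WhitespaceClassDemo {
    // [ ]* accepts only literal space characters.
    static boolean spacesOnly(String s) {
        return Pattern.matches("[ ]*", s);
    }

    // \s* accepts any whitespace: space, tab, newline, and so on.
    static boolean anyWhitespace(String s) {
        return Pattern.matches("\\s*", s);
    }

    public static void main(String[] args) {
        System.out.println(spacesOnly("   "));       // true
        System.out.println(spacesOnly(" \t\n"));     // false
        System.out.println(anyWhitespace(" \t\n"));  // true
    }
}
```

For the input in the question, which contains only plain spaces, both classes behave the same, so this choice does not explain why the first pattern fails.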

parlad
  • The choice between [ ]* and \\s* isn't the issue. The problem is that the first pattern doesn't match even though I accounted for the whitespace inside the group, as in "<(?<doubletag>[ ]*....", but when I try the second pattern, where I move the [ ]* before the group, as in "<[ ]*(?<doubletag>....", it works fine. To me, both look equivalent. So why doesn't the first one work? – Abdelrahman Apr 14 '18 at 16:20
0

Actually, the first pattern does match the leading whitespace. If you try only the opening-tag part of the first pattern:

String s = "<      pre href = \" dajflka \" >";
Pattern pattern = Pattern.compile("<(?<doubletag>[ ]*[a-z]+)([ \\d\\s\\w\\W[^>]])*>");
Matcher match = pattern.matcher(s);
if (match.find()) {
    System.out.println("Found");
    System.out.println(match.group("doubletag"));
}

you will see that the doubletag group captures:

"      pre"

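The problem is that the backreference \k<doubletag> must match exactly the text the group captured, including the leading spaces. The closing tag </pre> has no spaces between </ and pre, so the group (</\k<doubletag>[ ]*>) cannot match. That's why the first pattern cannot match the whole string, while the second pattern, which captures only "pre", can.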
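
A minimal sketch that runs both patterns from the question against the same input may make this easier to see. The class name `BackrefDemo`, the helper `capturedTag`, and the constant names are hypothetical; the regular expressions themselves are copied from the question, split only to avoid repeating the shared tail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Shared tail of both patterns: attributes, '>', body, and the closing
    // tag that uses the backreference to the named group.
    static final String TAIL = "([ \\d\\s\\w\\W[^>]])*>)(.*)(</\\k<doubletag>[ ]*>)";

    // [ ]* inside the group: the captured text includes the leading spaces.
    static final Pattern SPACES_INSIDE =
            Pattern.compile("(<(?<doubletag>[ ]*[a-z]+)" + TAIL);

    // [ ]* outside the group: only the tag name itself is captured.
    static final Pattern SPACES_OUTSIDE =
            Pattern.compile("(<[ ]*(?<doubletag>[a-z]+)" + TAIL);

    // Returns the captured tag name, or null when the pattern finds no match.
    static String capturedTag(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group("doubletag") : null;
    }

    public static void main(String[] args) {
        String s = "<      pre href = \"dajflka\" >ld fjalj09u293 ^% </pre>";
        System.out.println(capturedTag(SPACES_INSIDE, s));   // null
        System.out.println(capturedTag(SPACES_OUTSIDE, s));  // pre
    }
}
```

The first pattern would only succeed on input whose closing tag repeats the same spaces, for example "<  pre>x</  pre>", which is rarely what you want.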

xingbin