12

I have a Java regex pattern and a sentence I'd like to match completely, but for some sentences the match erroneously fails. Why is this? (For simplicity, I won't use my complex regex, just ".*".)

System.out.println(Pattern.matches(".*", "asdf"));
System.out.println(Pattern.matches(".*", "[11:04:34] <@Aimbotter> 1 more thing"));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));

Output:

true
true
true
false

Note that the fourth sentence contains 10 Unicode control characters \u0085 in between the question marks, which aren't shown by normal fonts. The third and fourth sentences actually contain the same number of characters!

nhahtdh
Zom-B

4 Answers

13

use

Pattern.compile(".*",Pattern.DOTALL)

if you want . to match line terminators such as \u0085. By default, . matches any character except line terminators.

From JavaDoc:

"In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
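As a minimal sketch of the two options quoted above, the following compares the default behaviour with the Pattern.DOTALL flag and the embedded (?s) expression. The class name DotallDemo, its helper methods, and the sample input are made up for illustration; only Pattern.matches, Pattern.compile, and Pattern.DOTALL come from java.util.regex.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical helper: full match with default flags.
    static boolean matchesDefault(String input) {
        return Pattern.matches(".*", input);
    }

    // Hypothetical helper: full match with the DOTALL flag passed to Pattern.compile.
    static boolean matchesWithFlag(String input) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(input).matches();
    }

    // Hypothetical helper: full match with the embedded (?s) flag expression.
    static boolean matchesWithEmbeddedFlag(String input) {
        return Pattern.matches("(?s).*", input);
    }

    public static void main(String[] args) {
        // Sample input containing U+0085 (NEL, "next line").
        String input = "before\u0085after";
        System.out.println(matchesDefault(input));          // false
        System.out.println(matchesWithFlag(input));         // true
        System.out.println(matchesWithEmbeddedFlag(input)); // true
    }
}
```

The embedded form is convenient when the regex is assembled from string constants, since no call site has to change.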

Code in Pattern (there is your \u0085):

/**
 * Implements the Unicode category ALL and the dot metacharacter when
 * in dotall mode.
 */
static final class All extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return true;
    }
}

/**
 * Node class for the dot metacharacter when dotall is not enabled.
 */
static final class Dot extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return (ch != '\n' && ch != '\r'
                && (ch|1) != '\u2029'
                && ch != '\u0085');
    }
}
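To see the effect of that Dot check from the outside, here is a small sketch that tests a single "." against a handful of code points. The class DotTerminators and the helper dotMatches are hypothetical names, and the code points are given as hex integers rather than Unicode escapes to keep the source unambiguous. Note that (ch|1) != '\u2029' excludes both U+2028 and U+2029.

```java
import java.util.regex.Pattern;

public class DotTerminators {
    // Hypothetical helper: true if "." (default flags) matches this single code point.
    static boolean dotMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        // LF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR, then TAB, VT, FF for contrast.
        int[] candidates = {0x0A, 0x0D, 0x85, 0x2028, 0x2029, 0x09, 0x0B, 0x0C};
        for (int cp : candidates) {
            System.out.printf("U+%04X -> %b%n", cp, dotMatches(cp));
        }
    }
}
```

The first five code points should print false and the last three true, which shows that the dot rejects line terminators specifically, not control characters in general.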
rurouni
  • Thanks, (?s) worked. I didn't try Pattern.DOTALL because I have a ton of different compiled patterns, and I only had to use (?s) once (in a string constant that I include in most patterns). – Zom-B May 12 '11 at 17:12
4

The answer is in the question: the 10 Unicode control characters \u0085.

By default, .* does not match across \u0085, just as it does not match across \n, because both are line terminators.

djfoxmccloud
2

Unicode \u0085 is a newline character (NEL), so you have to either add (?s) (dot matches all) to the beginning of your regex or pass the flag when compiling the regex.

Pattern.matches("(?s).*", "blahDeBlah\u0085Blah")
josh.trow
  • Not `(?m)`. Multiline mode means that `^` and `$` match at start/end of lines. You want `(?s)` for singleline mode. Yes, it is confusing (the idea is to "treat the entire input as if it were a single line"). – Tim Pietzcker May 12 '11 at 13:34
1

The problem, I believe, is that \u0085 represents a newline. If you want . to match across line terminators, you need Pattern.DOTALL; Pattern.MULTILINE only changes where ^ and $ match and does not help here. It's not because the character is Unicode: '\n' would fail too.

To use it: Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches()
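Since the two flags are easy to mix up, here is a rough sketch contrasting them on an input that contains \u0085. The class MultilineVsDotall, its helpers fullMatch and find, and the sample strings are invented for this illustration.

```java
import java.util.regex.Pattern;

public class MultilineVsDotall {
    // Hypothetical helper: does the regex match the entire input?
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Hypothetical helper: does the regex match anywhere in the input?
    static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "first\u0085second"; // two "lines" separated by NEL

        // MULTILINE does not change what "." matches, so the full match still fails.
        System.out.println(fullMatch(".*", Pattern.MULTILINE, input)); // false
        // DOTALL lets "." cross the line terminator.
        System.out.println(fullMatch(".*", Pattern.DOTALL, input));    // true

        // MULTILINE only affects the anchors ^ and $.
        System.out.println(find("^second$", 0, input));                 // false
        System.out.println(find("^second$", Pattern.MULTILINE, input)); // true
    }
}
```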

Nick Fortescue