14

I have developed a regular expression to identify a block of XML inside a text file. The expression looks like this (I have removed all Java escape slashes to make it easier to read):

<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>

Then I optimised it by replacing [\s\S]*? with .*?, and it suddenly stopped recognising the XML.

As far as I know, \s means all whitespace characters and \S means all non-whitespace characters (i.e. [^\s]), so [\s\S] should logically be equivalent to . (dot). I didn't use greedy quantifiers, so what could be the difference?

Neuron
Dmitry
  • 3
    By default `.` doesn't match line separators. It can match all characters (including line separators) if you use the `Pattern.DOTALL` flag. `[\s\S]` is a set that includes all whitespace `\s` and all non-whitespace `\S`, effectively representing all characters (including line separators). – Pshemo Feb 07 '16 at 02:17
  • The trailing ? contributes nothing in both cases. – user207421 Feb 07 '16 at 07:34
  • A very related one: [*What's the difference between these RegEx*](http://stackoverflow.com/a/14648811/3832970) – Wiktor Stribiżew Feb 07 '16 at 09:19
  • Fantastic question, I'm really surprised it doesn't have more upvotes. – setholopolus Aug 06 '19 at 20:33

3 Answers

19

The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (such as a newline) by default.

According to the Oracle documentation for java.util.regex.Pattern, . matches

Any character (may or may not match line terminators)

while a line terminator is any of the following:

  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029').
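
To make the list concrete, here is a minimal sketch that tests each single-character line terminator against both expressions with default flags. The class and method names (`LineTerminatorCheck`, `dotMatches`, `classMatches`) are made up for illustration; only `Pattern.matches` comes from the JDK.

```java
import java.util.regex.Pattern;

class LineTerminatorCheck {
    // The single-character line terminators from the list above.
    // ("\r\n" is simply two of these in a row.)
    static final String[] TERMINATORS = {"\n", "\r", "\u0085", "\u2028", "\u2029"};

    // Does a lone dot match the whole (one-character) input?
    static boolean dotMatches(String s) {
        return Pattern.matches(".", s);
    }

    // Does the whitespace-or-non-whitespace class match the whole input?
    static boolean classMatches(String s) {
        return Pattern.matches("[\\s\\S]", s);
    }

    public static void main(String[] args) {
        for (String t : TERMINATORS) {
            System.out.printf("U+%04X  dot=%b  class=%b%n",
                    (int) t.charAt(0), dotMatches(t), classMatches(t));
        }
    }
}
```

With default flags, the dot should be rejected for every entry while the character class accepts all of them.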

The two expressions only behave the same when the DOTALL flag is set. Again quoting the Oracle documentation:

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
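
The effect of the two flags quoted above can be sketched as follows. The helper name `matches` and the sample strings are assumptions chosen for this example; `Pattern.DOTALL`, `Pattern.UNIX_LINES` and the embedded `(?s)` flag are part of `java.util.regex`.

```java
import java.util.regex.Pattern;

class DotFlagsDemo {
    // Compile the regex with the given flags and test the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default mode: the dot refuses to match the newline.
        System.out.println(matches("a.b", 0, "a\nb"));                  // false
        // DOTALL: the dot matches any character, including the newline.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));     // true
        // The embedded flag (?s) is the inline spelling of DOTALL.
        System.out.println(matches("(?s)a.b", 0, "a\nb"));              // true
        // UNIX_LINES: only a newline counts as a line terminator,
        // so the dot now matches a carriage return but still not a newline.
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb")); // true
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\nb")); // false
    }
}
```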

Neuron
4

Here is a sheet explaining all the regex commands.

Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set for that).

Neuron
z7r1k3
  • Yes, every \ has been double escaped. I removed the double slashes just to make it easier to read. The expression works, but stops working as soon as I replace `[\s\S]*?` with `.*?`, so the difference must be there. – Dmitry Feb 07 '16 at 02:07
  • this is real expression: `<\\?xml\\s+version=\"[\\d\\.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>[\\s\\S]*?<\\s*\\/\\s*rdf:RDF\\s*>` – Dmitry Feb 07 '16 at 02:08
  • This is not true. `.` may escape new lines, depending on certain flags. Have a look at my answer for all the details.. – Neuron Feb 07 '16 at 02:15
  • 1
    @Neuron the source I quoted states that `.` will not catch newlines. That's what I was going off of. I now realize it's probably not as credible as I thought. – z7r1k3 Feb 07 '16 at 02:17
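
For reference, here is a rough sketch that runs the escaped expression from Dmitry's comment against a multi-line input. The sample text and the names `RdfBlockDemo`, `HEAD`, `TAIL`, `SAMPLE` and `found` are invented for this illustration; the regex pieces are copied from the comment, with only the middle part swapped out.

```java
import java.util.regex.Pattern;

class RdfBlockDemo {
    // The parts of the expression before and after the "match anything" section.
    static final String HEAD =
            "<\\?xml\\s+version=\"[\\d\\.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>";
    static final String TAIL = "<\\s*\\/\\s*rdf:RDF\\s*>";

    // Made-up sample input: the RDF block spans several lines.
    static final String SAMPLE =
            "some text before\n"
            + "<?xml version=\"1.0\"?>\n"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description rdf:about=\"urn:example:item\"/>\n"
            + "</rdf:RDF>\n"
            + "some text after\n";

    // Build HEAD + middle + TAIL and report whether it occurs in SAMPLE.
    static boolean found(String middle, int flags) {
        return Pattern.compile(HEAD + middle + TAIL, flags).matcher(SAMPLE).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[\\s\\S]*?", 0));       // true: the class crosses newlines
        System.out.println(found(".*?", 0));              // false: the dot stops at the first newline
        System.out.println(found(".*?", Pattern.DOTALL)); // true: DOTALL lets the dot cross newlines
    }
}
```

If changing the `Pattern.compile` call is inconvenient, prefixing the pattern with `(?s)` should have the same effect as passing `Pattern.DOTALL`.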
0

It works the same way in JavaScript, although I am not used to Java. Java is a programming language, and it is very useful in real life.

  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 05 '23 at 10:43