I am trying to parse some logs that contain non-printable Unicode characters:

02 Aug 2018 18:00:00,531 ^[[32m[TEXT]^[[m  (ussouth-dc2-ms-2012) This.is.test.log: Service is responding normal

How can I skip `^[[32m` and `^[[m` when matching?

^([0-9]{2}\s[A-Za-z]{3}\s[0-9]{4}\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3})?)\s(?:\^\[\[[0-9]{2}m)\[([A-Za-z]+)\](?:\^\[\[m)\s(.*)

My current regex treats them as ordinary characters. That works when I copy and paste the line into an online regex tester, but when I use the system's regex engine (possibly Java) it fails to parse, because the file contains a non-printable Unicode character.

Fenomatik
  • Post the **exact** error message you are getting as an **edit** to your question, because the most likely cause is missing double escape sequences. To match a literal backslash you have to escape it twice, which results in 4 backslash characters. –  Aug 13 '18 at 18:43
  • `^[` is the character representation of an *escape* character (`\u001B`), which is a non-printable character. The actual text is `\u001B[m`, not `^[[m`. – Andreas Aug 13 '18 at 18:47
  • @Andreas What will the second set of non-capturing and capturing groups look like? `(?:\^\[\[[0-9]{2}m)\[([A-Za-z]+)\](?:\^\[\[m)` – Fenomatik Aug 13 '18 at 18:51
  • To match `\u001B[m`, use regex `" ... \u001B\\[m ... "` – Andreas Aug 13 '18 at 18:53
  • @Andreas That covers the last part: `\u001B[m` matches `^[[m`. What about the first part, `^[[32m`? – Fenomatik Aug 13 '18 at 18:57
  • Did you write the original regex? If so, you should be able to add the extra pattern to allow `32` in the regex I showed. – Andreas Aug 13 '18 at 18:58
  • @Andreas Yes, I wrote the original regex. Does this look right: `(?:\u001B\\[m\[[0-9]{2}m)\[([A-Za-z]+)\](?:\u001B\\[m\[m)` to match `[TEXT]` and skip `^[[32m` and `^[[m`? – Fenomatik Aug 13 '18 at 19:05
  • Do you need something that matches any ANSI escape sequence, or just those two specific ones? – Shawn Aug 13 '18 at 19:11
  • I said `\u001B\\[m`, so why do you say `\u001B\\[m\[m`? --- How did adding match for `32` (or any 2 digits) to `\u001B\\[m` become `\u001B\\[m\[[0-9]{2}m)`? --- Why the extra `\[` and the extra `m`? --- Do you even understand the constituent parts of `\u001B\\[m`, i.e. `\u001B` (match escape char), `\\[` (match open bracket), and `m` (match letter `m`)? – Andreas Aug 13 '18 at 19:12
  • @Andreas I am new to regex, so I am trying my best. All I want is for `^[[32m` and `^[[m` to be non-capturing groups and only TEXT to be captured. `(?:\u001B\\[[0-9]{2}m)\[([A-Za-z]+)\](?:\u001B\\[m)` is the corrected one. – Fenomatik Aug 13 '18 at 19:25
  • Let me re-phrase: `^[[32m` is *fake*. That is *not* the text your code sees. The `^[` at the beginning is a *printable representation* of an ESCAPE character, which is a non-printable character. The Java equivalent in a string literal is `\u001B`. So, if your original regex works fine with the fake text containing `^[`, then replacing `\^\[` with `\u001B` in the regex will correct the regex to work with the *real* text. – Andreas Aug 13 '18 at 20:09
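
A minimal Java sketch of the substitution described in the comments above: the question's regex with `\^\[` replaced by the real escape character. The class name `AnsiLogParser`, the `parse` helper, and the `ESC` constant are hypothetical names chosen for illustration. Using `\s+` before the message is also an assumption, made because the sample line has two spaces after the reset sequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnsiLogParser {
    // The real ESC character (U+001B) that the file contains where a
    // terminal or editor displays the two printable characters "^[".
    private static final String ESC = "\u001B";

    // The question's regex with the caret-notation "^[" replaced by ESC.
    // Assumption: \\s+ (not \\s) before the message, to absorb both spaces.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^([0-9]{2}\\s[A-Za-z]{3}\\s[0-9]{4}\\s[0-9]{2}:[0-9]{2}:[0-9]{2}(?:,[0-9]{3})?)\\s"
        + "(?:" + ESC + "\\[[0-9]{2}m)"   // colour sequence, e.g. ESC [ 3 2 m
        + "\\[([A-Za-z]+)\\]"             // captured level, e.g. TEXT
        + "(?:" + ESC + "\\[m)"           // reset sequence, ESC [ m
        + "\\s+(.*)");                    // rest of the line

    /** Returns {timestamp, level, message}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // The sample line from the question, built with real ESC characters.
        String line = "02 Aug 2018 18:00:00,531 " + ESC + "[32m[TEXT]" + ESC + "[m"
            + "  (ussouth-dc2-ms-2012) This.is.test.log: Service is responding normal";
        String[] parts = parse(line);
        System.out.println(parts == null ? "no match" : String.join(" | ", parts));
    }
}
```

This sketch only handles the two specific sequences from the sample line. A line pasted with the printable `^[` text instead of the real escape character will not match this pattern, which is the mismatch between the online tester and the real file.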
