I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):

2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434@test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)

I want to parse out each key-value pair in the string that follows the Key(value) format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/

%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?

This works when there are () within the nested string like this: Mcdonalds (CFH) Restaurant Glen

...but the value is dynamic and could also appear without () like this: Mcdonalds Restaurant Glen

I am trying to build a regex that accounts for both scenarios with this portion of the grok pattern:

?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?

Currently, though, this parses the case without parentheses like this:

  "senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
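
The overrun can be reproduced outside the Grok debugger. The following is only a minimal sketch in Python, under the assumption that Python's re module treats the lazy quantifiers in this fragment the same way Grok's regex engine does. Python spells named groups (?P<name>...), and the names FRAGMENT and sender are made up for the illustration.

```python
import re

# Same fragment as in the grok pattern, with Python's (?P<name>...) group syntax.
FRAGMENT = re.compile(r"SenderPartyName\((?P<senderpartyname>.+?\).+?)\)")

def sender(text):
    """Return the captured senderpartyname, or None when nothing matches."""
    m = FRAGMENT.search(text)
    return m.group("senderpartyname") if m else None

with_parens = "SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN)"
without_parens = "SenderPartyName(Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN)"

print(sender(with_parens))     # Mcdonalds (CFH) Restaurant Glen
print(sender(without_parens))  # Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN
```

The fragment always requires a literal ) inside the captured value, so when the value has no parentheses of its own the match has to run on to the next field's closing parenthesis.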

...whereas the desired result is one of the following, depending on the string:

"senderpartyname": "Mcdonalds Restaurant Glen"

or

"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"
nobrac
  • Try `%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}(?:\s+Id\(%{DATA:id}\))?(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?\((?(?:[^()]++|\g)*)\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?` – Wiktor Stribiżew Jun 01 '22 at 20:15
  • That doesn't seem to work unfortunately (tested in debugger) – nobrac Jun 01 '22 at 20:24
  • Aha, try `%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?` now. It seems to work in the debugger. – Wiktor Stribiżew Jun 01 '22 at 20:36
  • Nice, thank you! Is there a way to remove the outer parentheses from the parsed result? I tried adding some escape characters to the pattern you provided but it seems to error out. @WiktorStribiżew – nobrac Jun 01 '22 at 21:37
  • Oniguruma is bad at this. – Wiktor Stribiżew Jun 01 '22 at 22:06
  • Can you accept a solution with parentheses in that field? I doubt there is a way to get rid of them with just regex (it is not PCRE unfortunately). – Wiktor Stribiżew Jun 03 '22 at 11:37
  • Yeah, it's better than getting the string cut off – nobrac Jun 03 '22 at 12:25

1 Answer

You can use

%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?

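Note that I restructured the pattern: instead of an optional space followed by an optional field, each optional field is now one optional non-capturing group that requires one or more whitespace characters followed by the field itself. Making the whole sequence optional, rather than its parts, makes matching more efficient. Also note that Id is required in this version.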

The main change is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, which matches:

  • (?: - start of a non-capturing group:
    • \s+ - one or more whitespaces
    • SenderPartyName - a fixed word
    • (?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": a ( char (matched with \(), then zero or more repetitions of either one or more chars other than ( and ) or the whole Group "senderpartyname" pattern recursed (see (?:[^()]++|\g<senderpartyname>)*), and then a ) char (matched with \))
  • )? - end of the group, one or zero repetitions (optional)
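
Since the captured field keeps its outer parentheses, one option is to strip them after parsing. The following is a rough Python sketch of the same balanced-parentheses idea, written as a plain depth counter rather than a recursive regex (Python's re module has no \g<name> recursion). The helper name balanced_value is hypothetical, and the sketch assumes every field is preceded by a space, as in the sample log line.

```python
def balanced_value(text, key):
    """Return the text inside key(...), honoring nested parentheses, or None."""
    marker = " " + key + "("       # assumption: a space precedes every field
    start = text.find(marker)
    if start == -1:
        return None                # field is absent
    first = start + len(marker)    # index of the first char after the opening "("
    depth = 1                      # we are inside one open parenthesis
    for pos in range(first, len(text)):
        if text[pos] == "(":
            depth += 1
        elif text[pos] == ")":
            depth -= 1
            if depth == 0:         # this ")" closes the field itself
                return text[first:pos]
    return None                    # parentheses never balanced

line = ("2021/05/23 16:01:26.094 High Messaging.Message.Delivered "
        "Id(ci1653336085475.12327434@test_te) MessageId(EPIUM#1130754#84601671) "
        "SenderPartyName(Mcdonalds (CFH) Restaurant Glen) "
        "ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)")

print(balanced_value(line, "SenderPartyName"))    # Mcdonalds (CFH) Restaurant Glen
print(balanced_value(line, "ReceiverPartyName"))  # TEST_HERE_AGAIN
```

Because the counter only returns at the ) that brings the depth back to zero, the value comes out without its outer parentheses whether or not it contains nested ones.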
Wiktor Stribiżew