Extract Splunk domain from payload_printable field with regex

Question

I'm trying to extract a domain from the Splunk payload_printable field (the source is Suricata logs) and found that this regex works fine in most cases:

source="*suricata*" alert.signature="ET JA3*" 
| rex field=payload_printable "(?<dom>[a-zA-Z0-9\-\_]{1,}\.[a-zA-Z0-9\-\_]{2,}\.[a-zA-Z0-9\-\_]{2,})"
| table payload_printable, dom

The regular expression is:

(?<dom>[a-zA-Z0-9\-\_]{1,}\.[a-zA-Z0-9\-\_]{2,}\.[a-zA-Z0-9\-\_]{2,})

For example, if my payload_printable looks like this:

...........^aO+.t....]......$.....mT*l.......&.,.+.0./.$.#.(.'.
...........=.<.5./.
...].........activity.windows.com..........
.................
.......................#...........

The domain "activity.windows.com" is successfully extracted. However, it fails for the following payload, because the regex first matches another part that is not the domain:

...........^aO+]v;.~........:.Y.zORw._I..K>..&.,.+.0./.$.#.(.'.
...........=.<.5./.
...].........activity.windows.com..........
.................
.......................#...........

It extracts "Y.zORw._I".

Another example:

...........^h.'`.o2...
.y.k>..e.ef...]..8.G..&.,.+.0./.$.#.(.'.
...........=.<.5./.
...p.........arc.msn.com..........
.................
.......................#.........h2.http/1.1...................

I don't know how to fix this. Thank you for your help.
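
The behaviour described above can be reproduced outside Splunk. The following is a minimal sketch in Python, under a few assumptions: Python's `re` module spells the named group `(?P<dom>...)` instead of `(?<dom>...)`, the character class is written as `[a-zA-Z0-9_-]` (equivalent to the escaped form in the question), and the helper name `first_match` is hypothetical. It keeps only the first match, which is what `rex` does by default.

```python
import re

# The regex from the question, adapted to Python's named-group syntax.
# [a-zA-Z0-9_-] is equivalent to the original [a-zA-Z0-9\-\_].
QUESTION_RE = re.compile(
    r"(?P<dom>[a-zA-Z0-9_-]{1,}\.[a-zA-Z0-9_-]{2,}\.[a-zA-Z0-9_-]{2,})"
)

# First lines of the two payloads from the question (line breaks as \n).
PAYLOAD_OK = (
    "...........^aO+.t....]......$.....mT*l.......&.,.+.0./.$.#.(.'.\n"
    "...........=.<.5./.\n"
    "...].........activity.windows.com..........\n"
)
PAYLOAD_BAD = (
    "...........^aO+]v;.~........:.Y.zORw._I..K>..&.,.+.0./.$.#.(.'.\n"
    "...........=.<.5./.\n"
    "...].........activity.windows.com..........\n"
)


def first_match(payload):
    """Return the leftmost match of the regex, or None if there is none."""
    match = QUESTION_RE.search(payload)
    return match.group("dom") if match else None


print(first_match(PAYLOAD_OK))   # activity.windows.com
print(first_match(PAYLOAD_BAD))  # Y.zORw._I
```

In the second payload the random bytes at the start happen to form three dot-separated runs of allowed characters ("Y", "zORw", "_I"), and since the regex engine returns the leftmost match, the real domain further down is never reached.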

Also note that my regex is probably not optimal: a domain can have the form "abc.de", whereas the regex currently only looks for forms like "ab.cde.fg" or "a-b.c_def.ghi.jk". I'm not sure exactly how to handle this, maybe with a combination of optional items and a negative lookahead? – Sebastien Damaye, Mar 08 '20 at 08:26
How is a parser to know the domain is 'activity.windows.com' and not 'Y.zORw._I'? Is the domain field position-dependent? Is it always preceded or followed by a specific character? – RichG, Mar 08 '20 at 13:29
Removing the \_ near the end of the regex avoids the non-domain match in the example given (https://regex101.com/r/mmIPnn/1). However, to match all valid domains you may need a whopper of a regex. Just an example: (https://regex101.com/r/7tXksJ/1) – MDR, Mar 08 '20 at 23:15

score 1 · Answer 1 · answered Mar 11 '20 at 10:29

This regex will match domain names and correctly matches the two examples you gave:

"(?<dom>(?:[a-z0-9](?:[a-z0-9-_]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-_]{0,61}[a-z0-9])"

steoleary

Thanks a lot for your help. That solves many cases, but not all of them. For example, the payload ...........^h.'`.o2... .y.k>..e.ef...]..8.G..&.,.+.0./.$.#.(.'. ...........=.<.5./. ...p.........arc.msn.com.......... ................. .......................#.........h2.http/1.1................... resolves to "e.ef". – Sebastien Damaye Mar 11 '20 at 17:00
I feel like I should "ignore" the beginning of the payload so that the regex only considers the relevant part of it. – Sebastien Damaye Mar 11 '20 at 17:01
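
One possible way to act on that last idea is sketched below in Python. It is a heuristic that nobody in this thread proposed, so treat it as an assumption rather than a tested solution: instead of keeping the first match, collect every candidate and keep the longest one, on the guess that random bytes rarely form a longer domain-like string than the real server name. The pattern is the one from the answer, adapted to Python (`(?P<dom>...)` for the named group, literal hyphen moved to the end of each character class); the helper names `candidates` and `longest_candidate` are hypothetical.

```python
import re

# Pattern from the answer, adapted to Python's named-group syntax.
DOMAIN_RE = re.compile(
    r"(?P<dom>(?:[a-z0-9](?:[a-z0-9_-]{0,61}[a-z0-9])?\.)+"
    r"[a-z0-9][a-z0-9_-]{0,61}[a-z0-9])"
)

# Third payload from the question (line breaks written as \n).
PAYLOAD_MSN = (
    "...........^h.'`.o2...\n"
    ".y.k>..e.ef...]..8.G..&.,.+.0./.$.#.(.'.\n"
    "...........=.<.5./.\n"
    "...p.........arc.msn.com..........\n"
    ".................\n"
    ".......................#.........h2.http/1.1...................\n"
)


def candidates(payload):
    """Return every non-overlapping match, in order of appearance."""
    return [m.group("dom") for m in DOMAIN_RE.finditer(payload)]


def longest_candidate(payload):
    """Heuristic (an assumption): the longest candidate is the domain."""
    found = candidates(payload)
    return max(found, key=len) if found else None


print(candidates(PAYLOAD_MSN))         # ['e.ef', 'arc.msn.com', 'h2.http']
print(longest_candidate(PAYLOAD_MSN))  # arc.msn.com
```

In Splunk, the `max_match` option of `rex` can return all matches as a multivalue field, which would be the starting point for the same kind of filtering. The heuristic can still pick the wrong value when the noise contains a longer domain-like run than the real name, so it is best checked against a sample of real events first.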