I've been struggling to get a regex working. It's used by Promtail to parse labels from my logs. The problem is that it doesn't work with positive lookahead (I think because Promtail is written in Go?).

Anyway the logs are web logs and here are a few examples:

INFO:     172.0.0.1:0 - "POST /endpoint1/UNIQUE-ID?key=unique_value HTTP/1.1" 200 OK
INFO:     172.0.0.2:0 - "GET /endpoint/health HTTP/1.1" 200 OK
172.0.0.1:0 - - [04/Mar/2022:10:52:10 -0500] "GET /endpoint2/optimize HTTP/1.1" 200 271
INFO:     172.0.0.3:0 :0 - "GET /endpoint3?key=unique_value HTTP/1.1" 200 OK

Another thing worth pointing out is that the UNIQUE-ID is going to be a VIN (vehicle identification number).

The groups I'm looking to create are: ip, request, endpoint, status. However, because of all the UNIQUE-ID values in endpoint1 and the unique_value parameters in endpoint1 and endpoint3, using the full endpoint path creates too many streams in Loki and essentially kills it.

My solution regex looks like this:

(?P<ip>((?:[0-9]{1,3}\.){3}[0-9]{1,3})).+(?P<request>(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)).(?P<endpoint>(.+endpoint1\/health)|(.+endpoint1)|(.+)(\?)|(.+) ).+\".(?P<status>([0-9]{3}))

And it captures the following groups:

ip: `172.0.0.1`, `172.0.0.2`, `172.0.0.1`, `172.0.0.3`
request: `POST`, `GET`, `GET`, `GET`
endpoint: `/endpoint1`, `/endpoint1/health`, `/endpoint2/optimize `, `/endpoint3?`
status: `200`, `200`, `200`, `200`

The problem is the endpoint captures for `/endpoint2/optimize` and `/endpoint3`: the first has a trailing space and the second includes the `?`. I was able to get this working using positive lookahead with the following regex, but it throws an error in Promtail.

(?P<ip>((?:[0-9]{1,3}\.){3}[0-9]{1,3})).+(?P<request>(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)).(?P<endpoint>(.+endpoint1\/health)|(.+endpoint1)|(.+)(?=\?)|(.+)(?= )).+\".(?P<status>([0-9]{3}))
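For what it's worth, the error is consistent with how Go's standard `regexp` package behaves: it implements RE2 syntax, which has no lookahead or lookbehind. Here is a minimal sketch, assuming Promtail compiles its patterns with that package; the helper name `compiles` is made up for illustration. It checks whether a pattern is accepted and shows that a negated character class is a lookahead-free way to stop before `?` or a space.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package (RE2 syntax) accepts the pattern.
// The name is hypothetical; it is not part of Promtail.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Lookahead fragment taken from the regex above: rejected by RE2.
	fmt.Println(compiles(`(.+)(?=\?)`)) // false

	// Negated character class: stops before '?' or a space without lookahead.
	fmt.Println(compiles(`([^? ]+)`)) // true
}
```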

Any help would be greatly appreciated! I am far from pretending like I know my way around regex...

EDIT: Here is an example https://regex101.com/r/FXvnqR/1

wymangr
  • I'll not pretend I completely understand your use case, but how about throwing in a couple of negated character classes to make more explicit what you do not want. (I also shortened a little, but that's for my own benefit, mainly. ^^;) For example: `(?P<ip>(\d+(?:\.\d+)+))[^\"]*\"(?P<request>(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)).(?P<endpoint>([^\/]*\/endpoint1(?:\/health)?)|([^"? ]+))[^\"]*\".(?P<status>(\d{3}))`. – oriberu Mar 04 '22 at 21:52
  • Thanks for helping out! This works, all except for the `/endpoint1/UNIQUE_ID`. I need to have that be just `/endpoint`. Also, if it helps, that ID is a VIN number. – wymangr Mar 04 '22 at 21:58

1 Answer

EDIT

Try this!

(?P<ip>((?:[0-9]{1,3}\.){3}[0-9]{1,3})).+(?P<request>(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)).(?P<endpoint>(/endpoint[1-3]?(?:\/health|\/optimize)?))?.+\".(?P<status>([0-9]{3}))

https://regex101.com/r/DKqRpL/1

If there are going to be endpoints with numbers other than 1-3, or sub-routes other than health or optimize, this will need to be edited, but as of now this is your fix, bud.

bcstryker
  • That didn't work. I was still left with `/endpoint3?key=unique_value HTTP/1.` and `/endpoint2/optimize HTTP/1.`. And just to make sure I captured your comment correctly, this is the full updated regex: `(?P<ip>((?:[0-9]{1,3}\.){3}[0-9]{1,3})).+(?P<request>(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH)).(?P<endpoint>(.+endpoint1\/health)|(.+endpoint1)|(.+)).+\".(?P<status>([0-9]{3}))` – wymangr Mar 04 '22 at 20:53
  • Example: https://regex101.com/r/FXvnqR/1 – wymangr Mar 04 '22 at 21:12
  • Unfortunately that didn't work either :( https://regex101.com/r/6DU2eg/1 – wymangr Mar 04 '22 at 21:19
  • That gets me really close! The only issue that I see now is the `/endpoint1/UNIQUE-ID`. I just need to remove the `/UNIQUE-ID` from the group. Thank you for the help on this! – wymangr Mar 04 '22 at 21:25
  • Man, you are a magician at this! So the `UNIQUE-ID` is going to be a vehicle VIN number. – wymangr Mar 04 '22 at 21:36
  • That's what I was just looking into. They can start with numbers 0-9, lowercase letters or capital letters. They will be a mix of numbers/letters. And from what I've been told, they will only be either 17 or 10 characters in length! Thanks again for helping on this!! – wymangr Mar 04 '22 at 21:48
  • Yeah, I'm getting a headache trying to get this to work! I'm about to call it quits and forget about the endpoint tag altogether! I'm not a regex fan for sure! – wymangr Mar 04 '22 at 21:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242612/discussion-between-bcstryker-and-wymangr). – bcstryker Mar 04 '22 at 22:01