8

I am looking for a regex that matches a log line in HttpLogFormat and captures its fields. The log is generated by HAProxy. Below is a sample line in this format.

Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"

An explanation of the format is available at HttpLogFormat. Any help is appreciated.

I am trying to get the individual pieces of information included in that line. Here are the fields:

  1. process_name '[' pid ']:'
  2. client_ip ':' client_port
  3. '[' accept_date ']'
  4. frontend_name
  5. backend_name '/' server_name
  6. Tq '/' Tw '/' Tc '/' Tr '/' Tt*
  7. status_code
  8. bytes_read
  9. captured_request_cookie
  10. captured_response_cookie
  11. termination_state
  12. actconn '/' feconn '/' beconn '/' srv_conn '/' retries
  13. srv_queue '/' backend_queue
  14. '{' captured_request_headers* '}'
  15. '{' captured_response_headers* '}'
  16. '"' http_request '"'
Thimmayya
  • What are you trying to parse from this line? It's one thing to match it, it's another thing to get particular pieces of information from it. – eldarerathis Oct 29 '10 at 20:00
  • But what are you wanting to get from the line? – Keng Oct 29 '10 at 20:01
  • It really depends on what you want to match. All of the information, or only part of it? – jordanbtucker Oct 29 '10 at 20:01
  • sorry guys .. just took me a while to add those 16 lines. HAproxy generates the log in this format. I just want to parse the data efficiently. There is a detailed explanation of the format on the HttpLogFormat link posted in the question. – Thimmayya Oct 29 '10 at 20:07

5 Answers

5

Regex:

^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) (\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) \{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$

Results:

Group 1:    Feb 6 12:14:14
Group 2:    localhost
Group 3:    haproxy
Group 4:    14389
Group 5:    10.0.1.2
Group 6:    33317
Group 7:    06/Feb/2009:12:14:14.655
Group 8:    http-in
Group 9:    static
Group 10:   srv1
Group 11:   10/0/30/69/109
Group 12:   200
Group 13:   2750
Group 14:   -
Group 15:   -
Group 16:   ----
Group 17:   1/1/1/1/0
Group 18:   0/0
Group 19:   1wt.eu
Group 20:   
Group 21:   GET
Group 22:   /index.html
Group 23:   HTTP/1.1

I use RegexBuddy for composing complex regular expressions.
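As a rough sketch of how the pattern above could be used from code, here it is compiled with Python's `re` module and applied to the sample line from the question. The function name `parse_line` and the names in `FIELD_NAMES` are illustrative labels for the 23 groups, not anything defined by HAProxy, and the pattern has only been checked against that one sample.

```python
import re

# Pattern copied unchanged from the answer above, split over three
# string literals only for readability.
HAPROXY_HTTP_RE = re.compile(
    r'^(\w+ \d+ \S+) (\S+) (\S+)\[(\d+)\]: (\S+):(\d+) \[(\S+)\] (\S+) '
    r'(\S+)/(\S+) (\S+) (\S+) (\S+) *(\S+) (\S+) (\S+) (\S+) (\S+) '
    r'\{([^}]*)\} \{([^}]*)\} "(\S+) ([^"]+) (\S+)" *$'
)

# Illustrative labels for groups 1..23, in order (an assumption, not an official naming).
FIELD_NAMES = [
    "syslog_timestamp", "syslog_host", "process_name", "pid",
    "client_ip", "client_port", "accept_date", "frontend_name",
    "backend_name", "server_name", "timers", "status_code", "bytes_read",
    "captured_request_cookie", "captured_response_cookie",
    "termination_state", "conn_counts", "queue_counts",
    "captured_request_headers", "captured_response_headers",
    "http_method", "http_path", "http_version",
]

SAMPLE = (
    'Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
    '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)


def parse_line(line):
    """Return a dict of field name -> captured string, or None if no match."""
    match = HAPROXY_HTTP_RE.match(line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


record = parse_line(SAMPLE)
print(record["client_ip"], record["status_code"], record["http_path"])
```

Every value comes back as a string; combined fields such as `timers` can be split on `/` afterwards if the individual numbers are needed.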

Mike Clark
2

Use at your own peril.

This assumes that every field contains something, except for the ones you have marked with asterisks (is that what the asterisk means?). There are also obvious failure cases, such as nested brackets of any kind, but if the logger prints reasonably sane messages, then I guess you'd be okay...

Of course, even I personally wouldn't want to have to maintain this, but there you have it. You might want to consider writing a regular ol' parser for this instead, if you can.

Edit: Marked this as CW since it's more of an "I wonder how this will turn out" kind of answer than anything else. For quick reference, this is what I ended up constructing in rubular:

^[^[]+\s+(\w+)\[(\d+)\]:([^:]+):(\d+)\s+\[([^\]]+)\]\s+[^\s]+\s+(\w+)\/(\w+)\s+(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d*)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)\s+(\d+)\/(\d+)\s+\{([^}]*)\}\s\{([^}]*)\}\s+\"([^"]+)\"$

My first programming language was Perl, and even I'm willing to admit that I'm frightened by that.

eldarerathis
  • +1 just for putting that nasty thing out! I'll try it out and update how it goes. – Thimmayya Oct 29 '10 at 20:46
  • Thanks for the solution. It works okay for the most part. Mike's solution above works better and the regex is simpler and more flexible. I used rubular to tweak the regex and it is a nice tool. – Thimmayya Nov 04 '10 at 22:39
1

That looks like a very complicated string to match on. I would recommend using a tool like Expresso. Start with the string you are trying to match then start replacing pieces of it with Regex notation.

To grab individual pieces, use grouping parentheses.

The other option would be to make a regex for each piece you are trying to grab.
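A minimal sketch of that second option in Python, with one small pattern per piece. The three patterns and the names `CLIENT_RE`, `ACCEPT_DATE_RE`, `REQUEST_RE` and `extract` are illustrative and were only checked against the sample line from the question; real logs may need looser patterns.

```python
import re
from datetime import datetime

SAMPLE = (
    'Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
    '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)

# One pattern per piece; each uses grouping parentheses for the parts to keep.
CLIENT_RE = re.compile(r'\]: (\S+):(\d+) ')           # client_ip ':' client_port
ACCEPT_DATE_RE = re.compile(r' \[([^\]]+)\] ')        # '[' accept_date ']'
REQUEST_RE = re.compile(r'"(\S+) ([^"]+) (\S+)"\s*$')  # '"' http_request '"'


def extract(line):
    """Pull out only the client, accept date and request; None if any is missing."""
    client = CLIENT_RE.search(line)
    date = ACCEPT_DATE_RE.search(line)
    request = REQUEST_RE.search(line)
    if not (client and date and request):
        return None
    return {
        "client_ip": client.group(1),
        "client_port": int(client.group(2)),
        # %b parses the month abbreviation and depends on the current locale.
        "accept_date": datetime.strptime(date.group(1), "%d/%b/%Y:%H:%M:%S.%f"),
        "method": request.group(1),
        "path": request.group(2),
        "version": request.group(3),
    }


info = extract(SAMPLE)
print(info["client_ip"], info["accept_date"].isoformat(), info["path"])
```

Each pattern stays short enough to read and test on its own, at the cost of scanning the line once per pattern.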

Seattle Leonard
1

Why are you trying to match the line precisely? If you're looking for specific fields in it, better to specify which ones and extract them. If you want to run statistics on haproxy logs, you should take a look at the "halog" tool in the "contrib" directory in the sources. Take the one from version 1.4.9; it even knows how to sort URLs by response time.

But whatever you want to do with those lines, regex will probably always be the slowest and most complex solution.
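For illustration, here is a sketch of a regex-free approach in Python that relies only on string splitting. It is not how halog works; the function name `parse_http_log` and the dict keys are made up for this example. It assumes both `{...}` capture blocks are present as in the sample line from the question, and it has not been benchmarked against the regex solutions.

```python
SAMPLE = (
    'Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
    '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 '
    '- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} "GET /index.html HTTP/1.1"'
)


def parse_http_log(line):
    """Split one log line into fields without a regex; None if it looks malformed."""
    # The quoted request is the last field and contains spaces: cut it off first.
    head, sep, request = line.rstrip().partition(' "')
    if not sep or not request.endswith('"'):
        return None
    request = request[:-1]

    # Cut off the two {...} header captures next (assumed to be present).
    head, sep, headers = head.partition(" {")
    req_headers, sep2, resp_headers = headers.partition("} {")
    if not sep or not sep2 or not resp_headers.endswith("}"):
        return None
    resp_headers = resp_headers[:-1]

    # What remains is whitespace-separated; count from the end so that the
    # length of the syslog prefix does not matter.
    fields = head.split()
    if len(fields) < 13:
        return None
    (process, client, accept_date, frontend, backend_server, timers, status,
     bytes_read, req_cookie, resp_cookie, term_state, conns, queues) = fields[-13:]

    process_name, _, pid = process.rstrip(":").rstrip("]").partition("[")
    client_ip, _, client_port = client.rpartition(":")
    backend_name, _, server_name = backend_server.partition("/")
    return {
        "process_name": process_name, "pid": pid,
        "client_ip": client_ip, "client_port": client_port,
        "accept_date": accept_date.strip("[]"),
        "frontend_name": frontend,
        "backend_name": backend_name, "server_name": server_name,
        "timers": timers.split("/"),          # Tq, Tw, Tc, Tr, Tt as strings
        "status_code": status, "bytes_read": bytes_read,
        "captured_request_cookie": req_cookie,
        "captured_response_cookie": resp_cookie,
        "termination_state": term_state,
        "conn_counts": conns.split("/"), "queue_counts": queues.split("/"),
        "captured_request_headers": req_headers,
        "captured_response_headers": resp_headers,
        "http_request": request,
    }


parsed = parse_http_log(SAMPLE)
print(parsed["backend_name"], parsed["timers"], parsed["http_request"])
```

All values are left as strings, so nothing here has to guess how an unusual numeric field should be converted.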

Willy Tarreau
0

I don't think regex is your best option here...however, if it's your ONLY option...

Try looking at these options instead. https://serverfault.com/q/62687/438

Community
Keng