I'm trying to develop a Ruby regular expression to parse an HTTP response.

My question is: how can I get all the HTTP header names and values as captured groups? I can only capture the last one.

Here is my regex:

/^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>[^\r\n]+))\r?\n(?:(?<headername>[\w-]+):\s*(?<headervalue>[^\r\n]*)\r?\n)*\r?\n(?<body>[\s\S]*)/m

and my test data:

HTTP/1.1 200 No Content
Date: Mon, 19 Jul 2004 16:18:20 GMT
Server: Apache
Last-Modified: Sat, 10 Jul 2004 17:29:19 GMT
ETag: "1d0325-2470-40f0276f"
Accept-Ranges: bytes
Content-Length: 9328
Connection: close
Content-Type: text/html

<HTML>
    <HEAD>
    </HEAD>
    <BODY>
    </BODY>
</HTML>

This regex parses the status line and the body properly, but only the last header is captured. I would like to have all of them.

l4t3b0
  • Repeating a capture group does not give you a dynamic number of capture groups; only the last iteration is kept. You can capture all the headers in one group and then split it on newlines: `^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>.*))\r?\n(?<headers>(?:[\w-]+:.*\r?\n)*)\r?\n(?<body>[\s\S]*)` See https://regex101.com/r/KaZo55/1 – The fourth bird Mar 28 '23 at 12:53
  • You might get separate named capture groups by using a conditional that checks for the existence of another named group. In the code you can then check whether the body group is set; it is matched after the last consecutive header match: `(?:^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>.+))|\G(?!^))\r?\n(?<headername>[\w-]+):\s*(?<headervalue>.*)(?('headername')\r?\n\r?\n(?<body>[\s\S]*))?` See https://rubular.com/r/uC70UYCjl54JLC – The fourth bird Mar 28 '23 at 13:08

1 Answer

Ruby supports the \G anchor to get consecutive matches, and also supports conditionals that test whether a group has matched.

If you don't turn on the /m flag, the dot does not match newlines, so you can match the rest of a line with .+ instead of [^\r\n]+

(?:^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>.+))|\G(?!^))\r?\n(?<headername>[\w-]+):[ \t]*(?<headervalue>.+)(?('headername')\r?\n\r?\n(?<body>[\s\S]*))?

See a Rubular demo
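
To show how the consecutive matches could be collected in code, here is a minimal sketch using String#scan. The method name scan_response and the shortened sample response are made up for illustration, and the sample uses plain "\n" line endings. The pattern uses [ \t]* for the optional whitespace after the colon, because \h in Ruby matches a hex digit rather than horizontal whitespace.

```ruby
# Pattern using \G so that each scan step matches one more header.
HEADER_RE = /(?:^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>.+))|\G(?!^))\r?\n(?<headername>[\w-]+):[ \t]*(?<headervalue>.+)(?('headername')\r?\n\r?\n(?<body>[\s\S]*))?/

# Hypothetical helper: walks the consecutive matches and collects the pieces.
def scan_response(raw)
  result = { status: nil, headers: [], body: nil }
  raw.scan(HEADER_RE) do
    m = Regexp.last_match
    # The status line groups are only set on the first match.
    result[:status] = m[:statuscode].to_i if m[:statuscode]
    result[:headers] << [m[:headername], m[:headervalue]]
    # The body group is only set on the match for the last header.
    result[:body] = m[:body] if m[:body]
  end
  result
end

sample = "HTTP/1.1 200 OK\nServer: Apache\nContent-Type: text/html\n\n<HTML></HTML>\n"
parsed = scan_response(sample)
parsed[:headers].each { |name, value| puts "#{name} => #{value}" }
```

Each match carries exactly one header, so the headers are kept as an array of pairs; that also preserves repeated header names.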

Another option is to capture all the headers in a single capture group, and then post-process that group by splitting it on newlines.

^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>.+))\r?\n(?<headers>(?:[\w-]+:.+\r?\n)*)\r?\n(?<body>[\s\S]*)

See another Rubular demo.
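
The post-processing step could look roughly like this. It is a sketch, not a full HTTP parser: the method name parse_response and the shortened sample response are assumptions for illustration, and storing the headers in a Hash means a repeated header name keeps only its last value.

```ruby
# Pattern that captures the whole header block in one group.
RESPONSE_RE = /^(?<statusline>HTTP\/(?<protocolversion>\d\.\d) (?<statuscode>\d+) (?<reasonphrase>.+))\r?\n(?<headers>(?:[\w-]+:.+\r?\n)*)\r?\n(?<body>[\s\S]*)/

# Hypothetical helper: matches once, then splits the header block by line.
def parse_response(raw)
  m = RESPONSE_RE.match(raw)
  return nil unless m

  headers = m[:headers].lines.map do |line|
    # Split on the first colon only, so values containing ':' stay intact.
    name, value = line.chomp.split(":", 2)
    [name, value.strip]
  end.to_h

  {
    status: m[:statuscode].to_i,
    reason: m[:reasonphrase].strip, # strip drops a trailing "\r" on CRLF input
    headers: headers,
    body: m[:body]
  }
end

sample = "HTTP/1.1 200 OK\nServer: Apache\nContent-Type: text/html\n\n<HTML></HTML>\n"
parsed = parse_response(sample)
puts parsed[:headers]["Content-Type"]
```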

The fourth bird