Using a PCRE regex, I want to capture each field of various Apache web logs. The logs are structured like this example:

aaa bbb "cc c" ddd "eee" fff

Fields are separated by a space. A field may itself contain spaces, in which case it is enclosed in double quotes ("cc c"). Fields that contain no spaces may also be enclosed in double quotes ("eee").

The result should have one capture group per field, so for the example that would be: Group 1: aaa, Group 2: bbb, Group 3: "cc c", Group 4: ddd, Group 5: "eee", Group 6: fff

My problem is that I want a one-size-fits-all solution, e.g. with a quantifier, something like this: (?:((aa|bb|"cc"|dd)\s){1,})

But here the quantifier always repeats at aaa.
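One likely reason a quantified group does not yield one group per field is that a repeated capture group keeps only the text of its last iteration. The following is a minimal sketch of that behaviour, written in Python rather than a PCRE tool (an assumption made so it can be run standalone). The pattern and the names `repeated` and `last_only` are illustrative and are not the pattern quoted above.

```python
import re

line = 'aaa bbb "cc c" ddd "eee" fff'

# A single capture group repeated by a quantifier (illustrative pattern).
# The regex consumes the whole line, but group 1 is overwritten on every
# iteration, so only the last field survives.
repeated = re.compile(r'(?:("[^"]*"|\S+)\s*)+')

m = repeated.fullmatch(line)
last_only = m.group(1)

print(len(m.groups()))  # number of capture groups in the pattern
print(last_only)        # text kept by the repeated group
```

The number of groups is fixed by the pattern itself, not by how often the quantifier repeats, so a fixed set of numbered groups needs the field sub-pattern written out once per field.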

A tidy, working solution would be much appreciated.

    Is it possible to approach the problem using a parser (in your case it would be space-delimited values) instead of regular expressions? – rink.attendant.6 Jul 15 '15 at 15:42

1 Answer

I understand you're using PCRE; the question is which tool you are actually using to process the regex.

Assuming you use Perl itself, let's look at what a field is made up of:

  1. An optional opening double quote "
  2. One or more characters that are not a double quote
  3. An optional closing double quote "

As a regex, the above looks like this:

"?[^"]+"?

You can then, optionally, quantify the above and specify how many columns you have:

("?[^"]+"?){1,6}

The above allows 1 to 6 such fields. The question then becomes how to apply the regex. That depends on the tool; in Perl it could look like:

my @groups = $apache_line =~ m/("?[^"]+"?)/g;

From here, the intent is that $groups[0] holds aaa, $groups[1] holds bbb, ... and $groups[5] holds fff.

This relies on the m// operator with the /g flag being used in list context, where it returns all matches as a list.
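Note that [^"]+ also matches spaces, so on the sample line the pattern above may not split where expected. The following is a minimal sketch comparing it with one possible alternative that treats a field as either a double-quoted run or a run of non-whitespace characters. It is written in Python rather than Perl (an assumption made so it can be run standalone; the regex syntax used here is the same in PCRE), and the names `FIELD`, `original` and `fields` are illustrative.

```python
import re

line = 'aaa bbb "cc c" ddd "eee" fff'

# The pattern from the answer, applied globally.
# [^"]+ is free to consume spaces, so matches run up to the next quote.
original = re.findall(r'("?[^"]+"?)', line)

# Alternative sketch: a field is either "..." (quotes kept) or a run of
# non-whitespace characters. Whitespace between fields is simply skipped.
FIELD = re.compile(r'"[^"]*"|\S+')
fields = FIELD.findall(line)

print(original)
print(fields)
```

The alternative does not handle escaped quotes inside a quoted field; whether that matters depends on the log format.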

lzc
  • Thanks for your answer. I am trying to apply the regex in a tool called Splunk. It requires a PCRE, and then you can assign the groups to fields, e.g. USER_IP=$1, APACHE_REQUEST=$2 etc. So recall is not the problem. But regarding your first regex part "?[^"]+"? - note that some logs may not contain a double quote at all, especially if they have only simple entries with no spaces. Also, I don't see how you address the spaces as delimiting characters. Can you please just state a complete regex solving the problem? Thanks so much. – Jonas Wagner Jul 16 '15 at 09:05
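
For a tool that assigns numbered groups to fields, as described in the comment above, one option is to write the field sub-pattern once per field, separated by whitespace. The following is a minimal sketch under stated assumptions: the field count of 6 comes from the example line, the names `FIELD`, `N_FIELDS` and `pattern` are hypothetical, and Python is used only to build and check the expression. How the resulting regex is configured in Splunk is not shown here.

```python
import re

# One field: a double-quoted run (quotes kept) or a run of non-whitespace.
FIELD = r'("[^"]*"|\S+)'

# Assumed number of fields, taken from the example line in the question.
N_FIELDS = 6

# Repeat the field sub-pattern, joined by whitespace, anchored at both ends.
pattern = re.compile(r'^' + r'\s+'.join([FIELD] * N_FIELDS) + r'$')

m = pattern.match('aaa bbb "cc c" ddd "eee" fff')
groups = m.groups()

print(pattern.pattern)  # the expanded regex with six numbered groups
print(groups)
```

Because every field gets its own parentheses, group 1 to group 6 are all populated. A line with a different number of fields does not match, so each log format would need its own field count.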