
I'm trying to parse a web log with regular expressions using Hive's RegexSerDe. It maps each regex capture group to a column in a table; if a group is empty, that column is set to null.

I'm having trouble matching log rows with missing fields. There are two kinds of rows in this log:

<134>2016-10-23T23:59:59Z cache-iad2134 fastly[502801]: 52.55.94.131 "-" "-" Sun, 23 Oct 2016 23:59:59 GMT GET /apps/events/2016/10/11/3062653/?REC_ID=3062653&id=0 200

<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199

The regex below matches the first kind of row, where all fields are present:

^(\\S+) (\\S+) (\\S+) (\\S+) "(\\S+)" "(\\S+)" (.*) (\\d{3})

I tried using ? to make the fields after the first 4 optional, but every attempt left the columns misaligned.

Any suggestions on how to add the ? without changing the number of groups (so that the deserializer doesn't choke)? Or is there another way to do this that you would suggest?

dtolnay
Mete Kural
  • Since you haven't shown the regexp with the optional modifiers, how are we supposed to tell you what you did wrong? The only thing I can think of is that you forgot to make the spaces between the fields optional as well. – Barmar Oct 28 '16 at 01:27
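
To illustrate the comment's guess: the sketch below is a hypothetical attempt (the asker's actual regex was never posted) in which each trailing group is made optional but the literal spaces and quotes between them are still required. It uses Python's `re` module as a stand-in for the regex engine, with single backslashes because the pattern is a raw string.

```python
import re

short_row = "<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199"

# Hypothetical attempt: the groups are optional, but the separators
# (spaces and double quotes) between them are still mandatory.
attempt = re.compile(r'^(\S+) (\S+) (\S+) (\S+) "(\S+)?" "(\S+)?" (.*)? (\d{3})?')

# The short row ends right after the fourth field, so the required
# space-and-quote sequence that follows cannot be found.
print(attempt.match(short_row))
```

The short row does not match at all here, because nothing in the row can satisfy the mandatory separators after the fourth field.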

1 Answer


Put a non-capturing group around all the fields after the first 4, and make it optional.

^(\\S+) (\\S+) (\\S+) (\\S+)(?: "(\\S+)" "(\\S+)" (.*) (\\d{3}))?

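Putting ?: at the beginning of a group makes it non-capturing, so the wrapper itself doesn't add to the number of captured groups: the regex still has 8. When a row has only the first 4 fields, the optional group is skipped and groups 5 through 8 are left unmatched.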
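
As a quick check of this behavior, here is a minimal sketch that runs the pattern against both sample rows from the question. It uses Python's `re` module rather than Hive itself, and it assumes the whole line must match (hence `fullmatch`); the names `PATTERN`, `full_row`, and `short_row` are just for illustration. The backslashes are single here because the pattern is a Python raw string, not a quoted table property.

```python
import re

# Same pattern as above, written with single backslashes in a raw string.
PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) (\S+)(?: "(\S+)" "(\S+)" (.*) (\d{3}))?'
)

full_row = ('<134>2016-10-23T23:59:59Z cache-iad2134 fastly[502801]: '
            '52.55.94.131 "-" "-" Sun, 23 Oct 2016 23:59:59 GMT '
            'GET /apps/events/2016/10/11/3062653/?REC_ID=3062653&id=0 200')
short_row = '<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199'

# The non-capturing wrapper is not counted: there are still 8 groups.
print(PATTERN.groups)

for row in (full_row, short_row):
    m = PATTERN.fullmatch(row)
    # groups() always returns 8 entries; groups that did not
    # participate in the match come back as None.
    print(m.groups())
```

For the short row, the last four entries are `None`, which is the Python counterpart of the null columns the question asks for.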

Barmar