python fix multi line log entries

Question

I need to fix some multi-line log entries. I currently do this in Perl, but I need to move the functionality to Python.

Example multi-line entry:

2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944       10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET        http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (
This is the IgnitionOne Company Bot for Web Crawling.
IgnitionOne Company Site: http://www.example.com/
  ;
 rong2 dot huang at ignitionone dot com
  )" - -

The current Perl script that fixes these is:

while (my $row = <$fh>) {
  chomp $row;
  # A line that starts with a timestamp begins a new entry
  if ( $row =~ /^(\d{4})-(\d\d)-(\d\d)T(\d)/ ) {
    print "\n" if $. != 1;
  }
  print $row;
}

which outputs the corrected single-line entry:

2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944  10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET   http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (    This is the IgnitionOne Company Bot for Web Crawling.    IgnitionOne Company Site: http://www.example.com/  ;    rong2 dot huang at ignitionone dot com  )" - -

So in a nutshell, a line that begins with the date regex starts a new entry; any line that doesn't match is appended to the entry before it without a \n.

I've seen other ways to accomplish this with awk etc., but this needs to be pure Python. I've looked at the question "Python. Join specific lines on 1 line"; it looks like itertools might be the preferred way to go about this?

score 1 · Accepted Answer · answered Dec 05 '15 at 15:09

You can achieve this in Python with the re module, using a regex based on a negative lookahead: it removes every newline that is not immediately followed by a timestamp.

>>> import re
>>> s = '''2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944       10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET        http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (
This is the IgnitionOne Company Bot for Web Crawling.
IgnitionOne Company Site: http://www.example.com/
  ;
 rong2 dot huang at ignitionone dot com
  )" - -'''
>>> re.sub(r'\n(?!\d{4}-\d{2}-\d{2}T\d)', '', s)
'2015-12-02T17:56:13.783276Z our-elb-prod 52.20.50.51:60944       10.30.0.32:80 0.000024 0.063357 0.000066 200 200 0 12164 "GET        http://www.example.com:80/episodes/2014/10/ HTTP/1.0" "IgnitionOneBot/Nutch-1.9 (This is the IgnitionOne Company Bot for Web Crawling.IgnitionOne Company Site: http://www.example.com/  ; rong2 dot huang at ignitionone dot com  )" - -'

i.e., applied to a file:

import re
with open(file) as f:  # file: path to the log file
    fil = f.read()
    # Remove every newline that is not followed by a timestamp
    print(re.sub(r'\n(?!\d{4}-\d{2}-\d{2}T\d)', '', fil))
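
If the log is too large to read into memory in one go, the same idea can be applied line by line. The following is only a sketch, not part of the original answer: the names ENTRY_START and join_entries and the sample data are made up for illustration, and it assumes every real entry starts with the same timestamp pattern used above.

```python
import io
import re

# Same timestamp test as the regex above: a new entry starts with YYYY-MM-DDTd.
ENTRY_START = re.compile(r'\d{4}-\d{2}-\d{2}T\d')

def join_entries(lines):
    """Yield one string per log entry, gluing continuation lines onto it."""
    current = None
    for line in lines:
        line = line.rstrip('\n')
        if ENTRY_START.match(line):
            # A timestamp starts a new entry, so emit the finished one first.
            if current is not None:
                yield current
            current = line
        elif current is None:
            # Text before the first timestamped line: start an entry with it.
            current = line
        else:
            # Continuation line: append it with no separator, like the re.sub call.
            current += line
    if current is not None:
        yield current

# Hypothetical sample input; an open file object can be passed the same way.
sample = io.StringIO(
    '2015-12-02T17:56:13.783276Z first "Bot (\n'
    'more text\n'
    '  )" - -\n'
    '2015-12-02T17:56:14.000000Z second - -\n'
)
fixed = list(join_entries(sample))
for entry in fixed:
    print(entry)
```

With a real file, something like `for entry in join_entries(f): print(entry)` inside the `with open(...)` block keeps only one entry in memory at a time.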

Thanks Avinash! Works great for what I was looking for. – user129545, Dec 05 '15 at 19:20
