Python newb here, please excuse the dumb question. I am trying to extract log data from a group of gzipped tar files. Each log entry spans multiple lines, so I extract each file from its compressed tar archive and read it as a single object. This is my regex and the code that applies it:

    import re
    import tarfile

    first_match = re.compile(r"(?P<date>\d{4}[-]?\d{1,2}[-]?\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}).*?http://servername:99999/chargeit.*?manage_event=first.*?\bwantThisUser=([^&]*).*?\b_operator=(\w+).*?request:.*?Want-To-Have-This:\s\*123\*0#")

    linecount = 0
    tfile = tarfile.open("logfile-year-month-day.number.log.tar.gz", "r")
    for member in tfile.getmembers():
        # Read the whole member at once so multi-line entries stay together
        f = tfile.extractfile(member).read()
        f = str(f)
        for match in first_match.finditer(f):
            linecount = linecount + 1
            print(linecount, match.group(1), match.group(2), match.group(3))
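
A side note on the reading step, as a hedged sketch rather than a diagnosis: on Python 3, `extractfile(...).read()` returns `bytes`, and calling `str()` on `bytes` gives its repr (`b'...'`) with every newline turned into a literal backslash-n, so the text no longer contains real line breaks. Decoding keeps them. The helper names `build_sample_archive` and `read_members`, the in-memory archive, and the UTF-8 encoding are assumptions made only so the example is self-contained.

```python
import io
import tarfile

def build_sample_archive(text):
    """Build a small in-memory .tar.gz holding one log file (hypothetical helper)."""
    data = text.encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="sample.log")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def read_members(fileobj, encoding="utf-8"):
    """Return (name, text) pairs for each regular file in a gzipped tar archive."""
    results = []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # extractfile() returns None for directories
            raw = tar.extractfile(member).read()                 # bytes
            results.append((member.name, raw.decode(encoding)))  # decoded text
    return results

sample = "line one\nline two\n"
name, text = read_members(build_sample_archive(sample))[0]
repr_text = str(sample.encode("utf-8"))   # what str() on bytes produces

print(text.count("\n"))       # 2: real newlines survive decoding
print(repr_text.count("\n"))  # 0: the repr holds backslash-n pairs instead
```

With decoded text, `.` stops at line breaks unless `re.DOTALL` is set, which changes how far a `.*?` can reach.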

I am trying to match the timestamp, and two other groups in the log file. Log data looks somewhat like this, if printed line by line:

2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
 HEADERS:
  this-is-a-header: 200
  Want-To-Have-This: *123*200#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

2016-12-16 20:26:29 DEBUG[ispatcher-12563] this.is.the.api.Api - http://servername:99999/chargeit?session_id=a5e456ad2f5645c39a580463630cd3db&manage_event=first&wantThisUser=4119023107960&_source=operator2 1021c087-1918-40a3-a7c1-4b7c37690471 request:
 HEADERS:
  this-is-a-header: 1000*0111111111
  Want-To-Have-This: *123*1000*0111111111#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

I am expecting to catch this:

    2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#

And the groups I'm hoping to capture are the timestamp (2016-12-16 20:43:47), the value of wantThisUser= (4119185011005) and the value of _operator= (operator4).

Instead, the regex captures the target entry together with the one(s) above it:

2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
 HEADERS:
  this-is-a-header: 200
  Want-To-Have-This: *123*200#
  Host: servername:99999
  Accept: */*
  User-Agent: AHC/2.0
  Timeout-Access: <function1>
 CONTENT:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#

It also pulls the timestamp and the other two groups from the entry (or entries) above the desired match. How do I restrict the match to its own log entry? Or am I approaching this the wrong way?

Unpossible
  • There's no need to apologize for your question :) – blubberdiblub Jan 23 '17 at 10:53
  • I would probably tackle this problem step by step, at multiple levels of your data, rather than all with one regex. First I would split the log data into records / log entries. Then I would apply a regex to the first line of each record to extract the time stamp and the URI as a whole. Then I would use a library to parse the URI and its query arguments into a dictionary, and access wantThisUser and _operator by indexing into that dictionary. – blubberdiblub Jan 23 '17 at 11:01
  • The problem is how the log lines are ordered. The 'Want-To-Have-This:(.*)' line is what I am looking for, and it has several string forms that I separate data by. Once I have that line, I need the groups in the URI to tell me when the operation occurred. That's why I'm trying to collect it in one piece. – Unpossible Jan 23 '17 at 13:25
  • Well, with record / log entry I meant the part from the time stamp up to the next blank line. That means each record would contain all the pieces of data you are interested in and it would not contain the same data from the next record, so there would be no danger that you'd accidentally capture the "Want-To-Have-This" of the next record. – blubberdiblub Jan 23 '17 at 13:42
  • Ok, I guess I could find the matches for the entire log entry, and then look for "sub-matches" inside each main match using the other regexes I need? From the log I am pretty much certain that the timestamp starts an entry and the string "CONTENT:" ends it. – Unpossible Jan 23 '17 at 15:02
  • Is there never anything after the line with "CONTENT:"? Judging from the name "CONTENT", I would expect there to be some content ;) Well, at least sometimes. – blubberdiblub Jan 23 '17 at 18:15
  • :) Sometimes there is. But the lines I am looking for should have nothing after CONTENT:. – Unpossible Jan 23 '17 at 18:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/133838/discussion-between-sina-and-blubberdiblub). – Unpossible Jan 23 '17 at 20:16
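
The step-by-step approach suggested in the comments could be sketched roughly as follows. This is an illustration under assumptions, not the commenter's actual code: entries are assumed to be separated by blank lines (which would not hold if a CONTENT: body itself contains blank lines), and the names `FIRST_LINE` and `parse_records` are made up for the example. The query string is parsed with the standard library's `urllib.parse`.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Timestamp at the start of the first line, then the first URL on that line.
FIRST_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?(?P<url>http://\S+)"
)

def parse_records(text):
    """Yield one dict per log entry; entries are assumed to be blank-line separated."""
    for record in re.split(r"\n\s*\n", text.strip()):
        lines = record.splitlines()
        m = FIRST_LINE.match(lines[0])
        if m is None:
            continue  # first line is not a timestamped request line
        query = parse_qs(urlsplit(m.group("url")).query)
        # Header lines look like "  Name: value"; split on the first ": " only.
        headers = dict(
            line.strip().split(": ", 1) for line in lines[1:] if ": " in line
        )
        yield {"date": m.group("date"), "query": query, "headers": headers}

sample = """\
2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
 HEADERS:
  this-is-a-header: 200
  Want-To-Have-This: *123*200#
 CONTENT:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#
 CONTENT:
"""

# Keep only the entries whose header has the wanted value.
hits = [
    r for r in parse_records(sample)
    if r["headers"].get("Want-To-Have-This") == "*123*0#"
]
for r in hits:
    user = r["query"]["wantThisUser"][0]
    operator = r["query"].get("_operator", [None])[0]
    print(r["date"], user, operator)
# prints: 2016-12-16 20:43:47 4119185011005 operator4
```

Because each record is handled on its own, a header value can never be paired with the timestamp of a neighbouring entry.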

1 Answer

Thanks, @blubberdiblub! You helped me narrow my block-matching regex down to

    first_match = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.*?(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)", re.DOTALL|re.MULTILINE)

which splits the log into one chunk per entry, from one timestamp up to the next timestamp or the end of the text. These chunks are much more manageable to parse. Everything's working much better now.

Unpossible
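
For later readers, here is a minimal sketch of how the block regex from this answer might be combined with a second, per-chunk pattern. The `block` pattern is the one quoted in the answer; the `detail` pattern and its group names are assumptions for illustration, adapted from the regex in the question, and the text is assumed to be decoded so that it contains real newlines.

```python
import re

# Block pattern from the answer: one timestamp up to the next timestamp or end of text.
block = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.*?"
    r"(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical per-chunk pattern; it only ever sees a single entry.
detail = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"\bwantThisUser=(?P<user>[^&\s]*).*?"
    r"\b_operator=(?P<operator>\w+).*?"
    r"Want-To-Have-This:\s\*123\*0#",
    re.DOTALL,
)

sample = """\
2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
 HEADERS:
  this-is-a-header: 200
  Want-To-Have-This: *123*200#
 CONTENT:

2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
 HEADERS:
  this-is-a-header: 0
  Want-To-Have-This: *123*0#
 CONTENT:
"""

# Stage 1: cut the text into one chunk per log entry.
chunks = [m.group(0) for m in block.finditer(sample)]

# Stage 2: apply the detail pattern to each chunk separately.
hits = []
for chunk in chunks:
    m = detail.match(chunk)
    if m:
        hits.append((m.group("date"), m.group("user"), m.group("operator")))

print(len(chunks))  # 2
print(hits)         # [('2016-12-16 20:43:47', '4119185011005', 'operator4')]
```

Since the lazy `.*?` parts of `detail` can only roam inside one chunk, they cannot pick up the timestamp or query values of an earlier entry.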