Filtering Log File with RegEx

Question

Hi, I can't work out how to extract the date and PID from a log file. I'm trying to display the date followed by the PID, as shown below, but only the date is shown, never the PID.

Please see my code:

import re

def show_time_of_pid(line):

  pattern = r"^([\w+]*[\s\d\:]+.[\[(\d+)\]])"
  result = re.search(pattern, line)

  return result

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)")) # Jul 6 14:01:23 pid:29440
<re.Match object; span=(0, 14), match='Jul 6 14:01:23'>

I was expecting Jul 6 14:01:23 pid:29440

Instead I get <re.Match object; span=(0, 14), match='Jul 6 14:01:23'>, with no PID displayed.

You might want to spend some time with https://regex101.com/ (make sure you select "Python" from the list on the left). I'm pretty sure that bare `.` in the middle of your expression is a typo and you probably meant something like `.*`, but that's just the first thing that jumps out. — larsks, Aug 06 '23 at 00:46
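For what it's worth, here is a small sketch of why the pattern from the question stops at the time. The trailing `[\[(\d+)\]]` is a character class rather than a group, so it matches exactly one character. The engine backtracks until that one character can be a digit of the time, and the match ends there. The variable names below are only for illustration.

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# The pattern from the question, unchanged.
original = r"^([\w+]*[\s\d\:]+.[\[(\d+)\]])"
match = re.search(original, line)
print(match.group(1))  # Jul 6 14:01:23

# The trailing [...] is a character class, so it matches exactly one
# character out of: [ ( digit + ) ]
tail = r"[\[(\d+)\]]"
print(bool(re.fullmatch(tail, "3")))        # True: a single digit
print(bool(re.fullmatch(tail, "[29440]")))  # False: more than one character
```

To capture the PID, the square brackets have to be escaped literals outside any character class, with the capture group between them, as in `\[(\d+)\]`.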

score 1 · Accepted Answer · answered Aug 06 '23 at 00:56

I would probably write things like this:

import re

def show_time_of_pid(line):

    pattern = r"^(\w{3}) \s (\d+) \s ([\d:]+) \s .[^[]+\[(\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)

    return result.groups()

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))

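Using re.VERBOSE makes the pattern a little easier to read: whitespace inside the pattern is ignored, so the pieces can be spaced apart, and any real whitespace in the input has to be matched explicitly with \s. Here we have several distinct match groups: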

(\w{3}) matches the month name
(\d+) matches the day of the month
([\d:]+) matches the time
[^[]+\[(\d+)] matches the PID ("a bunch of characters that are not [, followed by [, then a string of digits, then ]")

Each group is separated by whitespace (\s).

Running the above code produces:

('Jul', '6', '14:01:23', '29440')
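Because re.VERBOSE also allows comments inside the pattern, the same expression could be spread over several lines, one piece per line. This is only a sketch of that layout; the name `LOG_PATTERN` is made up for the example, and the regex itself is the one used above minus the trailing `.*`.

```python
import re

# Same pattern as above, laid out one piece per line.
LOG_PATTERN = re.compile(
    r"""
    ^(\w{3})        # month name
    \s (\d+)        # day of the month
    \s ([\d:]+)     # time
    \s .[^[]+       # host and process name, everything before the bracket
    \[ (\d+) ] :    # PID between square brackets, then a colon
    """,
    flags=re.VERBOSE,
)

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
groups = LOG_PATTERN.search(line).groups()
print(groups)  # ('Jul', '6', '14:01:23', '29440')
```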

You could get fancier with an outer capture group by writing:

import re

def show_time_of_pid(line):

    pattern = r"^((\w{3}) \s (\d+) \s ([\d:]+)) \s .[^[]+\[(\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)

    return result.groups()

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))

We get the entire date string in the first capture group:

('Jul 6 14:01:23', 'Jul', '6', '14:01:23', '29440')

And of course we can get back a labeled dictionary instead of just a tuple by using named capture groups:

import re

def show_time_of_pid(line):

    pattern = r"^(?P<timestamp>(?P<month>\w{3}) \s (?P<day>\d+) \s ([\d:]+)) \s .[^[]+\[(?P<pid>\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)

    return result.groupdict()

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))

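Which produces the following (the time group is left unnamed in this pattern, so it does not appear in the dictionary):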

{'timestamp': 'Jul 6 14:01:23', 'month': 'Jul', 'day': '6', 'pid': '29440'}
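To get the exact string the question asked for, the named groups can be fed straight into a format string. This is a minimal sketch that keeps only the timestamp and pid groups. Returning None when a line does not match is an assumption on my part about what the caller wants, and `LOG_LINE` is just an illustrative name.

```python
import re

# Only the two named groups needed for the output string.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \s \d+ \s [\d:]+) \s .[^[]+\[(?P<pid>\d+)]",
    flags=re.VERBOSE,
)

def show_time_of_pid(line):
    match = LOG_LINE.search(line)
    if match is None:
        # Assumption: a line without a timestamp and PID yields None.
        return None
    return "{timestamp} pid:{pid}".format(**match.groupdict())

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# Jul 6 14:01:23 pid:29440
print(show_time_of_pid("not a log line"))
# None
```

Checking for None first avoids the AttributeError that calling .groupdict() on a failed search would raise.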

Is that a question? Are you suggesting an edit? I am unclear. It's not really possible to put code in comments. — larsks, Aug 06 '23 at 01:28

score 0 · Answer 2 · edited Aug 06 '23 at 01:38

Hey, can someone tell me if this is an acceptable workaround for my problem? It seemed to work! Thank you for your replies too, I appreciate it. It's hard to get your head around this stuff!

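import re

def show_time_of_pid(line):
    # Date and time: leading word characters, then whitespace, digits and colons.
    # Note: this group also captures the space that follows the time.
    pattern1 = r"^([\w]*[\s\d:]*)"
    # PID: the digits between square brackets.
    pattern2 = r"\[(\d+)\]"
    result = re.search(pattern1, line)
    result2 = re.search(pattern2, line)

    return "{} pid:{}".format(result[1], result2[1])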
