I am trying to extract and preprocess log data for a use case.

The log consists of problem IDs, each followed by the information that belongs to it. Each entry looks like this:

#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change

#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change

I extracted the identification numbers and saved them as a .csv file:

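import re

# read() returns one string; readlines() would return a list, which re.findall() cannot search
with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.read()

number = re.findall('#!#!#identification_number###(.+?)#!#!#change_log###', change_log)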

What I want now is to append, for every ID in the .csv file, the corresponding log content, which is:

action
change
#!#!#attribute###

Since I am rather new to Python and only started working with regex a few days ago, I was hoping for some help.

Each log for an ID starts with "#!#!#identification_number###" and ends with "#!#!#attribute5###<entry>".

I have tried the following code, but the result is empty:

In:
x = re.findall("\[^#!#!#identification_number###((.|\n)*)#!#!#attribute5###((.|\n)*)$]", str(change_log))

In: 
print(x)

Out:
[]
Audiogott

2 Answers

Try this:

pattern='entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'

re.findall(pattern, string+'#!#!#id', re.DOTALL)

The re.DOTALL flag makes the dot match newlines as well, so the second capturing group should contain the log for each ID.
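To see what the flag changes, here is a small sketch run on a shortened two-record sample. The sample text is made up in the question's format and is not the real log; the pattern is the one from above.

```python
import re

# Two shortened records in the question's format (illustrative data only).
sample = (
    "#!#!#identification_number###96245#!#!#change_log###\n"
    "action\n"
    "change\n"
    "#!#!#attribute###value_change\n"
    "\n"
    "#!#!#identification_number###96246#!#!#change_log###\n"
    "action1\n"
    "#!#!#attribute###status_change\n"
)

pattern = r'entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'

# Without DOTALL, "." stops at a newline, so a multi-line log body cannot match.
without_flag = re.findall(pattern, sample + '#!#!#id')

# With DOTALL, "." also matches "\n", so each body runs up to the next ID header.
with_flag = re.findall(pattern, sample + '#!#!#id', re.DOTALL)

print(without_flag)
print([id_no for id_no, _ in with_flag])
```

The appended '#!#!#id' acts as a sentinel so that the last record also has something to stop at.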

To get the attributes for each identification number, parse each log (the second group from the search above) with the following:

# the lookahead (?=#!#) leaves the delimiter unconsumed, so the next attribute can still match
pattern='#!#!#attribute(.*?)###(.*?)(?=#!#)'

re.findall(pattern, string_for_each_log_match+'#!#', re.DOTALL)
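Putting the two steps together, a minimal sketch of the whole extraction could look like the following. It uses a line-based split for the attributes instead of the second regex, and the names `record_re` and `parse_log`, the CSV column layout, and the ";" separator are all assumptions made for illustration. The sample is the data from the question.

```python
import csv
import io
import re

# The two records shown in the question.
sample = """#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change

#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change
"""

# One match per record: the ID, then everything up to the next ID header or end of text.
record_re = re.compile(
    r'#!#!#identification_number###(.+?)#!#!#change_log###(.*?)'
    r'(?=#!#!#identification_number###|\Z)',
    re.DOTALL,
)

def parse_log(text):
    """Return {id: {"actions": [...], "attributes": {name: value}}}."""
    records = {}
    for id_no, body in record_re.findall(text):
        actions, attributes = [], {}
        for line in body.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith('#!#!#'):
                # Attribute lines look like "#!#!#<name>###<value>".
                name, _, value = line[len('#!#!#'):].partition('###')
                attributes[name] = value
            else:
                actions.append(line)
        records[id_no] = {"actions": actions, "attributes": attributes}
    return records

records = parse_log(sample)

# Write one CSV row per ID (to an in-memory buffer here; a real file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "actions", "attributes"])
for id_no, rec in records.items():
    attrs = ";".join("{}={}".format(k, v) for k, v in rec["attributes"].items())
    writer.writerow([id_no, ";".join(rec["actions"]), attrs])

print(buffer.getvalue())
```

Splitting on lines avoids having to worry about where one attribute ends and the next begins, at the cost of assuming that every attribute fits on a single line.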

Tomas G.
  • Hello Thomas, thank you for your reply. I must admit I was a bit lazy when I formulated the content of the question. When I apply your code, it ignores the attributes for each identification number keeping only the actions and changes. But I need the attributes as well. – Audiogott Sep 24 '19 at 09:46
  • OK I changed the edit. It is kind of messy but should get the attributes – Tomas G. Sep 25 '19 at 11:47

If you insert each ID into the regex with str.format(), you can find the line on which that ID's change log starts.

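import re

with open(r'path\to\csv.csv', 'r') as f:
    # strip the trailing newline from each ID, otherwise the regex below never matches
    ids = [line.strip() for line in f if line.strip()]

with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.readlines()

matches = {}
for id_no in ids:
    # build the pattern once per ID; re.escape() guards against special characters
    reg = '#!#!#identification_number###{}#!#!#change_log###'.format(re.escape(id_no))
    for i, line in enumerate(change_log):
        if re.search(reg, line):
            matches[id_no] = i  # index of the line where this ID's log starts
            break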

This will create a dictionary with the structure {id_no:line_no,...}. So once you have all of the lines that tell you where each log starts, you can grab the lines you want that come after these lines.
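As a rough sketch of that last step, the lines after each start index can be collected until the next ID header appears. The helper name `lines_after`, the shortened sample, and the hard-coded `matches` values are assumptions for illustration; in practice `matches` and `change_log` would come from the code above.

```python
HEADER = '#!#!#identification_number###'

# Shortened stand-in for the result of f.readlines() (illustrative data only).
change_log = [
    "#!#!#identification_number###96245#!#!#change_log###\n",
    "action\n",
    "#!#!#attribute###value_change\n",
    "\n",
    "#!#!#identification_number###96246#!#!#change_log###\n",
    "change\n",
    "#!#!#attribute###status_change\n",
]

# What the matching loop would produce for this sample: {id_no: line_no}.
matches = {"96245": 0, "96246": 4}

def lines_after(start, lines):
    """Collect the non-empty lines after index `start`, up to the next ID header."""
    block = []
    for line in lines[start + 1:]:
        if line.startswith(HEADER):
            break
        if line.strip():
            block.append(line.strip())
    return block

logs = {id_no: lines_after(i, change_log) for id_no, i in matches.items()}
print(logs)
```

Stopping at the next header rather than at the next entry in `matches` keeps this working even when the CSV lists only some of the IDs in the log.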

mattrea6
  • `"{}".format(x)` will take the string and put x inside of the braces. `"Hello {}!".format("Audio")` would print `"Hello Audio!"`. It works with more braces too. `"{}{}{}".format(x,y,z)` works. – mattrea6 Sep 24 '19 at 12:20
  • Hello mattrea, I finally got around to trying your method. Unfortunately, the `matches` dictionary is returned empty. :/ – Audiogott Oct 04 '19 at 14:48