2

I am trying to extract data from a change log using regex. Here is an example of how the change log is structured:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091
this is some changes in the ticket
some new version: z.z.22
another change
another change
another change
new version: z.y.2.2
120092
...
...
...
  • Each data point starts with an ID of 5 or 6 digits.
  • Each ID is followed by a variable number of changes (lines) in the log.
  • Each data point ends with new version: ***, where *** is a string that varies for every ID.

I was using the Regex Storm tester to test my regex.

So far I have ^\w{5,6}(.|\n)*?\d{5,6}; however, the result includes the ID of the next ticket, which I need to avoid.

Result:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
120091 

Expected Result:

96545
this is some changes in the ticket
some new version: x.x.22
another change
new version: x.y.2.2
Audiogott
  • `(?sm)^\d{5,6}.*?(?=^\d{5}|\Z)`? Why do you use Regexstorm if this regex is going to be used in Python? Use https://regex101.com/r/i53Qzw/1 to test the regex. – Wiktor Stribiżew Oct 04 '19 at 17:39
  • Also, why didn't you use `\nnew version:` as the right hand delimiter since you mention it in the pattern requirements? Try `(?sm)^\d{5,6}.*?\nnew version:[^\r\n]*` – Wiktor Stribiżew Oct 04 '19 at 18:12
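
The lookahead idea from the first comment can be tried out with a minimal sketch like the one below. The sample string is copied from the question (the trailing "..." lines are left out), and the variable names `log` and `tickets` are only illustrative.

```python
import re

# Sample log copied from the question (trailing "..." lines omitted).
log = (
    "96545\n"
    "this is some changes in the ticket\n"
    "some new version: x.x.22\n"
    "another change\n"
    "new version: x.y.2.2\n"
    "120091\n"
    "this is some changes in the ticket\n"
    "some new version: z.z.22\n"
    "another change\n"
    "another change\n"
    "another change\n"
    "new version: z.y.2.2\n"
)

# (?s) lets "." match newlines; (?m) lets "^" match at every line start.
# The lookahead (?=...) only checks that the next ID (or the end of the
# input) follows; it does not consume it, so the next ID stays out of the match.
pattern = re.compile(r"(?sm)^\d{5,6}.*?(?=^\d{5}|\Z)")

# Each match keeps its trailing line break, so strip it for display.
tickets = [m.group(0).rstrip("\n") for m in pattern.finditer(log)]

for ticket in tickets:
    print(ticket)
    print("---")
```

Because the lookahead is zero-width, the search for the second ticket starts exactly at its ID, which is what the original attempt was missing.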

4 Answers

2

Captures each record's ID in group 1 and its content in group 2:

r'(?ms)^(\d{5,6}\r?\n)(.*?)^new version:'

https://regex101.com/r/A3ejjN/1
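
As a rough sketch of how the two groups could be used from Python, assuming the sample log from the question (the names `log` and `records` are illustrative only):

```python
import re

# Sample log copied from the question (trailing "..." lines omitted).
log = (
    "96545\n"
    "this is some changes in the ticket\n"
    "some new version: x.x.22\n"
    "another change\n"
    "new version: x.y.2.2\n"
    "120091\n"
    "this is some changes in the ticket\n"
    "some new version: z.z.22\n"
    "another change\n"
    "another change\n"
    "another change\n"
    "new version: z.y.2.2\n"
)

pattern = re.compile(r"(?ms)^(\d{5,6}\r?\n)(.*?)^new version:")

records = []
for m in pattern.finditer(log):
    ticket_id = m.group(1).strip()     # group 1 includes the line break after the ID
    changes = m.group(2).splitlines()  # lines between the ID and the "new version:" line
    records.append((ticket_id, changes))

for ticket_id, changes in records:
    print(ticket_id, len(changes))
```

Note that the match stops right after the literal `new version:`, so the version string itself is not part of either group.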

1

This would do it:

^\d{5,6}[\r\n]*.*?^new version:[^\r\n]*

Just make sure to enable both the MULTILINE and DOTALL flags by passing re.MULTILINE | re.DOTALL.

https://regex101.com/r/YeIUQx/1
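
To see why both flags matter, here is a small sketch that runs the same pattern with and without DOTALL. The sample log is copied from the question, and the variable names are illustrative.

```python
import re

# Sample log copied from the question (trailing "..." lines omitted).
log = (
    "96545\n"
    "this is some changes in the ticket\n"
    "some new version: x.x.22\n"
    "another change\n"
    "new version: x.y.2.2\n"
    "120091\n"
    "this is some changes in the ticket\n"
    "some new version: z.z.22\n"
    "another change\n"
    "another change\n"
    "another change\n"
    "new version: z.y.2.2\n"
)

pattern = r"^\d{5,6}[\r\n]*.*?^new version:[^\r\n]*"

# MULTILINE: "^" matches at the start of every line.
# DOTALL: "." also matches line breaks, so ".*?" can span several lines.
both = re.findall(pattern, log, re.MULTILINE | re.DOTALL)

# Without DOTALL, ".*?" cannot cross a line break, so nothing matches here.
multiline_only = re.findall(pattern, log, re.MULTILINE)

print(len(both))
print(len(multiline_only))
```

With both flags set, each element of `both` runs from the ID to the end of its `new version:` line.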

MonkeyZeus
  • I think `/^\d{5,6}[\r\n]*.*?^new version:[^\r\n]*/gms` is rather a confusing way of sharing Python regex. You can't use regex literals in Python. Besides, it does not support the global modifier, there are specific `re` methods that handle that part. – Wiktor Stribiżew Oct 04 '19 at 19:10
  • @WiktorStribiżew I can see your point, I've edited that part out. – MonkeyZeus Oct 04 '19 at 19:29
1

Your regular expression is close. Its issue is that it "ends" at the start of the next log entry: it uses \d{5,6} to mark the end of an entry and consumes that ID in the process. As Wiktor mentioned, it makes more sense to use "new version" as the delimiter, so I've done that here.

found_matches = re.findall(r"(^\d{5,6}[\s\S]*?^new version: .*$)", log_file_content, re.MULTILINE)

The regex (^\d{5,6}[\s\S]*?^new version: .*$) searches for 5 or 6 digits at the start of a line, and then takes any character (including newlines, thanks to [\s\S]) up until the first instance of new version: that appears at the start of a line. It then reads to the end of that line to finish the group. Since ^ and $ have to match at the start and end of individual lines, be sure to pass the re.MULTILINE argument!

Test the regex here, and the full Python code here.

Nick Reed
1

If the problem is that you capture the ID of the next ticket, just use a positive lookahead to match it without capturing or consuming it:

import re

# `text` is assumed to hold the full contents of the change log.
# A ticket ends at the line break before the line that holds the next ticket's ID;
# \Z lets the last ticket, which has no following ID, match as well.
pattern = r"\d{5,6}[\s\S]*?(?=\n\d{5,6}|\Z)"

# to extract only the first ticket, use search
print(re.search(pattern, text).group(0))

# to extract all tickets as a list, use findall
print(re.findall(pattern, text))

# if the file is too big, iterate over the tickets lazily with finditer
for ticket in re.finditer(pattern, text):
    print(ticket.group(0))
Wiktor Stribiżew
Charif DZ
  • Hello Charif, thank you very much, that helps a lot. However, I am now facing another issue. I am trying to append each log to its corresponding ticket using pandas; the tickets are in a `.csv` file: `In: ids.head(2) | Out: {'Index': [0, 1], 'id': [96545, 120091]}...`. By looping through the rows, I tried appending each ticket found by the regex to a cell next to the id. However, I am getting the same regex text in each row: `search = [] for values in ids['id']: search.append(re.findall(pattern, text)) ids['tickethist'] = search`. How can I append each found regex element? – Audiogott Oct 05 '19 at 14:04
  • You can edit your question and add the code so that we can understand what the problem is – Charif DZ Oct 05 '19 at 15:08
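
Regarding the follow-up in the comments: the loop there calls `re.findall(pattern, text)` once per ID, so every row receives the complete list of tickets. One possible way around this, sketched below under stated assumptions, is to build a dictionary keyed by ticket ID in a single pass and then look up each ID. The pattern is a variant of the lookahead pattern with a capture group for the ID and a `\Z` alternative for the last ticket (both are assumptions of this sketch), and the plain list `ids` is a hypothetical stand-in for the `id` column of the CSV.

```python
import re

# Sample log copied from the question (trailing "..." lines omitted).
text = (
    "96545\n"
    "this is some changes in the ticket\n"
    "some new version: x.x.22\n"
    "another change\n"
    "new version: x.y.2.2\n"
    "120091\n"
    "this is some changes in the ticket\n"
    "some new version: z.z.22\n"
    "another change\n"
    "another change\n"
    "another change\n"
    "new version: z.y.2.2\n"
)

# Hypothetical stand-in for the `id` column read from the CSV file.
ids = [96545, 120091]

# Group 1 captures the ID on its own line; the lookahead stops the match
# before the next ID line, or at the end of the input for the last ticket.
pattern = re.compile(r"(?m)^(\d{5,6})$[\s\S]*?(?=^\d{5,6}$|\Z)")

# One pass over the log: map each ID to the full text of its ticket.
by_id = {int(m.group(1)): m.group(0).strip() for m in pattern.finditer(text)}

# Look up each ID; IDs that are missing from the log give None.
tickethist = [by_id.get(ticket_id) for ticket_id in ids]

for ticket_id, history in zip(ids, tickethist):
    print(ticket_id, repr(history))
```

With a pandas DataFrame, the same dictionary could presumably be applied with `ids['tickethist'] = ids['id'].map(by_id)`, assuming the `id` column holds integers.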