
Hi, I have a script that goes through a text file of numbers and applies a series of regexes. The regexes work, except that a few values are not returned in full. For instance, numbers such as 11111-C00 or 22222-X01 are returned only as 11111 and 22222, without the "-" and what comes after it. I also have a few cases that end in the format: number, letter number. These two regexes aren't giving me the desired outcome: \d{4,5}-\w{1}\d{2} and \d{4}-\w\d{1}\w

Full Code:

import re

filename = 'Text.txt'
pattern = '\d{4,5}-\d{2,3}|\d{4,9}|\w{3}\d-\d{2}|\d{4,5}-\w{1}\d{2}|\b|\d{4}-\w\d{1}\w'
new_file = []

# Read every line of the input file
with open(filename, 'r') as f:
    lines = f.readlines()

# Keep the first match found on each line
for line in lines:
    match = re.search(pattern, line)
    if match:
        new_line = match.group() + '\n'
        print(new_line)
        new_file.append(new_line)

# Write the matches to the output file
with open('NewText.txt', 'w') as f:
    f.writelines(new_file)

So all of my regexes work fine except the last two (\d{4,5}-\w{1}\d{2} and \d{4}-\w\d{1}\w). For patterns such as XXXXX-LXX and XXXXX-LXL, where X is a digit and L is a letter, only XXXX or XXXXX is returned. Where am I going wrong?

lucyb
You are searching for `\d{4,9}` before `\d{4,5}-\w{1}\d{2}` and `\d{4}-\w\d{1}\w`. Check whether switching the order to `\d{4,5}-\d{2,3}|\w{3}\d-\d{2}|\d{4,5}-\w{1}\d{2}|\b|\d{4}-\w\d{1}\w|\d{4,9}` solves your issue. – Tomasz Plaskota Jun 09 '16 at 14:13

1 Answer


It matches 11111 because, in your alternation, the branch \d{4,9} comes before the branches that include the hyphen, so it matches first. Change the order to:

\d{4,5}-\d{2,3}|\w{3}\d-\d{2}|\d{4,5}-\w{1}\d{2}|\b|\d{4}-\w\d{1}\w|\d{4,9}
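A minimal Python sketch of the difference the reordering makes, using the sample values from the question. The helper name `first_match` and the sample `4444-A1B` are made up for illustration. The lone `\b` branch is left out as an assumption of this sketch: in a non-raw Python string `'\b'` is a backspace character, while in a raw string it would match an empty word boundary ahead of the later branches.

```python
import re

# Order from the question: the bare-digit branch \d{4,9} comes early.
# (The lone \b branch is omitted in this sketch; see the note above.)
original = r'\d{4,5}-\d{2,3}|\d{4,9}|\w{3}\d-\d{2}|\d{4,5}-\w{1}\d{2}|\d{4}-\w\d{1}\w'

# Same branches, with \d{4,9} moved to the end.
reordered = r'\d{4,5}-\d{2,3}|\w{3}\d-\d{2}|\d{4,5}-\w{1}\d{2}|\d{4}-\w\d{1}\w|\d{4,9}'


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else None


# '4444-A1B' is a hypothetical sample for the digit-letter-digit-letter case.
for sample in ['11111-C00', '22222-X01', '4444-A1B', '123456']:
    print(sample, '->', first_match(original, sample), '|', first_match(reordered, sample))

# 11111-C00 -> 11111 | 11111-C00
# 22222-X01 -> 22222 | 22222-X01
# 4444-A1B -> 4444 | 4444-A1B
# 123456 -> 123456 | 123456
```

Plain numbers such as 123456 still match with either order, because the more specific branches fail on them and \d{4,9} is tried last.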


You can think of the alternation as follows:

Input = 11111-C00
Regex = \d{4,5}-\d{2,3}|\w{3}\d-\d{2}|\d{4,9}|\d{4,5}-\w{1}\d{2}|\b|\d{4}-\w\d{1}\w

Does Input match \d{4,5}-\d{2,3}? No. Then,
Does Input match \w{3}\d-\d{2}? No. Then,
Does Input match \d{4,9}? Yes. Match found, stop looking.
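The walk-through above can be sketched in Python by trying each branch in turn at the start of the input and stopping at the first one that succeeds. This is only a simplified model of what the engine does at a single start position, not how `re` is implemented; the helper name `winning_branch` is hypothetical, and the lone `\b` branch is left out for clarity.

```python
import re

# Branches in the order used in the walk-through (without the lone \b branch).
walkthrough_order = [
    r'\d{4,5}-\d{2,3}',
    r'\w{3}\d-\d{2}',
    r'\d{4,9}',
    r'\d{4,5}-\w{1}\d{2}',
    r'\d{4}-\w\d{1}\w',
]

# Same branches with \d{4,9} moved to the end.
fixed_order = [b for b in walkthrough_order if b != r'\d{4,9}'] + [r'\d{4,9}']


def winning_branch(branches, text):
    """Try each branch at the start of text; return (branch, matched text) for the first hit."""
    for branch in branches:
        match = re.match(branch, text)  # re.match anchors at position 0
        if match:
            return branch, match.group()
    return None


print(winning_branch(walkthrough_order, '11111-C00'))  # ('\\d{4,9}', '11111')
print(winning_branch(fixed_order, '11111-C00'))        # ('\\d{4,5}-\\w{1}\\d{2}', '11111-C00')
```

Once a branch succeeds, the later ones are never consulted, which is why the more specific hyphenated branches need to come before \d{4,9}.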
Thomas Ayoub