Regex match specific word in string but exclude indexed versions

Question

I'm sure a solution for this exists somewhere, but I can't find it. I followed "Python regex to match a specific word" and had success with the first part, but I'm now struggling with the second.

I've inherited a horrible file format where each test result is on its own line. Records are limited to 12 characters, so some results are split across groups of lines, e.g. SITE, SITE1 and SITE2. I'm trying to parse the file into a dictionary so I can do more analysis with it and ultimately produce a formatted report.

The link above and the code below let me match each SITE and concatenate them, but I'm having problems matching INS, INS 1 and INS 2 correctly. Yes, the space is intentional; it's what I have to deal with. INS is the test result and INS 1 is the limit the test must meet for a pass.

Is there a regular expression for which

the heading SITE matches a SITE line but not a SITE1 line

and

the heading INS matches an INS line but not an INS 1 line?

Here is the Python code.

import re

lines = ['SITE start', 'SITE1 more', 'SITE2 end', 'INS value1', 'INS 1 value2']
headings = ['SITE', 'SITE1', 'SITE2', 'INS', 'INS 1']
for line in lines:
    for heading in headings:
        # Whole-word match of the heading anywhere in the line
        headregex = r"\b" + heading + r"\b"
        match = re.search(headregex, line)
        if match:
            print("Found " + heading + " " + line)
        else:
            print("Not Found " + heading + " " + line)

And here is some dummy data:

TEST MODE 131 AUTO SITE startaddy SITE1 middle addy SITE2 end addy USER DB VISUAL CHECK P BOND RANGE 25A EARTH 0.09 OHM P LIMIT 0.10 OHM INS 500 V INS 1 >299 MEG P ... TEST MODE 231 AUTO SITE startaddy SITE1 middle addy SITE2 end addy USER DB VISUAL CHECK P INS 500 V INS 2 >299 MEG P ...

Sorry for the horrid formatting. It's copied and pasted from what I am dealing with!

Why are you using `re.escape` and `\b`s together? What can `headings` contain? Can they start / end with a non-word char? — Wiktor Stribiżew, Nov 30 '17 at 09:14
re.escape and \b - lack of experience! From the 24 or so sample records I have it looks like they all start with letters and no spaces etc but lots of other whitespace stuff occurs later in the line. — Byte Insight, Nov 30 '17 at 09:18
Can you give an actual example as well? From the descriptions you give I am not sure what conditions exactly need to be met. — Arne, Nov 30 '17 at 09:19
Arne, if you can run the code then Site should match to Site but not either Site1 or Site2. Ins should match to Ins but not Ins 1. — Byte Insight, Nov 30 '17 at 09:22
Well, it is a bit unclear, maybe you want to sort the headings by length first? See https://ideone.com/mCpZvX — Wiktor Stribiżew, Nov 30 '17 at 09:22
Thanks Wiktor. It's not the easiest problem to explain. In your example the last match is the problem: "Found INS INS 1 value2". INS should not be matching against INS 1. — Byte Insight, Nov 30 '17 at 09:25
You can only exclude that match by adding `(?! 1\b)` lookahead after `INS`, see https://ideone.com/90TJE3. You seem to want to check if there is a match for all headings, not just the first found, and that makes it rather difficult. — Wiktor Stribiżew, Nov 30 '17 at 09:29
So, what you are trying to do is joining all `SITE` fields and all `INS` fields together? — Arne, Nov 30 '17 at 09:42
I don't speak python, so could someone please explain to me why all headings are found in all lines in @WiktorStribiżew 's example(s)? — SamWhan, Nov 30 '17 at 09:50
@ArneRecknagel. No, I'm trying to extract the key and the value from the raw data, but they are delimited by spaces, and spaces also occur within the key and the value. It's a badly designed file that I have no control over! — Byte Insight, Nov 30 '17 at 11:16
@WiktorStribiżew Yes! I think that is the answer. Will double check with full data and get back to you. — Byte Insight, Nov 30 '17 at 11:18
@WiktorStribiżew. OK, so very close: it works for INS 1, but I have just discovered INS 2, so is there a way to look ahead for two different options? — Byte Insight, Nov 30 '17 at 11:45
Yes, `(?! \d)` or `(?!\s*\d)` if there should be no digits at all. — Wiktor Stribiżew, Nov 30 '17 at 11:59
@WiktorStribiżew. Yes, the first option works perfectly. Would you like to write up the answer and I'll accept? — Byte Insight, Nov 30 '17 at 12:01

score 1 · Accepted Answer · answered Nov 30 '17 at 12:05

The problem is that the INS pattern also finds a partial match inside INS 1, INS 2, etc.

When extracting alternatives, it is customary to use an alternation that starts with the longest value (like INS \d+|INS), but in this case you want a list of all regex matches, excluding only some overlapping heading matches.

To achieve that, treat all items in headings as regular expressions and define the INS pattern as INS(?! \d), which ensures INS is not matched when it is followed by a space and a digit.
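As a quick check of that lookahead, the sketch below compares `INS(?! \d)` with the stricter `(?!\s*\d)` variant mentioned in the comments. The helper name `matches_plain_ins` and the padded sample string are made up for illustration; they are not taken from the real file.

```python
import re

# Rejects INS only when it is followed by exactly one space and then a digit.
ins_space_digit = re.compile(r"\bINS(?! \d)\b")
# Rejects INS when any amount of whitespace (including none) precedes a digit.
ins_any_ws_digit = re.compile(r"\bINS(?!\s*\d)\b")


def matches_plain_ins(pattern, text):
    """Return True if the compiled pattern finds a plain INS heading in text."""
    return pattern.search(text) is not None


# Hypothetical lines: the last one pads the heading with several spaces.
samples = ["INS value1", "INS 1 value2", "INS 2 value3", "INS   500 V"]

for text in samples:
    print(text, matches_plain_ins(ins_space_digit, text),
          matches_plain_ins(ins_any_ws_digit, text))
```

The two variants only disagree when more than one whitespace character separates INS from a digit, so which one fits depends on how the values are padded in the actual data.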

See the Python demo:

import re

lines = ['SITE start', 'SITE1 more', 'SITE2 end', 'INS value1', 'INS 1 value2']
# Each heading is used as a regex; INS must not be followed by a space and a digit
headings = ['SITE', 'SITE1', 'SITE2', r"INS(?! \d)", 'INS 1']
headings = sorted(headings, key=len, reverse=True)
for line in lines:
    print("----")
    for heading in headings:
        headregex = r"\b{}\b".format(heading)
        match = re.search(headregex, line)
        if match:
            print("Found " + heading + " " + line)
        else:
            print("Not Found " + heading + " " + line)
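If the goal is simply to assign each line to exactly one heading, the longest-first alternation mentioned earlier may be enough on its own. The following is a minimal sketch of that different technique, not the lookahead approach above. It assumes every record starts with its heading followed by whitespace; `parse_lines` and the extra `INS 2` heading are illustrative names, not part of the original code.

```python
import re

headings = ["SITE", "SITE1", "SITE2", "INS", "INS 1", "INS 2"]

# Try longer headings first so "INS 1" wins over "INS" and "SITE1" over "SITE".
alternation = "|".join(
    re.escape(h) for h in sorted(headings, key=len, reverse=True)
)
# The heading must start the line and be followed by whitespace or end of line.
line_re = re.compile(r"^({})(?:\s+(.*))?$".format(alternation))


def parse_lines(lines):
    """Return {heading: value} for every line that starts with a known heading."""
    record = {}
    for line in lines:
        match = line_re.match(line)
        if match:
            record[match.group(1)] = (match.group(2) or "").strip()
    return record


lines = ["SITE start", "SITE1 more", "SITE2 end", "INS value1", "INS 1 value2"]
record = parse_lines(lines)
print(record)
```

Because the heading has to be followed by whitespace or the end of the line, a value such as `500 V` after a single space is still assigned to `INS` rather than being mistaken for an indexed heading.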

Arne · Answer 2 · 2017-11-30T13:00:46.233

Just to give an answer that might solve the problem while avoiding some of the tediousness, is this what you are trying to achieve?

import re

lines = ['SITE start', 'SITE1 more', 'SITE2 end','INS value1', 'INS 1 value2']
headings = ['SITE','SITE1',"SITE2", "INS", "INS 1"]

headings_re = re.compile(r"(SITE\d?)?(INS( \d)?)? (.*)")
# built by hand; only works if SITE and INS are the literal identifiers
# group 1 = SITE heading, group 2 = INS heading, group 4 = value

site = []
ins = []

for line in lines:
  match = headings_re.match(line)
  if match:
    if match.group(1):
      site.append(match.group(4))
    elif match.group(2):
      ins.append(match.group(4))
    else:
      print("something weird happened")
      print(match.group(0))
  else:
    print("something weird happened")
    print(line)

print("SITE: {}".format(" ".join(site)))
# SITE: start more end
print("INS: {}".format(" ".join(ins)))
# INS: value1 value2

No, this doesn't help me, sorry. I'll edit the original question to show some dummy data. — Byte Insight, Nov 30 '17 at 11:46
