
I have the following regex: `(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))`

Input text examples:

  1. RE:11567 Miss Jane Doe 12345678
  2. Reference: Miss Jane Doe 12345678
  3. RE:J123 Miss Jane Doe 12345678
  4. RE:J123 Miss Jane Doe 12345678 Reference: Test Company

Sample Code:

import re

# Raw string so that \w and \s reach the regex engine as written
pattern = re.compile(r'(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))')
result = pattern.findall('RE:11693 Miss Jane Doe 12345678')

For all 4 examples I expect the output `('Miss Jane Doe', 'Miss', 'Jane', 'Doe')`. However, for the 4th example I get `[('Miss Jane Doe', 'Miss', 'Jane', 'Doe'), (' Test Company', '', 'Test', 'Company')]`.

How can I get only the first result for the 4th example?

West

1 Answer


Add `^` to the start of the regex so that it only matches at the start of the string. This makes it `^(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))`.
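A quick check of the anchored pattern against the four inputs from the question. This is only a sketch: the names `ANCHORED`, `samples` and `results` are made up for illustration, and it assumes each input is a single-line string matched without `re.MULTILINE` (with that flag, `^` would also match at the start of every line).

```python
import re

# Anchored version of the pattern from the question (raw string, leading ^)
ANCHORED = re.compile(
    r'^(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))'
)

# The four example inputs from the question
samples = [
    'RE:11567 Miss Jane Doe 12345678',
    'Reference: Miss Jane Doe 12345678',
    'RE:J123 Miss Jane Doe 12345678',
    'RE:J123 Miss Jane Doe 12345678 Reference: Test Company',
]

# findall returns a list with one tuple of capture groups per match
results = [ANCHORED.findall(s) for s in samples]

for sample, result in zip(samples, results):
    print(sample, '->', result)
```

Each input should yield a single tuple, because the trailing `Reference: Test Company` in the 4th input does not start at position 0 and so can no longer match.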

Gamma032
  • This gives an error on the 4th example: `AttributeError: 'NoneType' object has no attribute 'groups'` – West Dec 21 '22 at 03:39
  • I actually can't reproduce your issue to begin with. What version of Python are you on? It works for me on 3.10.9. – Gamma032 Dec 21 '22 at 03:50
  • I'm on Python 3.8 – West Dec 21 '22 at 03:51
  • And sorry, I've updated my question; it was supposed to be `findall` instead of `search` – West Dec 21 '22 at 03:55
  • Could you clarify what your goal is here? `.search()` works here because it returns only the first match, scanning from the start of the string. If you want the first match inside a larger string, you could do `pattern.findall('RE:J123 Miss Jane Doe 12345678 Reference: Test Company')[0]`. – Gamma032 Dec 21 '22 at 04:06
  • I'm using the `invoice2data` library and I believe it's using `findall`. I only provide the library with the regex string; there is no option for specifying whether to use `search` or `findall`. I suspected it uses `findall` because of the output from the 4th example – West Dec 21 '22 at 04:21
  • So `findall` ignores non-capturing groups? – West Dec 21 '22 at 04:22
  • `search()` finds the first instance, whereas `findall` returns all instances. But if we add the `^` to the `findall` regex, we'll only get the result that begins at the start of the string. So if `invoice2data` truly does use `findall` ([which I think it does](https://github.com/invoice-x/invoice2data/blob/1adb49cb74b17e1dd278886ee4c432ed2dcaf443/src/invoice2data/extract/parsers/regex.py#L33)), that should work. – Gamma032 Dec 21 '22 at 04:37
  • That worked, thanks. I might have to dig into `invoice2data` to see how it works, because even though the regex works when I test it in Jupyter, the library is not returning results for the 4th example. – West Dec 21 '22 at 05:53
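For anyone following the comment thread, the difference between `search` and `findall` on the 4th input can be sketched as below. The variable names are illustrative only, and nothing here reflects how `invoice2data` calls the regex internally. On the non-capturing group question: when a pattern has several capturing groups, `findall` returns one tuple of those groups per match, so `(?:...)` still takes part in matching but never appears in the result.

```python
import re

# Unanchored pattern from the question, as a raw string
PATTERN = r'(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))'
text = 'RE:J123 Miss Jane Doe 12345678 Reference: Test Company'

# search() stops at the first match anywhere in the string
first = re.search(PATTERN, text)
first_groups = first.groups()

# findall() keeps scanning and also picks up "Reference: Test Company"
all_matches = re.findall(PATTERN, text)

# Prefixing ^ restricts findall() to a match starting at position 0
anchored = re.findall('^' + PATTERN, text)

print(first_groups)
print(all_matches)
print(anchored)
```

If the regex string is the only thing that can be changed, the `^` anchor is the option that works without touching the calling code; taking `all_matches[0]` needs control over the caller.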