When a regex gets too complex because it tries to match too many cases, it often helps to support it with an algorithmic approach.
Here, the phone number in its different formats (with a dot or a dash as separator) is a good candidate for a regex.
But once you add the markers (in variations such as prefix and suffix, or upper case and lower case), the regex gets more and more complex.
Line breaks in a multi-line text can also be hard for a regex to cover.
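To illustrate the part that a regex handles well, here is a minimal sketch. The pattern and the sample strings (including the `Fax:` line) are assumptions made for illustration only. It matches numbers with either separator, but it cannot tell a mobile number from any other number, which is what the markers are for.

```python
import re

# Assumed pattern for illustration: 3-3-4 digits separated by a dot or a dash.
PHONE_PATTERN = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')

# Hypothetical sample lines; the "Fax:" line is not a mobile number.
samples = ['M: 360.751.0001', '360-751-0012 Mobile', 'Fax: 360.751.9999']

# The regex extracts the number from every line, marker or not.
matches = [PHONE_PATTERN.findall(sample) for sample in samples]
print(matches)
# [['360.751.0001'], ['360-751-0012'], ['360.751.9999']]
```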
Use the markers for location, the regex for extraction
When we scan a text for phone numbers, we can use the markers (as in your case) to locate them first. In the next step we parse the phone number with a regex. The phone number is located after the marker if the marker is used as a prefix, and before it if the marker is used as a suffix.
See the following approach:
- a set of marker strings used to narrow down the location
- a regex for the phone-number to extract it
import re
text = '''M: 360.751.0001
M 360.751.0002
(M): 360.751.0003
(M):360.751.0004
M:360.751.0005
(Mobile): 360.751.0006
(Mobile):360.751.0007
(Mobile) 360.751.0008
Mobile: 360.751.0009
Mobile:360.751.0010
Mobile 360.751.0011
360.751.0012 Mobile
360.751.0013 (M)
360.751.0014 M'''
email_signature = '''
Jane Doe
Ocean Export Agent
Some Company, Inc.
Celebrating 100 years!
p:
410-123-3 001 m: 410-123-0002
a:
111 Cromwell Park Drive, Glen Burnie, MD 21061
w:
website.com e: janedoe@somecompany.com
'''
# leading space for a suffix, trailing space for a prefix
phone_markers = {' M', 'M ', 'M:', '(M)', ' Mobile', 'Mobile ', 'Mobile:', '(Mobile)'}
def find_phone_numbers_marked(text, phone_markers):
    phone_numbers = []
    for line in text.split('\n'):
        found_marked = [(marker, line.find(marker)) for marker in
                        phone_markers if line.find(marker) >= 0]
        for marker, position in found_marked:
            if marker.startswith(' '):
                text_marked = line[:position]  # text before marker
            else:
                text_marked = line[position:]  # text after marker
            found_numbers = re.findall(r'\d{1,3}[.-]\d{1,3}[.-]\d{1,4}', text_marked)
            print(line, f"Marker '{marker}' at position {position}, found number:", found_numbers)
            phone_numbers.extend(found_numbers)
    return phone_numbers
total_lines = len(text.split('\n'))
print(f"== Searching in {total_lines} lines ..")
result = find_phone_numbers_marked(text, phone_markers)
print(f"== Found {len(result)} numbers:", result)
total_lines = len(email_signature.split('\n'))
print(f"== Searching in {total_lines} lines ..")
phone_markers.update([m.lower() for m in phone_markers]) # also include lowercase version of markers
result = find_phone_numbers_marked(email_signature, phone_markers)
print("== Found:", result)
Output:
== Searching in 14 lines ..
M: 360.751.0001 Marker 'M:' at position 0, found number: ['360.751.0001']
M 360.751.0002 Marker 'M ' at position 0, found number: ['360.751.0002']
(M): 360.751.0003 Marker '(M)' at position 0, found number: ['360.751.0003']
(M):360.751.0004 Marker '(M)' at position 0, found number: ['360.751.0004']
M:360.751.0005 Marker 'M:' at position 0, found number: ['360.751.0005']
(Mobile): 360.751.0006 Marker '(Mobile)' at position 0, found number: ['360.751.0006']
(Mobile):360.751.0007 Marker '(Mobile)' at position 0, found number: ['360.751.0007']
(Mobile) 360.751.0008 Marker '(Mobile)' at position 0, found number: ['360.751.0008']
Mobile: 360.751.0009 Marker 'Mobile:' at position 0, found number: ['360.751.0009']
Mobile:360.751.0010 Marker 'Mobile:' at position 0, found number: ['360.751.0010']
Mobile 360.751.0011 Marker 'Mobile ' at position 0, found number: ['360.751.0011']
360.751.0012 Mobile Marker ' Mobile' at position 12, found number: ['360.751.0012']
360.751.0012 Mobile Marker ' M' at position 12, found number: ['360.751.0012']
360.751.0013 (M) Marker '(M)' at position 13, found number: []
360.751.0014 M Marker ' M' at position 12, found number: ['360.751.0014']
== Found 14 numbers: ['360.751.0001', '360.751.0002', '360.751.0003', '360.751.0004', '360.751.0005', '360.751.0006', '360.751.0007', '360.751.0008', '360.751.0009', '360.751.0010', '360.751.0011', '360.751.0012', '360.751.0012', '360.751.0014']
== Searching in 12 lines ..
410-123-3 001 m: 410-123-0002 Marker 'm:' at position 15, found number: ['410-123-0002']
410-123-3 001 m: 410-123-0002 Marker ' m' at position 14, found number: ['410-123-3']
111 Cromwell Park Drive, Glen Burnie, MD 21061 Marker ' M' at position 37, found number: []
website.com e: janedoe@somecompany.com Marker 'm ' at position 10, found number: []
== Found: ['410-123-0002', '410-123-3']
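This is not yet perfect: the marker '(M)' finds nothing when used as a suffix because only the text after it is searched, the overlapping markers ' Mobile' and ' M' report 360.751.0012 twice, and in the email signature the marker ' m' picks up the fragment '410-123-3'. Both the markers and the regex could be refined to fix this.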
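One possible refinement, as a sketch rather than a finished solution: compare markers case-insensitively, accept a marker only when it is not part of a longer word, look directly after it and directly before it for a number, and drop duplicates. The names `find_mobile_numbers`, `MARKERS`, `AFTER_MARKER` and `BEFORE_MARKER` are hypothetical, and the stricter 3-3-4 digit pattern is an assumption about the expected number format.

```python
import re

# Assumption: numbers always have the 3-3-4 shape with dot or dash separators.
NUMBER = r'\d{3}[.-]\d{3}[.-]\d{4}'
AFTER_MARKER = re.compile(r'[:\s]*(' + NUMBER + r')')   # number right after a prefix marker
BEFORE_MARKER = re.compile(r'(' + NUMBER + r')\s*$')    # number right before a suffix marker
MARKERS = ('(mobile)', 'mobile', '(m)', 'm')            # lowercase, no padding needed

def find_mobile_numbers(text):
    numbers = []
    for line in text.splitlines():
        lowered = line.lower()
        for marker in MARKERS:
            for hit in re.finditer(re.escape(marker), lowered):
                start, end = hit.span()
                # Skip markers embedded in a word, e.g. the "m" in "com" or "MD".
                if start > 0 and lowered[start - 1].isalpha():
                    continue
                if end < len(lowered) and lowered[end].isalpha():
                    continue
                # Prefer the number after the marker, else the one just before it.
                found = AFTER_MARKER.match(line, end) or BEFORE_MARKER.search(line[:start])
                if found and found.group(1) not in numbers:
                    numbers.append(found.group(1))
    return numbers

sample = '''(Mobile):360.751.0007
360.751.0012 Mobile
360.751.0013 (M)
410-123-3 001 m: 410-123-0002
111 Cromwell Park Drive, Glen Burnie, MD 21061
website.com e: janedoe@somecompany.com'''

print(find_mobile_numbers(sample))
# ['360.751.0007', '360.751.0012', '360.751.0013', '410-123-0002']
```

The trade-off is that a stricter number pattern will miss numbers that are split by stray spaces, such as '410-123-3 001'.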