So one of my major pain points is name comprehension and piecing together household names & titles. I have a 80% solution with a pretty massive regex I put together this morning that I probably shouldn't be proud of (but am anyway in a kind of sick way) that matches the following examples correctly:
John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA
The regex matcher looks like this:
(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?
(wtf right?)
For convenience: http://www.pyregex.com/
So, for the example:
'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
the regex results in a group dict that looks like:
connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
I need help with the final step that has been tripping me up, comprehending possible middle names.
Examples include:
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
Is this possible and is there a better way to do this without machine learning? Maybe I can use nameparser (discovered after I went down the regex rabbit hole) instead with some way to determine whether or not there are multiple names? The above matches 99.9% of my cases so I feel like it's worth finishing.
TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.
Note: I don't need to parse titles like Mr. Mrs. Ms., etc., but I suppose that can be added in the same manner as middle names.
Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.