One of my major pain points is name comprehension: piecing together household names and titles. I have an 80% solution, a pretty massive regex I put together this morning that I probably shouldn't be proud of (but am anyway, in a kind of sick way). It matches the following examples correctly:

John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA

The regex matcher looks like this:

(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?

(wtf right?)

For convenience: http://www.pyregex.com/

So, for the example:

'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'

the regex results in a group dict that looks like:

connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD

I need help with the final step, which has been tripping me up: comprehending possible middle names.

Examples include:

'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'

Is this possible and is there a better way to do this without machine learning? Maybe I can use nameparser (discovered after I went down the regex rabbit hole) instead with some way to determine whether or not there are multiple names? The above matches 99.9% of my cases so I feel like it's worth finishing.

TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.

Note: I don't need to parse titles like Mr. Mrs. Ms., etc., but I suppose that can be added in the same manner as middle names.

Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.

mzniko
  • Python's [Natural Language Toolkit (NLTK)](http://www.nltk.org/) is *much* better-suited for this task. Check this out: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code – Curtis Mattoon Feb 25 '15 at 20:23
  • @CurtisMattoon Ohh. That looks nice. I was hacking away at this regex in order to get some data out "RIGHT NOW" but NLTK looks like a great long-term solution (and maybe short-term, too). I'm a junior python dev so I don't know about all the solutions out there. – mzniko Feb 25 '15 at 20:31
  • Even though there might be better tools for your task, you could use the `re.VERBOSE` flag to make your current regex more readable. https://docs.python.org/3/library/re.html#re.VERBOSE – user Feb 25 '15 at 20:45
  • Also, you can treat your pattern as you would a string. E.g. `r'%s' % 'cat'`. – user Feb 25 '15 at 20:51
  • Your regex could be much simpler if you first split the string on `"&"` and `"and"` and then parsed the pieces. – Steven Rumbalski Feb 25 '15 at 21:05
  • Before you continue, read this: http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ – Mark Ransom Feb 25 '15 at 21:25
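
The comment suggestions above (`re.VERBOSE`, building the pattern like a string, and splitting on the connector first) can be combined, and together they also cover the lookahead asked about in the TLDR. What follows is only a minimal sketch, not a tested solution: `NAME`, `TITLE`, `PERSON` and `parse_household` are made-up names, and it assumes that a name is a capital letter followed by a lowercase one while a title is all capitals with optional dots.

```python
import re

NAME = r"[A-Z][a-z][\w-]*"   # assumption: capital, then a lowercase letter
TITLE = r"(?:[A-Z]\.?)+"     # assumption: capitals with optional dots (MD, C.P.A.)

# One person per string. re.VERBOSE ignores the whitespace in the pattern
# and allows a comment on each part.
PERSON = re.compile(r"""
    ^\s*
    (?P<first>%(name)s)
    (?: \s+ (?P<middle>%(name)s)        # optional middle name ...
        (?= \s+ %(name)s ) )?           # ... only when another name follows it
    (?: \s+ (?P<last>%(name)s) )?       # last name, absent for the first John below
    (?P<titles> (?: [\s,]+ %(title)s )* )
    \s*$
""" % {"name": NAME, "title": TITLE}, re.VERBOSE)


def parse_household(text):
    """Split on 'and' / '&', then parse each piece with PERSON."""
    people = []
    for piece in re.split(r"\s+(?:and|&)\s+", text):
        match = PERSON.match(piece)
        if match is None:
            raise ValueError("unrecognised name: %r" % piece)
        person = match.groupdict()
        person["titles"] = re.findall(TITLE, person["titles"])
        people.append(person)
    # "John and Jim Jeffries": people without a last name borrow the final one.
    for person in people:
        if person["last"] is None:
            person["last"] = people[-1]["last"]
    return people


print(parse_household("John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD"))
print(parse_household("John and Jim Jeffries"))
```

The lookahead `(?= \s+ NAME )` is what keeps "Jeffries" from being taken as a middle name in "John Jeffries M.D.": a title follows it, not another name. The capitalisation assumption is fragile (see the falsehoods link above), so treat this as an illustration of the technique only.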

1 Answer

Regular expressions like this are the work of the Dark One.

Who, looking at your code later, will be able to understand what is going on? Will you even?

How will you test all of the possible edge cases?

Why have you chosen to use a regular expression at all? If the tool you are using is so difficult to work with, it suggests that maybe another tool would be better.

Try this:

import re

examples = [
  "John Jeffries",
  "John Jeffries, M.D.",
  "John Jeffries, MD",
  "John Jeffries and Jim Smith",
  "John and Jim Jeffries",
  "John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries M.D. and Jennifer Holmes CPA",
  "John Jeffries M.D. & Jennifer Holmes CPA",
  'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
  'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]

def IsTitle(inp):
  #A title is capital letters, each optionally followed by "." (MD, M.D., C.P.A)
  return re.match(r'^([A-Z]\.?)+$', inp.strip()) is not None

def ParseName(name):
  #Titles are separated from each other and from names with ","
  #We don't need these, so we remove them
  name = name.replace(',',' ') 
  #Split name and titles on spaces, combining adjacent spaces
  name = name.split()
  #Build an output object
  ret_name = {"first":None, "middle":None, "last":None, "titles":[]}
  #First string is always a first name
  ret_name['first'] = name[0]
  if len(name)>2: #John Johnson Smith/PhD
    if IsTitle(name[2]): #John Smith PhD
      ret_name['last']   = name[1]
      ret_name['titles'] = name[2:]
    else:                #John Johnson Smith, PhD, MD
      ret_name['middle'] = name[1]
      ret_name['last']   = name[2]
      ret_name['titles'] = name[3:]
  elif len(name) == 2:   #John Johnson
    ret_name['last'] = name[1]
  return ret_name

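def CombineNames(inp):
  #"John and Jim Jeffries": the first person borrows the second's last name
  #The length check avoids an IndexError when there is only one person
  if len(inp) > 1 and not inp[0]['last']:
    inp[0]['last'] = inp[1]['last']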

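def ParseString(inp):
  inp = inp.replace("&","and")     #Names are combined with "&" or "and"
  inp = re.split(r"\s+and\s+",inp) #Split names apart
  inp = list(map(ParseName,inp))   #A list, so CombineNames can index it on Python 3
  CombineNames(inp)
  return inp

for e in examples:
  print(e)
  print(ParseString(e))

Output (from Python 2; Python 3.7+ prints each dict's keys in insertion order: first, middle, last, titles):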

John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]

This took less than fifteen minutes and, at each stage, the logic is clear and the program can be debugged in pieces. While one-liners are cute, clarity and testability should take precedence.
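
To make the testability point concrete, here is a rough sketch of a table-driven check. It is a condensed Python 3 port of the functions above under hypothetical snake_case names (`is_title`, `parse_name`, `parse_string`, `run_cases`), and the cases in `CASES` are only a small sample, not a complete edge-case list.

```python
import re


def is_title(token):
    """True for all-caps tokens with optional dots, e.g. MD or C.P.A."""
    return re.match(r"^([A-Z]\.?)+$", token) is not None


def parse_name(text):
    """Parse one person: first name, optional middle, last, then titles."""
    tokens = text.replace(",", " ").split()
    person = {"first": tokens[0], "middle": None, "last": None, "titles": []}
    if len(tokens) > 2:
        if is_title(tokens[2]):      # John Smith PhD
            person["last"], person["titles"] = tokens[1], tokens[2:]
        else:                        # John Johnson Smith, PhD
            person["middle"], person["last"] = tokens[1], tokens[2]
            person["titles"] = tokens[3:]
    elif len(tokens) == 2:           # John Johnson
        person["last"] = tokens[1]
    return person


def parse_string(text):
    """Split on 'and' / '&', parse each person, share a trailing last name."""
    people = [parse_name(p) for p in re.split(r"\s+(?:and|&)\s+", text)]
    if len(people) > 1 and not people[0]["last"]:
        people[0]["last"] = people[1]["last"]
    return people


# Each case: input string -> expected (first, middle, last, titles) per person.
CASES = [
    ("John Jeffries, MD", [("John", None, "Jeffries", ["MD"])]),
    ("John and Jim Jeffries",
     [("John", None, "Jeffries", []), ("Jim", None, "Jeffries", [])]),
    ("John Jimmy Jeffries, C.P.A., MD",
     [("John", "Jimmy", "Jeffries", ["C.P.A.", "MD"])]),
    ("John", [("John", None, None, [])]),   # single name must not crash
]


def run_cases():
    """Return the (input, actual) pairs that did not match expectations."""
    failures = []
    for text, expected in CASES:
        got = [(p["first"], p["middle"], p["last"], p["titles"])
               for p in parse_string(text)]
        if got != expected:
            failures.append((text, got))
    return failures


print(run_cases() or "all cases pass")
```

Each new edge case becomes one more line in `CASES`, and a failure points at one small function rather than at a single giant pattern.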

Richard
  • I repent my use of dark magic! I hope this helps others avoid the same. I'll accept this as the solution and add a note pointing people to NLTK and the nameparser lib. – mzniko Feb 25 '15 at 21:40