Trying to pull names and countries out of a string with regex, but getting overlapping matches

Question

I realize this is an odd use case. The non-profit I work for has a Google Sheet that has webinar info in it. I am trying to pull this info out and then format it in HTML so they can just copy-paste it onto the website.
Each row is one webinar.
All the speakers are in one cell for each webinar.

I am trying to pull the names and countries out and put them in a list. I am super new to Python, so if there is a better way I am super open to it.

Sample data from the cell:

Moderators: 
Willaim Riker MD (USA) <wriker@example.com>
Deana Troy (Portugal) <dtroy@gmail.com>

Speakers:
Tasha Yar, MD PhD (Brazil) <example@example.com>
S'chn T'gai Spock, MD PhD FACS (USA) <example@gexample.com>
Leonard James Akaar, MD (Argentina) <example@gexample.com>
Worf Wo'rIv PhD (Brazil) <example@example.com>

What I have so far, using gspread:

import re
import gspread
from oauth2client.service_account import ServiceAccountCredentials
scopes = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive',
]
# load the JSON key you downloaded earlier
credentials = ServiceAccountCredentials.from_json_keyfile_name("webinar-gs-html-16720182572b.json", scopes)
file = gspread.authorize(credentials)  # authenticate the JSON key with gspread
sheet = file.open("PAAO_Webinar_Calendar")
worksheet = sheet.get_worksheet(0)
row = int(input("Enter Row Number:"))  # convert the typed text to an integer row index

values_list = worksheet.row_values(row)

#make a list of the speakers
alspkrs = worksheet.cell(row, 7).value
spkrs = re.findall(r'([a-zA-Z]+\s[a-zA-Z]+\s+\([a-zA-Z]+\))', alspkrs)
spkrnum = len(spkrs)
print(spkrs)

print(spkrs) results in

['Riker MD (USA)', 'Deana Troy (Portugal)', 'MD PhD (Brazil)', 'PhD FACS (USA)', 
 'rIv PhD (Brazil)']

I originally tried something like "for line in alspkrs:" etc., but couldn't figure out how to get the names when sometimes they are two words and sometimes three or four.

Ideally, they would end up in a dictionary with keys like moderator1, speaker1, etc but I am not there yet.

I'm not sure this is a regex problem. What you need is a way to identify that the line is a speaker, and then use the entire line, except for the email address. Right? If the email always follows the last right paren, that should be easy. — Tim Roberts, Jan 29 '22 at 00:36
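A minimal sketch of the line-based idea in this comment, assuming every person line ends with "(Country)" followed by the email and that section headers end with a colon. The names parse_people and LINE_RE are hypothetical, and the key format (moderator1, speaker1) follows what the question asks for.

```python
import re

# Sample cell text copied from the question.
cell_text = """Moderators:
Willaim Riker MD (USA) <wriker@example.com>
Deana Troy (Portugal) <dtroy@gmail.com>

Speakers:
Tasha Yar, MD PhD (Brazil) <example@example.com>
S'chn T'gai Spock, MD PhD FACS (USA) <example@gexample.com>
Leonard James Akaar, MD (Argentina) <example@gexample.com>
Worf Wo'rIv PhD (Brazil) <example@example.com>
"""

# Assumed layout: everything before the last "(...)" is the name (plus any
# title), the parenthesised text is the country, and the rest is the email.
LINE_RE = re.compile(r"^(?P<name>.+)\s+\((?P<country>[^()]+)\)[^()]*$")


def parse_people(text):
    """Return a dict such as {'moderator1': (name, country), ...}."""
    people = {}
    counts = {}
    role = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Section header, e.g. "Moderators:" becomes "moderator".
            role = line[:-1].strip().lower().rstrip("s")
            continue
        match = LINE_RE.match(line)
        if match is None or role is None:
            continue  # skip lines that do not fit the assumed layout
        counts[role] = counts.get(role, 0) + 1
        key = f"{role}{counts[role]}"
        people[key] = (match["name"].strip(), match["country"])
    return people


people = parse_people(cell_text)
for key, value in people.items():
    print(key, value)
```

Because the whole text before the country is kept, titles such as "MD PhD" stay attached to the name; whether that is acceptable depends on how the HTML should look.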
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. — Community, Feb 07 '22 at 12:00

score 1 · Answer 1 · edited Jan 29 '22 at 01:39

The main problem is that not all the lines contain a comma that separates the name and the title, so your regex couldn't know when to stop matching the name and start matching the title.

The best result I managed to come up with is:

"([a-zA-Z' ]+),?\s*(MD|PhD|MD PhD| MD PhD FACS)?\s+\(([a-zA-Z]+)\)"

Which gives this result with your sample data:

import re

sample_data = """
Moderators:
Willaim Riker MD (USA) wriker@example.com
Deana Troy (Portugal) dtroy@gmail.com

Speakers:
Tasha Yar, MD PhD (Brazil) example@example.com
S'chn T'gai Spock, MD PhD FACS (USA) example@gexample.com
Leonard James Akaar, MD (Argentina) example@gexample.com
Worf Wo'rIv PhD (Brazil) example@example.com
"""

regex = r"([a-zA-Z' ]+),?\s*(MD|PhD|MD PhD| MD PhD FACS)?\s+\(([a-zA-Z]+)\)"

spkrs = re.findall(regex, sample_data)
print(spkrs)
# Prints:
# [('Willaim Riker MD', '', 'USA'), 
#  ('Deana Troy', '', 'Portugal'), 
#  ('Tasha Yar', 'MD PhD', 'Brazil'), 
#  ("S'chn T'gai Spock", ' MD PhD FACS', 'USA'), 
#  ('Leonard James Akaar', 'MD', 'Argentina'), 
#  ("Worf Wo'rIv PhD", '', 'Brazil')]

There are still cases where the title is not matched as a title but as part of the name (for example 'Willaim Riker MD' and "Worf Wo'rIv PhD"). This is because of the missing-comma problem I mentioned. You could try to solve it by fixing individual matches and moving their titles to the correct place (or removing the titles if they are not needed).
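One possible way to do that fix-up, as a rough sketch: strip known titles off the end of each matched name and merge them into the title field. The KNOWN_TITLES set and the split_trailing_titles helper are assumptions for illustration, not part of the regex above, and the set would need extending for other titles in the real data.

```python
# Titles assumed to appear in the data; extend this set as needed.
KNOWN_TITLES = {"MD", "PhD", "FACS"}


def split_trailing_titles(name, title):
    """Move known titles found at the end of `name` into `title`."""
    words = name.split()
    moved = []
    # Always keep at least one word as the name.
    while len(words) > 1 and words[-1] in KNOWN_TITLES:
        moved.insert(0, words.pop())
    # title.split() also drops stray leading/trailing spaces.
    merged = " ".join(moved + title.split())
    return " ".join(words), merged


# Tuples as printed by re.findall in the example above.
matches = [
    ("Willaim Riker MD", "", "USA"),
    ("Deana Troy", "", "Portugal"),
    ("Tasha Yar", "MD PhD", "Brazil"),
    ("S'chn T'gai Spock", " MD PhD FACS", "USA"),
    ("Leonard James Akaar", "MD", "Argentina"),
    ("Worf Wo'rIv PhD", "", "Brazil"),
]

cleaned = [split_trailing_titles(n, t) + (c,) for n, t, c in matches]
for row in cleaned:
    print(row)
```

This only helps when the trailing word is an exact match for a known title, so a surname that happens to equal one of the titles would be misclassified.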

Thank you. I see what I did. I think I will have to have them make the data more consistent. — zenhappens, Jan 29 '22 at 14:34
