
I am working with transcripts and having trouble matching patterns in a non-greedy fashion. My pattern still grabs far too much and appears to be matching greedily.

A transcript looks like this:

>> John doe: Hello, I am John Doe.

>> Hello, I am Jane Doe.

>> Thank you for coming, we will start in two minutes.

>> Sam Smith: [no audio] Good morning, everyone.

To find the names of the speakers that appear in >> (WHATEVER NAME):, I wrote

import re

pattern = re.compile(r'>>(.*?):')
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)

I expected 'John doe' and 'Sam Smith', but it gives me 'John doe' and 'Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'.

I am confused because .*? is non-greedy, which (I think) should be able to grab just 'Sam Smith'. How should I fix the code so that it only grabs whatever is in >> (WHATEVER NAME):? Also, I am using Python 3.6.

Thanks!

cs95
ybcha204
  • You're misinterpreting what non-greediness means. It means that starting at some left anchor, it will read as little as it has to to form a match. If there is any match from some left anchor, it keeps it. It does *not* mean that it will pull the left anchor to the right in order to shorten a match. – BallpointBen May 02 '18 at 03:42
  • Although not strictly identical to what you want, you may simply use `>> ([^>:]*):`, unless you are going to have `>` in the name – Adrian Shum May 02 '18 at 03:56
  • @BallpointBen I see. Thank you for the clarification. What should I do in this case? – ybcha204 May 02 '18 at 03:57
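
The negated character class suggested in the comments above can be sketched as follows. This is only an illustration under the assumption that speaker names never contain `>` or `:`; the variable names and the `\s*` after `>>` are choices made here, not part of the original suggestion.

```python
import re

transcript = (
    ">> John doe: Hello, I am John Doe. "
    ">> Hello, I am Jane Doe. "
    ">> Thank you for coming, we will start in two minutes. "
    ">> Sam Smith: [no audio] Good morning, everyone."
)

# [^>:]* cannot cross another '>' or ':', so a match that starts at an
# unlabeled '>>' fails instead of running on to the next speaker's colon.
speaker_pattern = re.compile(r">>\s*([^>:]*):")

names = speaker_pattern.findall(transcript)
print(names)  # ['John doe', 'Sam Smith']
```

Because the group is `([^>:]*)` with the quantifier inside the parentheses, the whole name is captured rather than only its last character.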

2 Answers


Do you really need regex? You can split on >> prompts and then filter out your names.

>>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
['John doe', 'Sam Smith']
cs95

Your understanding of non-greedy matching is slightly off. Non-greedy means the regex takes the shortest possible match from the position where it starts matching. It will not move that starting position to the right just because a later starting point would give a shorter match.

For example:

start.*?stop

This matches all of startstartstop: once the engine starts matching at the first start, it keeps going until it finds stop. Non-greedy simply means that for the string startstartstopstop, the match ends at the first stop rather than the last one.
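
A short sketch of that behaviour, using the same toy strings (the variable names are illustrative only):

```python
import re

lazy = re.compile(r"start.*?stop")
greedy = re.compile(r"start.*stop")

# The engine commits to the leftmost 'start'; laziness only decides
# which 'stop' ends the match, not where the match begins.
lazy_short = lazy.search("startstartstop").group(0)
lazy_long = lazy.search("startstartstopstop").group(0)
greedy_long = greedy.search("startstartstopstop").group(0)

print(lazy_short)   # startstartstop
print(lazy_long)    # startstartstop
print(greedy_long)  # startstartstopstop
```

In both lazy cases the match begins at index 0, which is why `>>(.*?):` in the question runs from an unlabeled `>>` all the way to the next speaker's colon.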

For your question, this is easy to solve by restricting which characters a name may contain and using a positive lookahead for the colon.

You may use >> ([a-zA-Z ]+)(?=:):

>>> import re
>>> transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
>>> re.findall(r'>> ([a-zA-Z ]+)(?=:)', transcript)
['John doe', 'Sam Smith']
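
To see what each part of the pattern contributes, here is a small sketch (the sample string and variable names are illustrative). The character class `[a-zA-Z ]+` is what stops a match from running across punctuation into the next utterance; the lookahead `(?=:)` only checks that a colon follows, without making it part of the match.

```python
import re

text = ">> Sam Smith: [no audio] Good morning, everyone."

with_lookahead = re.search(r">> ([a-zA-Z ]+)(?=:)", text)
with_colon = re.search(r">> ([a-zA-Z ]+):", text)

# The lookahead asserts the colon but leaves it out of the overall match.
print(with_lookahead.group(0))  # >> Sam Smith
print(with_colon.group(0))      # >> Sam Smith:

# The captured name is the same either way.
print(with_lookahead.group(1) == with_colon.group(1))  # True
```

Note that this character class assumes names consist only of ASCII letters and spaces; names containing apostrophes, hyphens, digits, or accented letters would need a wider class.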
user3483203