Regular expression - Python [list query]

Question

I am trying to write a regular expression for this list:

data= ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]

And I want to delete all the words that start with an uppercase letter, except the first word of each sentence, i.e., it should leave Fred, He and Rita alone.

The output should be:

["Fred is. He is working for.", "Rita is another employee in."]

I tried looking for a solution but couldn't find any relevant code. Any help would be appreciated.

Thanks.

I tried this ```answer = [re.sub(r'([^.])([A-Z]\w*)', r'\1'|,sent) for d in data] ``` but it's not working — Sanya, Jun 25 '20 at 05:41
To make this a better question, please [edit] it, add the code you tried and explain the point _"but it's not working"_ in more detail. Thanks.. — Patrick Artner, Jun 25 '20 at 05:53

Derek O · Accepted Answer · 2020-06-25T06:09:38.127

You will need to find and remove all capitalized words that don't directly follow a period, then remove the spaces left behind before the punctuation (this solution isn't the cleanest, but it works). List comprehensions come in handy here as well.

import re

data = ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]
# remove runs of capitalized words, skipping the start of the string and any word right after ". "
text = [re.sub(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)', '', item) for item in data]
# remove the space left dangling before closing punctuation
output = [re.sub(r'\s([?.!"](?:\s|$))', r'\1', item) for item in text]

>>> output
['Fred is. He is working for.', 'Rita is another employee in.']
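To see why the second substitution is needed, it can help to print the intermediate result. This is only a sketch: the names `CAPITAL_RUN`, `SPACE_BEFORE_PUNCT` and `remove_capitalized` are made up for illustration, and the patterns are the same two used above.

```python
import re

data = ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]

# Same two patterns as above, compiled once (names are hypothetical)
CAPITAL_RUN = re.compile(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)')
SPACE_BEFORE_PUNCT = re.compile(r'\s([?.!"](?:\s|$))')

def remove_capitalized(sentence):
    """Apply both substitutions to a single string."""
    stripped = CAPITAL_RUN.sub('', sentence)
    return SPACE_BEFORE_PUNCT.sub(r'\1', stripped)

# After the first pass only, a space is left in front of each period
intermediate = [CAPITAL_RUN.sub('', item) for item in data]
print(intermediate)
# ['Fred is . He is working for .', 'Rita is another employee in .']

result = [remove_capitalized(item) for item in data]
print(result)
# ['Fred is. He is working for.', 'Rita is another employee in.']
```

Compiling the patterns once is optional here; `re.sub` with a string pattern gives the same result.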

score 2 · Answer 2 · answered Jun 25 '20 at 05:43

First, let me just apologize for how unhelpful the regular expressions documentation for Python 3 is. All the info needed to answer this question can technically be found in it, but you already need to know a bit about how re works to make sense of it. That being said, hopefully this will give you a leg up:

A simple answer

Here's some code you could try:

import re

data = ["Fred is Deputy Manager. He is working for MNC.", "Rita is another employee in AC Corp."]

# [A-Za-z] rather than [A-z]: the latter also matches [, \, ], ^, _ and `
matcher = re.compile(r"(?<![.])[ ][A-Z][A-Za-z]*")
print([matcher.sub("",d) for d in data])
# prints: ['Fred is. He is working for.', 'Rita is another employee in.']

Basically, this compiles a regular expression which will match capitalized words not following a period:

(?<![.]) -> don't match if preceded by a period
[ ][A-Z][A-Za-z]* -> any capitalized word (which has a leading space, which makes sure it never matches the first word in the string)

Then, it applies that regular expression to each string in your list and replaces the matches with the empty string: ""
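If you want to check what the pattern actually picks up before deleting anything, `findall` is handy. A small sketch, assuming the first sentence from the question as input; it uses `[A-Za-z]` for the letter class, since `[A-z]` also covers a few punctuation characters that sit between `Z` and `a` in ASCII.

```python
import re

matcher = re.compile(r"(?<![.])[ ][A-Z][A-Za-z]*")
sentence = "Fred is Deputy Manager. He is working for MNC."

# Each match includes its leading space; " He" is skipped because
# the space in front of it comes right after a period
matches = matcher.findall(sentence)
print(matches)
# [' Deputy', ' Manager', ' MNC']

# The [A-z] range is wider than it looks: it accepts an underscore
underscore_matches = bool(re.fullmatch(r"[A-z]", "_"))
print(underscore_matches)
# True
```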

Some Limitations

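If your strings ever have double spaces or other whitespace characters (like tabs or carriage returns), the pattern above will miss the words that follow them, because it expects exactly one space. You can fix that by instead using:

matcher = re.compile(r"(?<![.\s])\s+[A-Z][A-Za-z]*")

where \s+ will match one or more whitespace characters. The lookbehind also excludes whitespace here so that a match can only begin at the start of a whitespace run; otherwise a word after a period and two spaces would still be removed.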
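One caveat worth checking with `\s+`: when the extra whitespace comes right after a period, a lookbehind that only excludes a period can still start matching at the second whitespace character. The sketch below compares two variants on a made-up sample string; `period_only` and `guarded` are hypothetical names.

```python
import re

# Lookbehind excludes only a period
period_only = re.compile(r"(?<![.])\s+[A-Z][A-Za-z]*")
# Lookbehind excludes a period or whitespace, so a match must begin
# at the first character of a whitespace run
guarded = re.compile(r"(?<![.\s])\s+[A-Z][A-Za-z]*")

# Double space before "Deputy", a tab before "Manager",
# and a double space after the first period
sample = "Fred is  Deputy\tManager.  He is working for MNC."

period_only_result = period_only.sub("", sample)
guarded_result = guarded.sub("", sample)

print(repr(period_only_result))
# 'Fred is.  is working for.'   <- "He" was removed as well
print(repr(guarded_result))
# 'Fred is.  He is working for.'
```

Neither variant collapses the double space after the period; if that matters, it would need a separate clean-up step.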

Also, if your strings ever lead off with a space, the first word will be treated like any other capitalized word and removed. You can fix that by using:

print([matcher.sub("", d.strip()) for d in data])

to remove the leading and trailing whitespace characters from your string.
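Here is a quick illustration of that leading-space problem, using a made-up padded copy of the second string from the question. It calls `strip()` with no argument, which removes any leading or trailing whitespace, not just spaces.

```python
import re

matcher = re.compile(r"(?<![.])[ ][A-Z][A-Za-z]*")
padded = " Rita is another employee in AC Corp."

# Without stripping, the leading space makes "Rita" look like a mid-sentence word
unstripped_result = matcher.sub("", padded)
print(repr(unstripped_result))
# ' is another employee in.'

# Stripping first keeps the first word
stripped_result = matcher.sub("", padded.strip())
print(repr(stripped_result))
# 'Rita is another employee in.'
```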
