Filter out words depending on surrounding punctuation

Question

Objective:

I'm looking for a way to match or skip words based on whether or not they are surrounded by quotation marks ' ', guillemets « » or parentheses ( ).

Examples of desired results:

len(re.findall("my word", "blablabla 'my word' blablabla")) should return 0 because linguistically speaking my word =/= 'my word' and hence shouldn't be matched;
len(re.findall("'my word'", "blablabla 'my word' blablabla")) should return 1 because linguistically speaking 'my word' = 'my word' and hence should be matched;
But here's the catch — both len(re.findall("my word", "blablabla «my word» blablabla")) and len(re.findall("my word", "blablabla (my word) blablabla")) should return 1.

My attempt:

I have the following expression (correct me if I'm wrong) at my disposal but am clueless as to how to implement it: (?<!\w)'[^ ].*?\w*?[^ ]'

I wish to make the following code len(re.findall(r'(?<!\w)'+re.escape(myword)+r'(?!\w)', sentence)) – whose aim is to strip out punctuation marks I believe – take into account all of the aforementioned situations.

For now, my code detects my word inside of 'my word' which is not what I want.

Thanks in advance!

If string is `blabla 'bla my word bla' blabla`, should matching with `my word` return 0 or 1? — Prasanna, Dec 19 '20 at 02:43
Interesting question indeed! It should return `1` since `my word` in the whole text isn't directly surrounded by `' '` — Y H R, Dec 19 '20 at 14:54

sophros · Accepted Answer · 2020-12-19T08:37:38.683

I think one strategy is to use a negative look-ahead:

my_word = "word"
pattern = r"(?!'" + my_word + r"')[^']" + my_word

This should do the job as you can check here.

Since a negative look-ahead does not consume characters, you also need [^'] in front of my_word so that a quotation mark ' cannot be the character preceding it. A ^ at the start of a character class negates the class, so [^'] matches any single character except '.
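As a rough check, the pattern above can be run against the examples from the question. This is only a sketch: the helper name count_outside_quotes is made up for illustration, and re.escape is added on the assumption that the search word may contain regex metacharacters.

```python
import re

def count_outside_quotes(word, sentence):
    # Hypothetical helper wrapping the pattern from this answer.
    # (?!'word') rejects a match that would start at the opening quote;
    # [^'] then consumes the single character in front of the word.
    escaped = re.escape(word)
    pattern = r"(?!'" + escaped + r"')[^']" + escaped
    return len(re.findall(pattern, sentence))

print(count_outside_quotes("my word", "blablabla 'my word' blablabla"))  # 0
print(count_outside_quotes("my word", "blablabla «my word» blablabla"))  # 1
print(count_outside_quotes("my word", "blablabla (my word) blablabla"))  # 1
print(count_outside_quotes("my word", "my word blablabla"))              # 0
```

The last call returns 0 because [^'] has to consume a character, so a word at the very start of the string is not counted. A negative look-behind (?<!') in place of [^'] consumes nothing and would not have that limitation.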

If you want to expand the set of quotation marks that should prevent the word from being counted, it is enough to replace ' with a character class listing all the disallowed characters:

r"(?!['`]" + my_word + r"['`])[^'`]" + my_word
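For comparison, here is a sketch of a different variant, not the pattern above: it keeps the (?<!\w) ... (?!\w) look-arounds from the question's own code and simply adds the disallowed quote characters to both of them. The function name count_word and the quotes parameter are assumptions made for this example.

```python
import re

def count_word(word, sentence, quotes="'`"):
    # Hypothetical helper: the word must not be directly preceded or
    # followed by a word character or by any character listed in `quotes`.
    cls = "".join(re.escape(q) for q in quotes)
    pattern = rf"(?<![\w{cls}]){re.escape(word)}(?![\w{cls}])"
    return len(re.findall(pattern, sentence))

print(count_word("my word", "blablabla 'my word' blablabla"))    # 0
print(count_word("'my word'", "blablabla 'my word' blablabla"))  # 1
print(count_word("my word", "blablabla «my word» blablabla"))    # 1
print(count_word("my word", "blablabla (my word) blablabla"))    # 1
print(count_word("my word", "my word blablabla"))                # 1
```

Because look-arounds do not consume characters, this version also counts a word at the very start or end of the string, and guillemets and parentheses are accepted because they are neither word characters nor in quotes.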

It is worth noting that the example from @Prasanna's question is going to be impossible to match using regex. You would need a proper parser, e.g. pyparsing, to handle such situations. Regular expressions cannot handle a match that requires two arbitrary counts of characters to agree (e.g. any number of 'a' followed by the same number of 'b' letters), so it will not be possible to create a generic regular expression with a look-ahead that handles n words followed by myword and at the same time skips n words if they are preceded by a quotation mark.

Thank you for this elaborate explanation, I appreciate your effort! I was unaware of `pyparsing`, but I might have to resort to it later on: in addition to the provided examples there are other requirements I need to meet, and it gets trickier because of all the various unit tests I haven't mentioned here — Y H R, Dec 19 '20 at 15:12
