I have searched SO and other forums and found solutions for handling plurals, but my case is different: the keywords are read from an Excel file.

I have a long list of keywords and I am passing that list to my regex like below:

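import re
import pandas as pd
# Read the keywords from the first sheet and join them into one alternation ending in \b.
df = pd.read_excel('\\Keywords.xlsx', sheet_name=0)
keyword_list = df['Keyword_List'].tolist()
keywords_regex = r'(({0})\b)'.format('|'.join(keyword_list))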

I have to keep \b at the end because I have words like "Meet" and don't want words like "Meeting" to be matched.

I have a large paragraph of text and want to count how many words from my keyword list occur in it, including plurals. So if the paragraph contains both "Boy" and "Boys", I want both to be matched. Currently the code below works only for the singular form:

matches = re.findall(keywords_regex, text, re.IGNORECASE)  # text is the long paragraph

I could always add the plural forms to the Excel file to get the matches, but I am looking for a way to handle this at the regex or Python level only.
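For regular plurals only, one regex-level possibility is to allow an optional "s" or "es" between the keyword and the closing \b. The following is a minimal sketch, not a complete solution: the `keyword_list` and `text` values are hypothetical stand-ins for the Excel column and the real paragraph, and the `re.escape` call and the leading \b are additions that the original pattern does not have.

```python
import re

# Hypothetical stand-in for df['Keyword_List'].tolist()
keyword_list = ["Boy", "Meet", "Box"]

# Escape each keyword so regex metacharacters in the Excel data are taken literally.
alternation = "|".join(re.escape(k) for k in keyword_list)

# Optional "es" or "s" before the closing word boundary covers regular plurals.
keywords_regex = re.compile(r"\b(?:{0})(?:es|s)?\b".format(alternation), re.IGNORECASE)

# Hypothetical sample text
text = "The boy met the boys at the meeting; they meet near two boxes."
matches = keywords_regex.findall(text)
print(matches)  # ['boy', 'boys', 'meet', 'boxes']
```

"Meeting" is still rejected because the \b has to follow directly after the keyword or its suffix. Note that this would also accept "Meets", and it cannot cope with plurals that change the word stem.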

Rahul Agarwal
  • I'm not sure regex engines are aware that if you add a *foot* to another *foot* you'll end up with two *feet* – Thomas Ayoub Sep 17 '18 at 13:10
  • What about `countries`? You will also have to modify word base for some words (replace final `y` after consonants with `i`), and account for `es` endings. – Wiktor Stribiżew Sep 17 '18 at 13:10
  • Might be a good idea to stem the text in your document, and have a keyword list of only stems. That should get you decently accurate results. Here's a link that might be helpful: http://www.nltk.org/howto/stem.html – Wiggy A. Sep 17 '18 at 13:14
  • [This question](https://stackoverflow.com/questions/18902608/generating-the-plural-form-of-a-noun/19018986)'s answers provide multiple libraries that seem to solve the pluralization problem. It's a bit old though, so I'm not sure they will still be relevant nowadays. – Aaron Sep 17 '18 at 13:15
  • @Aaron: This helps to a certain extent. – Rahul Agarwal Sep 17 '18 at 13:18
  • I think the biggest performance hit here will be using the huge regex to parse your input. If you need to improve performances I'd look into a case-insensitive full-word search engine to get rid of regex. – Aaron Sep 17 '18 at 13:25
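The points raised in the comments (a final `y` after a consonant becoming `ies`, `es` endings, and replacing the large regex with a plain full-word lookup) can be combined in a rough sketch like the one below. The `plural_forms` helper, the sample `keyword_list`, and the sample `text` are all hypothetical, and the rules cover only regular English plurals; irregular forms such as foot/feet would still need a stemmer or one of the pluralization libraries mentioned above. It also assumes single-word keywords.

```python
import re

def plural_forms(word):
    """Return the lowercased word plus one naively guessed regular plural."""
    w = word.lower()
    if re.search(r"[^aeiou]y$", w):
        plural = w[:-1] + "ies"   # country -> countries
    elif re.search(r"(?:s|x|z|ch|sh)$", w):
        plural = w + "es"         # box -> boxes
    else:
        plural = w + "s"          # boy -> boys
    return {w, plural}

# Hypothetical stand-in for the keyword column read from Excel
keyword_list = ["Country", "Boy", "Box", "Meet"]

# Map every accepted surface form back to its original keyword.
lookup = {}
for kw in keyword_list:
    for form in plural_forms(kw):
        lookup[form] = kw

# Hypothetical sample text
text = "Two countries and one country; boys meet at the meeting with boxes."

# Split into lowercase words and check each one against the dictionary.
tokens = re.findall(r"[A-Za-z]+", text.lower())
matches = [(tok, lookup[tok]) for tok in tokens if tok in lookup]
print(matches)
```

Because each token is checked with a dictionary lookup, the cost per word does not grow with the length of the keyword list, and "meeting" is skipped simply because it is not one of the generated forms.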
