Finding exact word in description column of DataFrame in Python

Question

My list contains words such as ['orange', 'cool', 'app', ...], and I want to extract every one of them that occurs as an exact whole word in the description column of a DataFrame.

I have also attached a sample picture with the code. I used str.findall(), but as the picture shows, it extracts add from additional and app from apple. I do not want that: a word should only be output if it matches a whole word in the text.

score 1 · Answer 1 · answered Oct 09 '20 at 17:03

You can fix the code using

df['exactmatch'] = df['text'].str.findall(fr"\b({'|'.join(list1)})\b").str.join(", ")

Or, if the words in list1 can contain special characters, escape them with re.escape (this requires import re):

df['exactmatch'] = df['text'].str.findall(fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)").str.join(", ")

The patterns created by fr"\b({'|'.join(list1)})\b" and fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)" will look like this, respectively:

\b(orange|cool|app)\b
(?<!\w)(orange|cool|app)(?!\w)

See the regex demo. Note .str.join(", ") is considered faster than .apply(", ".join).
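
As a quick check of the difference, here is a minimal sketch that uses only the standard re module (pandas' str.findall applies re.findall to each element of the column). The sample list and texts are made up for illustration, since the picture from the question is not available.

```python
import re

# Hypothetical sample data, not taken from the original question
list1 = ["orange", "cool", "app", "add"]
texts = ["additional apple app", "a cool orange, add it"]

# Plain alternation with no boundaries: also matches inside longer words
loose = re.compile("|".join(map(re.escape, list1)))

# Whole-word pattern from the answer: no word character may touch the match
exact = re.compile(fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)")

# Same post-processing as .str.join(", ") in the pandas version
loose_hits = [", ".join(loose.findall(t)) for t in texts]
exact_hits = [", ".join(exact.findall(t)) for t in texts]

print(loose_hits)
print(exact_hits)
```

For the first text, the unbounded pattern picks up add from additional and app from apple, while the whole-word pattern returns only the standalone app.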

Wiktor Stribiżew

Thank you! However, if my text also contains hyphenated words (e.g. additional-material) or plurals (e.g. apples), how can I modify the search so that I do not need "additional-material" and "apples" in my list1 but still get the output additional material and apple? Thanks! – ShrestR Oct 11 '20 at 12:55
@ShrestR Try `r"(?<!\w)(" + '|'.join([re.escape(x).replace('\\ ', r'[\s-]') for x in list1]) + r")"` – Wiktor Stribiżew Oct 11 '20 at 15:43
Hi, how to do the same exact match operation in pyspark df? below is for pandas: df['exactmatch'] = df['text'].str.findall(fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)").str.join(", ") – ShrestR Dec 01 '20 at 06:19
@ShrestR I do not know pyspark well, I think you should use a `pyspark.sql.functions.regexp_replace` like `regexp_replace(col, fr"(?s)(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)|.?", r"\1, ")` and this value should be also replaced with `regexp_replace(, '^(?:, )+|(?:, )+$|(, )+', r'\1')` – Wiktor Stribiżew Dec 01 '20 at 08:52
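
The hyphen/plural pattern suggested in the comments above can be tried out with plain re as well. This is only a sketch under assumed inputs: the list and the sample text below are hypothetical, and the pattern is copied from the comment unchanged.

```python
import re

# Hypothetical inputs for illustration only
list1 = ["additional material", "apple"]
text = "additional-material and apples"

# re.escape turns the space into "\ ", which is then widened to [\s-]
# so that either whitespace or a hyphen is accepted between the two words
alternatives = [re.escape(x).replace("\\ ", r"[\s-]") for x in list1]

# No trailing (?!\w) here, so "apple" is also found at the start of "apples"
pattern = r"(?<!\w)(" + "|".join(alternatives) + r")"

matches = re.findall(pattern, text)
print(matches)
```

With these inputs the matches are additional-material and apple. Note that the text is returned as it appears in the source, hyphen included; getting additional material as output would need an extra step such as replacing the hyphen in each match afterwards.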
