My list contains words such as ['orange', 'cool', 'app', ...], and I want to extract every one of these exact whole words (where present) from a description column in a DataFrame.

I have also attached a sample picture with the code. I used str.findall(). As the picture shows, it extracts add from additional and app from apple, which I do not want. It should only output a word when it matches the whole word.

Thân LƯƠNG Đình
ShrestR

1 Answer

You can fix the code using

df['exactmatch'] = df['text'].str.findall(fr"\b({'|'.join(list1)})\b").str.join(", ")

Or, if the words in list1 can contain special characters (this version needs import re),

df['exactmatch'] = df['text'].str.findall(fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)").str.join(", ")

The pattern created by fr"\b({'|'.join(list1)})\b" and fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)" will look like

\b(orange|cool|app)\b
(?<!\w)(orange|cool|app)(?!\w)

See the regex demo. Note .str.join(", ") is considered faster than .apply(", ".join).
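
To see why the lookaround version matters, here is a minimal sketch using only the standard `re` module. The entry "c++" and the sample sentence are made up for illustration and are not from the question. Both patterns escape the words, so the only difference is the boundary check.

```python
import re

# Hypothetical sample data: "c++" is added only to show a special-character entry.
list1 = ["orange", "cool", "app", "c++"]
text = "An additional apple, a cool app and some c++ code"

# Escape every word so that "+" is treated literally, then join with "|".
alternation = "|".join(map(re.escape, list1))

# \b needs a word character on one side, so it fails right after "c++".
boundary_pattern = re.compile(r"\b(" + alternation + r")\b")

# The lookarounds only require that no word character is adjacent to the match.
lookaround_pattern = re.compile(r"(?<!\w)(" + alternation + r")(?!\w)")

boundary_matches = boundary_pattern.findall(text)
lookaround_matches = lookaround_pattern.findall(text)

print(boundary_matches)    # ['cool', 'app']
print(lookaround_matches)  # ['cool', 'app', 'c++']
```

Neither pattern picks up add from additional or app from apple. The same pattern strings can be passed to df['text'].str.findall(...) exactly as in the lines above.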

Wiktor Stribiżew
  • Thank you! However, if my text also has words with a hyphen, e.g. additional-material, or plurals, e.g. apples, how can I modify my search so that, instead of putting "additional-material" and "apples" in my list1, I still get the output additional material and apple? Thanks! – ShrestR Oct 11 '20 at 12:55
  • @ShrestR Try `r"(?<!\w)(" + '|'.join([re.escape(x).replace('\\ ', r'[\s-]') for x in list1]) + r")"` – Wiktor Stribiżew Oct 11 '20 at 15:43
  • Hi, how can I do the same exact-match operation on a PySpark DataFrame? The code below is for pandas: df['exactmatch'] = df['text'].str.findall(fr"(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)").str.join(", ") – ShrestR Dec 01 '20 at 06:19
  • @ShrestR I do not know pyspark well, I think you should use a `pyspark.sql.functions.regexp_replace` like `regexp_replace(col, fr"(?s)(?<!\w)({'|'.join(map(re.escape, list1))})(?!\w)|.?", r"\1, ")` and this value should be also replaced with `regexp_replace(, '^(?:, )+|(?:, )+$|(, )+', r'\1')` – Wiktor Stribiżew Dec 01 '20 at 08:52
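
The hyphen and plural pattern suggested in the comments above can be tried out with plain `re`. This is a rough sketch: the sample list, the sample sentence, and the final hyphen-to-space normalisation step are assumptions added for illustration, not part of the original comment.

```python
import re

# Hypothetical sample data for illustration.
list1 = ["additional material", "apple"]
text = "See the additional-material sheet; apples and pineapple are extra"

# re.escape turns the space into "\ "; swap that for a class that accepts
# either whitespace or a hyphen between the two words.
parts = [re.escape(x).replace("\\ ", r"[\s-]") for x in list1]

# There is no trailing (?!\w), so "apple" is still found at the start of "apples".
pattern = re.compile(r"(?<!\w)(" + "|".join(parts) + r")")

matches = pattern.findall(text)
print(matches)  # ['additional-material', 'apple']

# Assumed extra step: map hyphenated hits back to the spelling used in list1.
normalized = [m.replace("-", " ") for m in matches]
print(normalized)  # ['additional material', 'apple']
```

Dropping the trailing lookahead is what lets plurals through, but it also means any longer word that starts with a list entry will match, so the leading (?<!\w) is the only whole-word guard left (it is what rejects pineapple here).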