I have a dataset as follows:
| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |
I also have a list of keywords I'm search for: Keywords = ['hell', 'horrible', 'sucks']
When using the following solution using .rlike()
or .contains()
, sentences with either partial and exact matches to the list of words are returned to be true. I would like only exact matches to be returned.
Current code:
KEYWORDS = 'hell|horrible|sucks'
df = (
df
.select(
F.col('id'),
F.col('text'),
F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
Current output:
| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |
Expected output:
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |