
I have a dataset as follows:

| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |

I also have a list of keywords I'm searching for: Keywords = ['hell', 'horrible', 'sucks']

With the following solution, which uses .rlike() (or .contains()), sentences containing either partial or exact matches for the keywords are flagged as true. I would like only exact (whole-word) matches to be flagged.

Current code:


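from pyspark.sql import functions as F

# Regex alternation of the keywords; rlike matches it anywhere in the text
KEYWORDS = 'hell|horrible|sucks'
df = (
    df
    .select(
        F.col('id'),
        F.col('text'),
        F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
    )
)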

Current output:

| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |

Expected output:

| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
Smithy

2 Answers

Try the code below; I have changed only the keyword pattern:

from pyspark.sql.functions import col,when


data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id|              text|
+---+------------------+
| 01|       hello world|
| 02|this place is hell|
+---+------------------+

KEYWORDS = '(hell|horrible|sucks)$'

df = (
            df2
            .select(
                col('id'),
                col('text'),
                when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
            )
)
df.show()

+---+------------------+-------------+
| id|              text|keyword_found|
+---+------------------+-------------+
| 01|       hello world|            0|
| 02|this place is hell|            1|
+---+------------------+-------------+

Let me know if you need more help on this.

Mahesh Gupta
  • Hey, I attempted adding ()$ to my keyword list. I'm now getting 0s for both ids – Smithy Apr 04 '22 at 13:16
  • @Ant you can try my code and let me know – Mahesh Gupta Apr 04 '22 at 13:20
  • Your code does work. However, I found the issue I'm facing: if the keyword hell is placed in the middle of the sentence, it is ignored. For example, "this place hell is" should also return 1. – Smithy Apr 04 '22 at 13:33
  • @Ant You are using the rlike function, which performs a regex match against the string, and $ anchors the match to the end of the string. – Mahesh Gupta Apr 04 '22 at 15:00
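The behaviour discussed in these comments can be reproduced without Spark. The following is a minimal sketch that uses Python's `re` module as a stand-in for `rlike`; Spark evaluates `rlike` with Java regular expressions, but for these simple patterns the two engines are assumed to agree. The helper name `flags` and the third sample sentence are illustrative only.

```python
import re

# The two rows from the question plus the mid-sentence case from the comments
sentences = ["hello world", "this place is hell", "this place hell is"]

patterns = {
    # Original pattern: matches the keyword anywhere, even inside "hello"
    "unanchored": r"hell|horrible|sucks",
    # Pattern from this answer: the keyword must be the last thing in the string
    "end_anchored": r"(hell|horrible|sucks)$",
}

def flags(pattern, texts):
    """Return 1/0 per text, mimicking when(rlike(pattern), 1).otherwise(0)."""
    return [1 if re.search(pattern, t) else 0 for t in texts]

results = {name: flags(p, sentences) for name, p in patterns.items()}
for name, values in results.items():
    print(name, values)
```

The unanchored pattern flags all three sentences, while the `$` version flags only the sentence that ends with a keyword, which is why "this place hell is" comes back as 0.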

This should work

Keywords = 'hell|horrible|sucks'

# The keyword must be followed by whitespace or the end of the string
pattern = '(' + Keywords + r')(\s|$)'
df = df.select(F.col('id'), F.col('text'),
               F.when(F.col('text').rlike(pattern), 1).otherwise(0).alias('keyword_found'))
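
+---+------------------+-------------+
| id|              text|keyword_found|
+---+------------------+-------------+
| 01|       hello world|            0|
| 02|this place is hell|            1|
+---+------------------+-------------+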
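One possible refinement, not taken from either answer: the pattern above checks only what follows the keyword, so a word such as "seashell" would still be flagged. Wrapping the alternation in `\b` word boundaries on both sides is a common way to require whole-word matches. The sketch below builds such a pattern from the keyword list and checks it with Python's `re` module. The helper `build_whole_word_pattern` and the extra sample sentences are hypothetical, and it is an assumption that the same pattern string behaves identically under Spark's Java regex engine.

```python
import re

def build_whole_word_pattern(keywords):
    """Hypothetical helper: join keywords into one whole-word regex."""
    # Escape each keyword so regex metacharacters in it are matched literally
    alternatives = "|".join(re.escape(k) for k in keywords)
    # \b on both sides requires a word boundary before and after the keyword
    return r"\b(?:" + alternatives + r")\b"

keywords = ["hell", "horrible", "sucks"]
pattern = build_whole_word_pattern(keywords)

# Illustrative inputs: the question's rows plus a few edge cases
samples = [
    "hello world",          # keyword is only a prefix of a longer word
    "this place is hell",   # keyword at the end
    "this place hell is",   # keyword in the middle
    "a seashell",           # keyword is only a suffix of a longer word
    "what the hell!",       # keyword followed by punctuation
]
found = [1 if re.search(pattern, s) else 0 for s in samples]
print(pattern)
print(found)
```

In PySpark the resulting string would be passed as `F.col('text').rlike(pattern)`. Matching is case-sensitive as written; prefixing the pattern with `(?i)` is one way to relax that.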
Sudhin