I have a PySpark DataFrame, created as below, with rows that contain 'zodiac' and rows that contain 'nonzodiac':

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1']
)

I can get all the nonzodiac rows like so:

nonzodiac_rows = df.where(f.col("col1").rlike("nonzodiac"))

But I'm having trouble returning only the 'zodiac' rows, since the equivalent filter matches both the zodiac and nonzodiac rows:

zodiac_rows = df.where(f.col("col1").rlike("zodiac"))

2 Answers

It is because 'nonzodiac' contains 'zodiac' as a substring, so the regex matches both. You need to write a stricter regexp. For example, add one more character:

zodiac_rows = df.where(f.col("col1").rlike("Nzodiac"))

Or use a negative lookbehind so that 'zodiac' only matches when it is not preceded by 'non':

zodiac_rows = df.where(f.col("col1").rlike("(?<!non)zodiac")) 
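
For reference, here is a minimal runnable sketch of the lookbehind approach against the sample data from the question (assuming a local SparkSession; (?<!non) is a Java-regex negative lookbehind, which rlike supports):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1']
)

# 'zodiac' matches only when not immediately preceded by 'non',
# so the 'nonzodiac' rows are filtered out.
zodiac_rows = df.where(f.col("col1").rlike("(?<!non)zodiac"))
zodiac_rows.show()
# With this sample, only ids 2 and 4 remain.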

You can do the opposite of your "nonzodiac" where with the logical NOT operator ~:

zodiac_rows = df.where(~f.col("col1").rlike("nonzodiac"))

Alternatively, the filter function with contains does a plain substring check instead of a regex match:

zodiac_rows = df.filter(~f.col('col1').contains('nonzodiac')).collect()
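
A short sketch of this negation approach, reusing the df built above (note that collect() returns a list of Row objects on the driver rather than a DataFrame):

import pyspark.sql.functions as f

# Keep every row whose col1 does NOT match the regex 'nonzodiac'.
zodiac_rows = df.where(~f.col("col1").rlike("nonzodiac"))

# Same idea with a literal substring check instead of a regex;
# collect() pulls the surviving rows back as a list of Row objects.
zodiac_row_list = df.filter(~f.col('col1').contains('nonzodiac')).collect()

# With the sample data, both keep only ids 2 and 4.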
              