
I have a PySpark dataframe df that looks like this:

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |30001|F     |2000  |
+---------+----------+--------+-----+------+------+

I need to apply a filter of id > 40000 only to rows where gender = M, and preserve all the gender = F rows. Therefore, the final dataframe should look like this:

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |30001|F     |2000  |
+---------+----------+--------+-----+------+------+

The only way I can think of doing this is:

df_temp1 = df.filter(df.gender == 'F')
df_temp2 = df.where(df.gender == 'M').filter(df.id > 40000)
df = df_temp1.union(df_temp2)

Is this the most efficient way to do this? I'm new to Spark so any help is appreciated!

  • Welcome to SO. Would `df.filter("(gender == 'M' and id > 40000) or gender == 'F'")` help? – werner Jul 06 '21 at 19:10
  • Although werner's answer is perfect and precise, I would rather suggest using the same thing you mentioned in your question; at least it's more readable. Spark will probably end up doing the same work as the code werner wrote because of the Catalyst optimizations, so there won't be any performance improvement. – Sanket9394 Jul 06 '21 at 20:32
  • pls see the answer below with filter() and let us know if that helps you - appreciate if you can help accept and upvote – dsk Jul 07 '21 at 05:40

2 Answers


This should do the trick. `where` is an alias for `filter`.

>>> df.show()
+-------+------+-----+
|   name|gender|   id|
+-------+------+-----+
|  James|     M|36636|
|Michael|     M|40288|
| Robert|     F|42114|
|  Maria|     F|39192|
|    Jen|     F|30001|
+-------+------+-----+

>>> df.where(''' (gender == 'M' and id > 40000) OR gender == 'F' ''').show()
+-------+------+-----+
|   name|gender|   id|
+-------+------+-----+
|Michael|     M|40288|
| Robert|     F|42114|
|  Maria|     F|39192|
|    Jen|     F|30001|
+-------+------+-----+
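The combined predicate can be sanity-checked outside Spark. Here's a plain-Python sketch (not PySpark, just the same boolean logic) applied to the sample rows, to illustrate which ones survive `(gender == 'M' and id > 40000) or gender == 'F'`:

```python
# Plain-Python illustration of the predicate; the rows mirror the sample data.
rows = [
    {"name": "James",   "gender": "M", "id": 36636},
    {"name": "Michael", "gender": "M", "id": 40288},
    {"name": "Robert",  "gender": "F", "id": 42114},
    {"name": "Maria",   "gender": "F", "id": 39192},
    {"name": "Jen",     "gender": "F", "id": 30001},
]

def keep(row):
    # Same condition as in the where() string above
    return (row["gender"] == "M" and row["id"] > 40000) or row["gender"] == "F"

kept = [r["name"] for r in rows if keep(r)]
print(kept)  # ['Michael', 'Robert', 'Maria', 'Jen']
```

Only James is dropped: he is the one M row with id below 40000, while every F row passes regardless of id.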
Bala

Combine both conditions with OR:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(36636,"M"),(40288,"M"),(42114,"M"),(39192,"F"),(30001,"F")],["id","gender"])
# Keep M rows only when id > 40000; keep all F rows
df = df.filter(((F.col("id") > 40000) & (F.col("gender") == F.lit("M"))) | (F.col("gender") == F.lit("F")))
df.show()

Output

+-----+------+
|   id|gender|
+-----+------+
|40288|     M|
|42114|     M|
|39192|     F|
|30001|     F|
+-----+------+
dsk
  • Thanks! Could you tell me what .lit() does? – Kruttika Swaminathan Jul 07 '21 at 12:37
  • PySpark SQL functions lit() and typedLit() are used to add a new column to a DataFrame by assigning a literal or constant value. Both functions return a Column type… So if you need to compare against some string, wrap it with lit() so it becomes a Column. – dsk Jul 08 '21 at 05:58
  • Can you please accept and upvote my answer - this helps :) – dsk Jul 08 '21 at 05:58