I have a PySpark dataframe df that looks like this:
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id |gender|salary|
+---------+----------+--------+-----+------+------+
|James | |Smith |36636|M |3000 |
|Michael |Rose | |40288|M |4000 |
|Robert | |Williams|42114|M |4000 |
|Maria |Anne |Jones |39192|F |4000 |
|Jen |Mary |Brown |30001|F |2000 |
+---------+----------+--------+-----+------+------+
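In case it helps to reproduce, this is roughly how the dataframe can be built (the schema is my assumption from the display above, with id and salary as integers):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
data = [
    ("James", "", "Smith", 36636, "M", 3000),
    ("Michael", "Rose", "", 40288, "M", 4000),
    ("Robert", "", "Williams", 42114, "M", 4000),
    ("Maria", "Anne", "Jones", 39192, "F", 4000),
    ("Jen", "Mary", "Brown", 30001, "F", 2000),
]
columns = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
df = spark.createDataFrame(data, columns)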
I need to apply a filter of id > 40000 only to the gender = M rows, while preserving all of the gender = F rows. (Note the threshold: with id > 4000 every M row would survive, since the smallest id is 36636.) The final dataframe should therefore look like this:
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael |Rose | |40288|M |4000 |
|Robert | |Williams|42114|M |4000 |
|Maria |Anne |Jones |39192|F |4000 |
|Jen |Mary |Brown |30001|F |2000 |
+---------+----------+--------+-----+------+------+
The only way I can think of doing this is:
df_temp1 = df.filter(df.gender == 'F')                       # keep every F row as-is
df_temp2 = df.where(df.gender == 'M').filter(df.id > 40000)  # apply the id filter only to M rows
df = df_temp1.union(df_temp2)
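While writing this up, I also wondered whether a single filter with a combined boolean condition would do the same job in one pass (an untested sketch on my part, using the DataFrame API's | and & operators):

from pyspark.sql import functions as F

# Keep every F row, plus only the M rows with id > 40000,
# in one pass instead of two filters and a union
df = df.filter(
    (F.col("gender") == "F")
    | ((F.col("gender") == "M") & (F.col("id") > 40000))
)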
Is either of these the right approach, and which is more efficient? I'm new to Spark, so any help is appreciated!