Today I discovered you can filter a PySpark DataFrame via boolean indexing:
In [3]: df.show()
+-----+---+
|name1| v|
+-----+---+
| john|1.0|
| sam|4.0|
| meh|3.0|
+-----+---+
In [6]: df[df['v']>2.0].show()
+-----+---+
|name1| v|
+-----+---+
| sam|4.0|
| meh|3.0|
+-----+---+
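For completeness, here is a minimal sketch to reproduce the example above (the SparkSession setup and DataFrame construction are my assumptions, since they weren't shown; only the data values come from the show() output):

from pyspark.sql import SparkSession

# Assumed setup; in the session above, `spark` and `df` already exist.
spark = SparkSession.builder.appName("boolean-indexing-demo").getOrCreate()

# Recreate the example DataFrame from the show() output.
df = spark.createDataFrame(
    [("john", 1.0), ("sam", 4.0), ("meh", 3.0)],
    ["name1", "v"],
)

# Boolean-indexing style: df[...] with a Column condition.
df[df['v'] > 2.0].show()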
A common way to do this is to use PySpark's filter function (see, e.g., Spark - SELECT WHERE or filtering?). But is the boolean-indexing syntax above documented and officially supported? I like this syntax because it's consistent with Pandas, where filter means something else entirely (it selects rows or columns by label, not by a boolean condition).
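For comparison, here is the same row filter written with the documented filter / where methods (where() is an alias for filter()); all of these should return the same rows as the boolean-indexing form:

# Equivalent, documented ways to express the same row filter.
df.filter(df['v'] > 2.0).show()   # filter with a Column expression
df.where(df['v'] > 2.0).show()    # where() is an alias for filter()
df.filter('v > 2.0').show()       # filter with a SQL expression string
df[df['v'] > 2.0].show()          # the boolean-indexing form in question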