
Today I discovered you can filter a PySpark DataFrame via boolean indexing:

In [3]: df.show()
+-----+---+
|name1|  v|
+-----+---+
| john|1.0|
|  sam|4.0|
|  meh|3.0|
+-----+---+

In [6]: df[df['v']>2.0].show()
+-----+---+
|name1|  v|
+-----+---+
|  sam|4.0|
|  meh|3.0|
+-----+---+

A common way to do this is to use PySpark's filter function, e.g. as in Spark - SELECT WHERE or filtering?. But is the bracket syntax above documented and officially supported? I like it because it's consistent with the Pandas syntax (where the filter function means something else entirely).
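For comparison, here is a minimal, self-contained sketch reproducing the example above alongside the documented filter/where calls (the SparkSession setup and app name are my own, added just to make it runnable):

from pyspark.sql import SparkSession

# Set up a local session and the same data as in the example above.
spark = SparkSession.builder.appName("boolean-indexing-demo").getOrCreate()
df = spark.createDataFrame(
    [("john", 1.0), ("sam", 4.0), ("meh", 3.0)],
    ["name1", "v"],
)

# Bracket / boolean-indexing style, as in the question.
df[df["v"] > 2.0].show()

# The documented equivalents: filter() and its alias where().
df.filter(df["v"] > 2.0).show()
df.where(df["v"] > 2.0).show()

All three produce the same two-row result shown above.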

flow2k
    It is supported because we're working towards a pandas-like syntax, but I do not think it is officially documented. – pissall Aug 06 '20 at 05:22

0 Answers