
For example, I have 2 syntaxes that accomplish the same thing on a finance data frame:

Spark SQL

df.filter("Close < 500").show()

PySpark

df.filter(df["Close"] < 500).show()

Is one of them better for any reason like performance, readability or something else I'm not thinking about?

I'm asking because I'm about to start implementing PySpark in my company, and whichever route I choose will probably become canon there.

Thanks!

A Campos
    There is no performance difference whatsoever between the two statements. Both Spark SQL and the dataframe API end up with the same underlying representation and go through the same optimisation and execution engine. – Hristo Iliev Sep 01 '22 at 08:18
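
You can check this claim directly with explain() (a minimal sketch, assuming spark is an active SparkSession and df is the finance DataFrame from the question):

# Both forms should print the same optimized and physical plan:
# a single Filter over the scan of df.
df.filter("Close < 500").explain()
df.filter(df["Close"] < 500).explain()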

2 Answers


It really depends on your use case. I highly suggest you read these topics so you can get a better idea of what each of them does; I think they cover pretty much everything you need to know to make the decision.

  1. What is PySpark
  2. The difference between Spark and PySpark
  3. What happens when you run PySpark
  4. Spark vs PySpark

Good luck!

vilalabinot

I guess it depends on your coworkers: if they mostly use SQL, Spark SQL will have a big selling point (not that this should be the main reason to decide).

For readability and, more importantly, for refactoring possibilities, I would go with plain DataFrames. And if you are concerned about performance, you can always call df.explain() on both options and compare.

All of this applies when spark.sql() contains complex queries; for the simple examples above I do not think it really matters.
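
For illustration, here is roughly how the two routes compare once the query grows beyond a single filter (a sketch, not from the question; it assumes spark is your SparkSession and df has Symbol and Close columns):

from pyspark.sql import functions as F

# DataFrame route: each step is a Python expression, so it is easy to
# refactor, reuse and test piece by piece.
summary_df = (
    df.filter(df["Close"] < 500)
      .groupBy("Symbol")
      .agg(F.avg("Close").alias("avg_close"))
)

# spark.sql() route: the whole query lives in one SQL string, which
# SQL-minded colleagues may find more readable, but which is harder to
# compose programmatically.
df.createOrReplaceTempView("stocks")
summary_sql = spark.sql("""
    SELECT Symbol, AVG(Close) AS avg_close
    FROM stocks
    WHERE Close < 500
    GROUP BY Symbol
""")

# Both go through the same optimizer; compare the plans if in doubt.
summary_df.explain()
summary_sql.explain()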

bzu