
For example, I have 2 syntaxes that accomplish the same thing on a finance data frame:

Spark SQL

df.filter("Close < 500").show()

PySpark

df.filter(df["Close"] < 500).show()

Is one of them better for any reason like performance, readability or something else I'm not thinking about?

I'm asking because I'm about to start implementing PySpark in my company, and whichever route I choose will probably become canon there.

Thanks!

A Campos
    There is no performance difference whatsoever between the two statements. Both Spark SQL and the dataframe API end up with the same underlying representation and go through the same optimisation and execution engine. – Hristo Iliev Sep 01 '22 at 08:18
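
You can check this claim directly with explain() (a minimal sketch, assuming spark is an active SparkSession and df is the finance DataFrame from the question):

# Both forms should print the same optimized and physical plan:
# a single Filter over the scan of df.
df.filter("Close < 500").explain()
df.filter(df["Close"] < 500).explain()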

2 Answers


It really depends on your use case. I highly suggest you read these topics so you can get a better idea of what each of them does; I think they cover pretty much everything you need to know to make the decision.

  1. What is PySpark
  2. The difference between Spark and PySpark
  3. What happens when you run PySpark
  4. Spark vs PySpark

Good luck!

vilalabinot

I guess it depends on your coworkers: if they mostly use SQL, Spark SQL will have a big selling point (not that this should be the main reason to decide).

For readability and, more importantly, for refactoring possibilities, I would go with plain DataFrames. And if you are concerned about performance, you can always call df.explain() on both options and compare.

All of this applies when spark.sql() contains complex queries; for the simple examples above I do not think it really matters.
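
For illustration, here is roughly how the two routes compare once the query grows beyond a single filter (a sketch, not from the question; it assumes spark is your SparkSession and df has Symbol and Close columns):

from pyspark.sql import functions as F

# DataFrame route: each step is a Python expression, so it is easy to
# refactor, reuse and test piece by piece.
summary_df = (
    df.filter(df["Close"] < 500)
      .groupBy("Symbol")
      .agg(F.avg("Close").alias("avg_close"))
)

# spark.sql() route: the whole query lives in one SQL string, which
# SQL-minded colleagues may find more readable, but which is harder to
# compose programmatically.
df.createOrReplaceTempView("stocks")
summary_sql = spark.sql("""
    SELECT Symbol, AVG(Close) AS avg_close
    FROM stocks
    WHERE Close < 500
    GROUP BY Symbol
""")

# Both go through the same optimizer; compare the plans if in doubt.
summary_df.explain()
summary_sql.explain()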

bzu