Using the following example on a large table:

pages = spark.sql('select * from table xx')

I found that the query runs in seconds, but as soon as I want to look at the data with pages.show(n=10), it takes minutes to return even that small sample. What is happening under the hood that makes it so slow?

The SQL command (spark.sql) takes < 1 second, but pages.show(n=10) takes minutes.

disruptive

1 Answer

Spark uses lazy evaluation, so it won't actually start executing the command (e.g. select * from table xx) until an 'action' is called (e.g. .show(), writing output via .write, or display() in Databricks).

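The part that runs in < 1 sec is the evaluation of the command itself: Spark checks that the command can be executed and builds a plan for it, but does not actually execute anything until an action is called. That is why the time shows up at pages.show(n=10) rather than at spark.sql.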

Related reads on Transformation vs Actions with Spark:

Amelia N Chu