I understand that Spark doesn't execute anything on a DataFrame until an action is called, but I was wondering whether the physical location of the transformation steps in the code matters.

Specifically, I want to read a rather large file of which I'll only need a small portion. Let's say I have a 1 TB file but will likely only need to read less than 1 GB of it. If I put a line near the top of the script that filters the DataFrame with something basic like `df.filter(df['updatedtime'] > '2018-01-01')`, that would likely reduce the amount of data read by triggering predicate pushdown, correct?
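A minimal sketch of that first scenario, assuming a Parquet dataset at a hypothetical path `/data/events` with an `updatedtime` column (both DataFrame lines are lazy and only build a plan):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lazily describe the scan and the early filter; nothing is read yet.
df = spark.read.parquet("/data/events")
recent = df.filter(df["updatedtime"] > "2018-01-01")

# If pushdown applies, the scan node of the physical plan lists the predicate
# under PushedFilters (and PartitionFilters if the data is partitioned on that column).
recent.explain()
```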

What about the scenario where the filtering line doesn't appear until much later in the code? Will that still trigger predicate pushdown and reduce the data read, or is this something I need to test by trial and error?
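And a sketch of the second scenario with the same hypothetical dataset (the column names `id`, `updatedtime`, and `amount` are made up), where the filter is only written after other transformations:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")
projected = df.select("id", "updatedtime", "amount")
enriched = projected.withColumn("amount_usd", F.col("amount") * 1.1)

# The filter is written last; whether it still ends up in PushedFilters on the
# scan node is exactly what checking the plan would show.
late = enriched.filter(F.col("updatedtime") > "2018-01-01")
late.explain()
```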


1 Answer


In an ideal situation it shouldn't matter. That is the main advantage of the DataFrame API over the RDD API: the optimizer should be able to rearrange the execution plan to achieve optimal performance.

In practice, some operations, which vary from version to version, can introduce an analysis barrier or disable predicate pushdown and/or partition pruning.
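One commonly cited example (a minimal sketch, not an exhaustive list of such operations): wrapping the predicate in a Python UDF makes it opaque to Catalyst, so the filter typically cannot be pushed to the source.

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical path, as in the question

# The lambda is a black box to the optimizer, so this predicate is normally
# evaluated after the data is read rather than pushed into the Parquet scan.
is_recent = F.udf(lambda ts: ts is not None and ts > "2018-01-01", BooleanType())
filtered = df.filter(is_recent(F.col("updatedtime")))

# Compare with the plain-column filter: PushedFilters will usually be empty here.
filtered.explain()
```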

So if you're in doubt, you should check the execution plan to confirm that the expected optimizations are actually applied.
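A short sketch of what that check looks like, using the same hypothetical dataset (`explain(True)` works in Spark 2.x; the `mode` argument needs Spark 3.0+):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events").filter("updatedtime > '2018-01-01'")

# Extended output prints the parsed, analyzed, and optimized logical plans plus the
# physical plan, so you can see where the Filter ends up and what the scan pushes down.
df.explain(True)

# Spark 3.0+ only: a more readable, sectioned layout of the same information.
df.explain(mode="formatted")
```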

Alper t. Turker
  • Great, thanks for the answer. Is there a way to see the execution plan in the Spark UI or some other GUI-based format? I'm seeing `.explain` as the only option to see an execution plan – simplycoding Apr 17 '18 at 16:14