I understand that Spark doesn't actually do anything with a DataFrame until an action is called, but I'm wondering whether the physical location of the transformation steps within the script matters.
I need to read in a rather large file, of which I'll only use a small portion. Let's say I have a 1 TB file but will likely only need to read less than 1 GB of it. If I have a line near the top of the script that filters the DataFrame with something basic like df.filter(df['updatedtime'] > '2018-01-01'), would that trigger predicate pushdown and reduce the amount of data read?
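For context, here's roughly what I mean (the path, format, and column names are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-question").getOrCreate()

# Read the large dataset (Parquet here, but it could be another columnar format)
df = spark.read.parquet("/data/huge_table")  # hypothetical ~1 TB dataset

# Filter applied near the top of the script
df = df.filter(df['updatedtime'] > '2018-01-01')

# ... many more transformations follow further down ...
```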
What about the scenario where the filter doesn't appear until much later in the script - will that still result in predicate pushdown and reduce the data read? Or is this a trial-and-error scenario that I need to test myself?
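If testing it myself is the way to go, I assume I'd compare the physical plans of the two versions with explain() and look for the filter in the scan node, something like this (columns are placeholders):

```python
# Scenario A: filter immediately after the read
early = spark.read.parquet("/data/huge_table").filter("updatedtime > '2018-01-01'")

# Scenario B: other transformations first, filter much later
late = (spark.read.parquet("/data/huge_table")
        .select("id", "updatedtime", "amount")      # placeholder columns
        .withColumnRenamed("amount", "value")
        .filter("updatedtime > '2018-01-01'"))

# Check whether the filter is pushed into the file scan in both plans,
# e.g. something like "PushedFilters: [GreaterThan(updatedtime,...)]"
early.explain()
late.explain()
```

Is comparing the plans like this the right way to confirm it, or does Spark's optimizer handle the late-filter case the same way regardless?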