
Working with Spark DataFrames imported from Hive, I sometimes end up with several columns that I don't need. Suppose I don't want to filter them with

df = sqlContext.sql('select cols from mytable')

and I'm importing the entire table with

df = sqlContext.table('mytable')

does a select and subsequent cache improve performance and decrease memory usage, like

df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()

or is it just a waste of time? I will do lots of operations and data manipulations on df, such as avg, withColumn, etc.
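
For example, the follow-up operations would look something like this (a sketch with hypothetical column names):

from pyspark.sql import functions as F

# derive a new column, then aggregate
df = df.withColumn('col_sum', F.col('col_1') + F.col('col_2'))
df.agg(F.avg('col_3')).show()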

– Ivan (asked; edited by zero323)

2 Answers


IMO it makes sense to filter them beforehand:

df = sqlContext.sql('select col_1, col_2, col_3 from mytable')

so you won't waste resources reading columns you don't need.

If you can't filter at the source, then you can do it as you did (select and then cache), as in the sketch below.
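
For example (a sketch, reusing the sqlContext and column names from the question), projecting immediately after loading the whole table:

df = sqlContext.table('mytable').select('col_1', 'col_2', 'col_3')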

– MaxU - stand with Ukraine

It is certainly a good practice, but it is rather unlikely to result in a performance boost unless you try to pass data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer the projections and push them as early as possible in the execution plan.
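
You can verify the pushdown with explain() (a sketch, assuming the sqlContext and table from the question); the scan in the physical plan should list only the referenced columns:

df = sqlContext.table('mytable')
# only col_1 and col_2 are needed, so the remaining columns should be pruned
df.groupBy('col_1').avg('col_2').explain(True)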

It is also worth noting that using df.count() after df.cache() will be useless most of the time (if not always). In general, count is rewritten by the optimizer as

SELECT SUM(1) FROM table

so what is typically requested from the source is:

SELECT 1 FROM table

Long story short, there is nothing useful to cache here.
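
To see this, inspect the plan for the count itself (a sketch, same assumptions as above):

# df.count() is equivalent to an empty groupBy followed by count
sqlContext.table('mytable').groupBy().count().explain(True)
# the optimized plan aggregates over a trivial projection instead of reading every column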

– zero323
  • Thanks again, @zero323! I was thinking more along the lines of doing a count just after a cache to commit the cache operation and check a few numbers as a byproduct. I noticed that sometimes doing a `select` makes the performance *worse* before the cache. – Ivan Jun 21 '16 at 16:33
  • Well, reasoning about cache in Spark SQL is relatively hard, and tricks like `cache` / `count` are not the best idea. You can see some performance boost that is not related to Spark caching when data is, for example, memory mapped, but IMHO it is more a ritual than anything else. – zero323 Jun 21 '16 at 16:41