Working with Spark DataFrames imported from Hive, I sometimes end up with several columns that I don't need. Supposing that I don't want to filter them out at read time with
df = sqlContext.sql('select cols from mytable')
and I'm importing the entire table with
df = sqlContext.table('mytable')
do a select and a subsequent cache improve performance and decrease memory usage, like
df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()  # action that forces the cache to be materialized
or is it just a waste of time? I will do lots of operations and data manipulations on df, like avg, withColumn, etc.
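For example, the overall flow would look roughly like the sketch below (the column names and the particular aggregation are just placeholders I made up, not my real schema):

from pyspark.sql import functions as F

df = sqlContext.table('mytable')
df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()  # action that forces the cache to be materialized

# examples of the kind of later manipulations I mean
df = df.withColumn('col_2_scaled', F.col('col_2') * 2)
df.groupBy('col_1').avg('col_3').show()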