I have data stored in ORC format in S3, partitioned like so: mydata/year=2017/month=8/day=1
I've created an external table in Spark on top of this data:
spark.catalog().createExternalTable(tableName, s3Path, "orc");
This lets me query the data in that bucket using SQL over the Spark Thrift server, and I can run queries that filter on a partition column so results come back faster, like so:
select count(*) from mytable where day = '1';
I can also cache the table in memory for faster results:
cache table mytable
However, I'd like to cache only part of the table, because I don't have enough memory to cache all of the data. I have a year of data, but most of my queries hit only the last month. Is there any way to specify which partitions to cache in memory with the cache table command?
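Conceptually, what I'm after is something like the following (hypothetical syntax; as far as I know cache table doesn't accept a predicate like this):

cache table mytable where month = '8'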
I realize I could do this by creating a view, but I'm wondering whether there's a way to do it without one:
CREATE VIEW mytable_thismonth AS SELECT * FROM mytable where month = '8'
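With the view approach I would then cache the view instead of the full table (assuming cache table accepts a view name, which I believe it does in Spark SQL):

cache table mytable_thismonth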