This may sound like a naive question, but it is a problem I recently faced in my project and I need a better understanding of it.
```scala
df.persist(StorageLevel.MEMORY_AND_DISK)
```
Whenever we persist an HBase read like this, the same data is returned again and again for the subsequent batches of the streaming job, even though the HBase table is updated on every batch run.
HBase Read Code:
```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import org.apache.spark.storage.StorageLevel

val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> schema))
  .format(dbSetup.dbClass)
  .load()
  .persist(StorageLevel.MEMORY_AND_DISK)
```
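For context, `df` is created once before the streaming loop starts and is then reused by every micro-batch, roughly like this (a simplified sketch: `stream`, `batchSchema`, the `rowkey` join column, and the write-back are placeholders, not our actual code):

```scala
// Simplified sketch of how the persisted `df` is consumed per batch.
// Because `df` is defined once outside the loop and persisted, Spark
// serves the cached blocks on every batch instead of rescanning HBase.
stream.foreachRDD { rdd =>
  val batchDf = sqlContext.createDataFrame(rdd, batchSchema)
  val enriched = batchDf.join(df, "rowkey") // joins against the persisted df
  enriched.write                            // placeholder write-back to HBase
    .options(Map(HBaseTableCatalog.tableCatalog -> schema))
    .format(dbSetup.dbClass)
    .save()
}
```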
I replaced persist(StorageLevel.MEMORY_AND_DISK) with cache(), and the job then returned updated records from the HBase table as expected.
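Concretely, the variant that returned fresh records was the same read, ending in cache():

```scala
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> schema))
  .format(dbSetup.dbClass)
  .load()
  .cache() // no explicit storage level
```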
The reason we tried persist(StorageLevel.MEMORY_AND_DISK) is to ensure that in-memory storage does not fill up, and that we do not end up re-running all the transformations during the execution of a particular stream.
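To illustrate what we were trying to avoid (a hedged sketch; colA and colB are hypothetical columns, not from our actual schema):

```scala
// Without persist, every action on `df` re-runs the full lineage,
// starting from a fresh HBase scan:
df.where("colA = 'x'").count() // scan #1
df.where("colB = 'y'").count() // scan #2

// With persist(StorageLevel.MEMORY_AND_DISK), the first action
// materializes `df` once; partitions that do not fit in memory are
// spilled to disk instead of being dropped and recomputed, so later
// actions read cached blocks rather than rescanning HBase.
```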
Spark version: 1.6.3
HBase version: 1.1.2.2.6.4.42-1
Could someone explain this behavior to me and help me get a better understanding?