
This might sound like a naive question, but it is a problem I recently faced in my project, and I need a better understanding of it.

df.persist(StorageLevel.MEMORY_AND_DISK)

Whenever we use such a persist on an HBase read, the same data is returned again and again for the subsequent batches of the streaming job, even though HBase is updated on every batch run.

HBase Read Code:

val df = sqlContext.read.options(Map(HBaseTableCatalog.tableCatalog -> schema)).format(dbSetup.dbClass).load().persist(StorageLevel.MEMORY_AND_DISK)

When I replaced persist(StorageLevel.MEMORY_AND_DISK) with cache(), it returned updated records from the HBase table as expected.

The reason we tried persist(StorageLevel.MEMORY_AND_DISK) is to ensure that the in-memory storage does not fill up and that we do not end up redoing all the transformations during the execution of a particular stream.

Spark version: 1.6.3
HBase version: 1.1.2.2.6.4.42-1

Could someone explain this to me and help me get a better understanding?

  • @JacekLaskowski - As requested! Also, by "other subsequent batches" I mean the next batch after a set interval of the Spark streaming job. – Dasarathy D R Aug 27 '18 at 07:23
  • How is "HBase Read Code" used in a Spark Streaming job? Please include the code in your question. Is this `foreach` or something similar? Any reasons to stick with 1.6.3? I doubt it gets lots of traction (if any at all). – Jacek Laskowski Aug 27 '18 at 10:51
  • Here is something I also faced: https://stackoverflow.com/questions/51791008/spark-application-returns-different-results-based-on-different-executor-memory – Avishek Bhattacharya Aug 27 '18 at 11:34
  • @JacekLaskowski I have used the same code in my Spark Streaming job as well. There is no `foreach` or anything like that. For an example of how we read HBase, please check the **withCatalog** method here: https://github.com/hortonworks-spark/shc/blob/master/examples/src/main/scala/org/apache/spark/sql/execution/datasources/hbase/HBaseSource.scala That should give you a better idea. – Dasarathy D R Aug 28 '18 at 07:21
  • @JacekLaskowski - No particular reason to stick with 1.6.3; it's just that our cluster is built with that particular version. Also, I hope my responses are in line with your questions. If not, please feel free to elaborate. – Dasarathy D R Aug 28 '18 at 07:23
  • @AvishekBhattacharya I went through the link you shared. Apologies for the delay; I was moved to a different module and have only just got back to this work. Even though I understand what is mentioned in that link, is that a concrete reason for this behaviour, or do we have anything else to look into? – Dasarathy D R Mar 20 '19 at 04:32

1 Answer


As you mentioned, you are looking for the reason "why", so I'm answering; otherwise this question would remain unanswered, since there is no rational reason these days to run Spark 1.6.3 just to reproduce what happens with that specific HBase version.

Internally, Spark calls persist() when you use cache(), but the default storage level differs between RDDs and Datasets (or DataFrames): on RDDs, cache() uses MEMORY_ONLY, while on Datasets it uses MEMORY_AND_DISK. I can't see your full code, but generally speaking you shouldn't see a behavioural difference between the two ways of caching and persisting; your issue is most likely a version incompatibility between those components, or simply a bug that was never fixed by Apache.
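To make that difference concrete, the defaults can be modelled in plain Scala (a sketch with hypothetical names and no Spark dependency; the real levels live in org.apache.spark.storage.StorageLevel):

```scala
// Sketch of the default storage levels picked by cache() in Spark 1.6.
// Plain-Scala model with hypothetical names -- not the actual Spark source.
sealed trait Level
case object MEMORY_ONLY extends Level      // RDD.cache(): evicted partitions are recomputed
case object MEMORY_AND_DISK extends Level  // Dataset.cache(): evicted partitions spill to disk

// cache() is just persist() with the API's default level.
def rddCacheLevel: Level = MEMORY_ONLY
def datasetCacheLevel: Level = MEMORY_AND_DISK
```

The practical consequence is that under MEMORY_ONLY an evicted partition is recomputed from the source (which would re-read HBase and pick up new rows), while under MEMORY_AND_DISK it is read back from disk (which would not), so the two levels can surface different data when memory pressure evicts partitions.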

There are several places to check to see what's wrong:

The release notes at https://spark.apache.org/releases/spark-release-1-6-3.html show that maintenance of the code happens in branch 1.6, so that is the place to look at the source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/CacheManager.scala
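As for the behaviour itself: once a persisted DataFrame has been materialized, later actions reuse the cached copy and never go back to the source, so seeing the same rows on every batch is caching working as designed. A minimal plain-Scala sketch of the mechanism (no Spark; StaleCacheDemo and its members are hypothetical names for illustration only):

```scala
// Plain-Scala sketch of why a persisted read goes stale.
// `source` stands in for the HBase table; `cachedCopy` for the persisted DataFrame.
object StaleCacheDemo {
  var source: Map[String, Int] = Map("row1" -> 1)      // external store, updated between batches
  private var cachedCopy: Option[Map[String, Int]] = None

  // Models an action on a persisted DataFrame: the first call materializes a
  // snapshot; later "batches" reuse it and never touch the source again.
  def read(): Map[String, Int] = cachedCopy.getOrElse {
    val snapshot = source
    cachedCopy = Some(snapshot)
    snapshot
  }

  // Models df.unpersist(): drops the cached copy so the next read hits the source.
  def unpersist(): Unit = cachedCopy = None
}
```

In this model a second read() after updating source still returns the old snapshot until unpersist() is called, which is why the usual streaming pattern is to unpersist the DataFrame at the end of each batch and reload it at the start of the next.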

Hope it helped.

Aramis NSR