I have an application that submits jobs using Livy. Various jobs are submitted in the same Livy session. At times these jobs work on similar datasets, so I want to reuse data from one job in another. I cache the dataset in each job I submit, but whenever a new job is submitted, it does not pick up the cached dataset; it caches the same data all over again.

Does caching a dataset depend on the variable? For example, if I do

var d1 = //make some dataset
d1.cache

and in another subsequent job,

var d2 = //same dataset
d2.cache

can I expect there to be only one cached dataset, with d2 using the previously cached data? Currently I see separate cached data in the Storage section of my Spark application. For reference, I am using the Livy programmatic API to submit my jobs.

1 Answer

d1 and d2 will be created as two different DataFrames/Datasets in memory. You can't expect them to be the same, because they are two different objects. Also, cache() is lazy in Spark, much like a transformation: it does not cache the data as soon as the statement is executed. Only when an action (such as collect or count) runs at a later stage is the DAG evaluated and the caching triggered.
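
To make the lazy behaviour concrete, here is a minimal sketch. It assumes Spark 2.1+ on the classpath and a local SparkSession created only for illustration (under Livy the session already exists); the object name, app name, and the range-based dataset are placeholders, not taken from the question. It also prints the storage level Spark reports for a second, identically built dataset, which is one way to check whether d2 was matched to an existing cache entry in your own setup.

```scala
import org.apache.spark.sql.SparkSession

object LazyCacheSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; a Livy job would use
    // the session provided by its context instead of building one.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-cache-sketch") // hypothetical name
      .getOrCreate()

    // Placeholder for "make some dataset".
    val d1 = spark.range(0, 1000).toDF("id")

    // Registers d1 for caching; no data is computed or stored yet.
    d1.cache()

    // An action forces evaluation, which is when the cache is populated.
    // After this line the dataset should appear in the Storage tab.
    d1.count()

    // A second variable built the same way, in the same session.
    val d2 = spark.range(0, 1000).toDF("id")

    // Report what Spark considers the storage level of each dataset.
    println(s"d1 storage level: ${d1.storageLevel}")
    println(s"d2 storage level: ${d2.storageLevel}")

    spark.stop()
  }
}
```

If the Storage tab stays empty after cache(), check whether any action has actually run on that dataset in the job.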

Prashant
  • I guess that the crux of @user9024779's question is: does Livy support caching? DBConnect, by Databricks, which serves a similar function to Livy (remote connection to Spark/Databricks), does not allow that (even within the same job, at the time of writing this comment, ref: https://docs.databricks.com/user-guide/dev-tools/db-connect.html#limitations) – Ran Feldesh Jun 20 '19 at 19:24
  • I am not sure I understood this. Livy is a REST interface for submitting code and managing sessions. Caching is a Spark concept. – Prashant Jun 22 '19 at 00:41
  • Thanks so much @Prashant ! Agreed :). Since DBconnect is also an interface and does not allow caching on the Spark cluster, I wondered if Livy would similarly not support that (by the way, I am not sure if DBconnect is using Livy under the hood). Thanks :) – Ran Feldesh Jun 23 '19 at 01:24
  • @Prashant Apache Livy has a feature defined as: "Share cached RDDs or DataSets across multiple jobs and clients". I believe the question was about the meaning of that feature – VB_ Nov 24 '19 at 10:59