I'm quite new to Big Data. Currently, I'm working on a CLI project that performs some text parsing using Apache Spark.
When a command is typed, a new SparkContext is instantiated and some files are read from an HDFS instance. However, Spark takes too much time to initialize the SparkContext (or even a SparkSession object).
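To illustrate, each command currently does roughly the following (the HDFS path and the parsing step here are placeholders, not my real code):

```python
from pyspark.sql import SparkSession

# A new session (and its underlying SparkContext) is created on every
# command; this startup is where most of the time goes.
spark = SparkSession.builder \
    .appName("text-parser") \
    .getOrCreate()

# Placeholder path and parsing logic for illustration only
lines = spark.read.text("hdfs://namenode:9000/data/input.txt")
parsed = lines.filter(lines.value.contains("ERROR"))
parsed.show()

spark.stop()  # session is torn down, so the next command pays the cost again
```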
So, my question is: is there a way to reuse a SparkContext instance across these commands to reduce this overhead? I've heard about Spark Job Server, but deploying a local server has been too hard so far, since its main guide is a bit confusing.
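The only workaround I've come up with so far is keeping one long-running Python process that holds a single session and reads commands in a loop, instead of starting a fresh process per command. A rough sketch of that idea (the `parse <path>` command format is made up for illustration):

```python
from pyspark.sql import SparkSession

# Create the session once, outside the command loop
spark = SparkSession.builder.appName("text-parser").getOrCreate()

while True:
    command = input("parser> ").strip()
    if command in ("quit", "exit"):
        break
    # Dispatch on the command; the expensive context is reused every time
    if command.startswith("parse "):
        path = command.split(" ", 1)[1]
        df = spark.read.text(path)
        print(df.count())

spark.stop()
```

But I'd prefer something that works with separate CLI invocations, which is why I was looking at Spark Job Server.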
Thank you.
P.S.: I'm using PySpark.