Unable to submit Pyspark code via ExecuteSparkInteractive processor in Apache NiFi

Question

I am new to Python and the Apache ecosystem. I am trying to submit PySpark code via the ExecuteSparkInteractive processor in Apache NiFi. I do not have detailed knowledge of any of the components involved; I am relying on Googling and trial and error.

This way, I have successfully configured and started Spark, NiFi and Livy on EMR, and I am able to submit PySpark code via Livy in an interactive session.

However, nothing happens when I configure ExecuteSparkInteractive to submit PySpark code via Livy. The Livy session manager shows nothing, and there are no errors visible in the ExecuteSparkInteractive processor.

This is my configuration for LivySessionController:

This is the sample code I set in the properties of ExecuteSparkInteractive:

import random
from pyspark import SparkConf, SparkContext
#create SparkContext using standalone mode
conf = SparkConf().setMaster("local").setAppName("SimpleETL")
sc = SparkContext.getOrCreate(conf)

NUM_SAMPLES = 100000

def sample(p):
  x, y = random.random(), random.random()
  return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)

Here is the code that works for me in an interactive session:

import json, pprint, requests, textwrap
host = 'http://localhost:8998'
data = {'kind': 'pyspark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)

#Get the session URL
session_url = host + r.headers['Location']
sn_r = requests.get(session_url, headers=headers)

statements_url = session_url + '/statements'

data = {
  'code': textwrap.dedent("""
import random
from pyspark import SparkConf, SparkContext
#create SparkContext using standalone mode
conf = SparkConf().setMaster("local").setAppName("SimpleETL")
sc = SparkContext.getOrCreate(conf)

NUM_SAMPLES = 100000
def sample(p):
  x, y = random.random(), random.random()
  return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
""")
}

r = requests.post(statements_url, data=json.dumps(data), headers=headers)
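The script above posts the statement but never reads its result. A minimal sketch of polling Livy until a session or statement settles could look like the following. It is written for Python 3 with only the standard library, and the helper names (`fetch_json`, `wait_for_state`, `SESSION_DONE`, `STATEMENT_DONE`) are made up for illustration; only the `state` field and its values come from the Livy REST API.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """GET a URL and decode the JSON body (default fetcher, needs a reachable Livy)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_state(url, done_states, fetch=fetch_json, interval=1.0, max_polls=60):
    """Poll `url` until its JSON 'state' field is one of `done_states`.

    Returns the last JSON document, or raises TimeoutError after `max_polls` tries.
    `fetch` is injectable so the loop can be exercised without a server.
    """
    for _ in range(max_polls):
        doc = fetch(url)
        if doc.get("state") in done_states:
            return doc
        time.sleep(interval)
    raise TimeoutError("%s did not reach any of %s" % (url, sorted(done_states)))


# Session states at which waiting should stop: ready ("idle") or failed for good.
SESSION_DONE = {"idle", "error", "dead", "killed"}
# Statement states at which the submitted code has finished, one way or another.
STATEMENT_DONE = {"available", "error", "cancelled"}
```

With these helpers, one would call `wait_for_state(session_url, SESSION_DONE)` before posting the statement, and afterwards `wait_for_state(host + r.headers['Location'], STATEMENT_DONE)` to obtain the document whose `output` field holds the printed result or the error.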

These are the log excerpts from nifi-app.log:

#After starting the processor
2018-07-18 06:38:11,768 INFO [NiFi Web Server-112] o.a.n.c.s.StandardProcessScheduler Starting ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:38:11,770 INFO [Monitor Processore Lifecycle Thread-1] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run with 1 threads
2018-07-18 06:38:11,883 INFO [Flow Service Tasks Thread-1] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false
2018-07-18 06:38:57,106 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@12830e23 checkpointed with 0 Records and 0 Swap Files in 7 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 2 millis), max Transaction ID -1

#After stopping the processor
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.c.s.StandardProcessScheduler Stopping ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.controller.StandardProcessorNode Stopping processor: class org.apache.nifi.processors.livy.ExecuteSparkInteractive
2018-07-18 06:39:09,838 INFO [Timer-Driven Process Thread-9] o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run
2018-07-18 06:39:09,917 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false

Interestingly, when I enable LivySessionController in NiFi, the Livy UI shows two new sessions. The one created first is in the "idle" state, while the later one (with the greater session ID) stays in the "starting" state even after several refreshes. Call them sessions 1 and 2, respectively. Session 2 then changes state from "starting" to "shutting_down" to "dead". As soon as it is dead, a new session (session 3) is created in the "starting" state, which later becomes "idle". Below are log excerpts from these 3 sessions:

#Livy 1st session:
18/07/18 06:33:58 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
18/07/18 06:33:58 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-84-145.ec2.internal:4040
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Shutting down all executors
18/07/18 06:33:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/18 06:33:58 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Stopped
18/07/18 06:33:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/18 06:33:59 INFO MemoryStore: MemoryStore cleared
18/07/18 06:33:59 INFO BlockManager: BlockManager stopped
18/07/18 06:33:59 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/18 06:33:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/18 06:33:59 INFO SparkContext: Successfully stopped SparkContext

#Livy 2nd session:
18/07/18 06:34:30 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

#Livy 3rd session:
18/07/18 06:36:15 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
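To follow these state changes without refreshing the Livy UI, the sessions can also be listed over the REST API. The sketch below assumes the same `http://localhost:8998` host used in the script earlier; the function names are hypothetical, and only the `GET /sessions` endpoint and the `id`, `kind`, `state` and `appId` fields are taken from Livy.

```python
import json
import urllib.request


def summarize_sessions(doc):
    """Turn the JSON returned by GET /sessions into (id, kind, state, appId) tuples."""
    return [
        (s.get("id"), s.get("kind"), s.get("state"), s.get("appId"))
        for s in doc.get("sessions", [])
    ]


def list_sessions(host="http://localhost:8998"):
    """Fetch and summarize all sessions known to the Livy server at `host`."""
    with urllib.request.urlopen(host + "/sessions") as resp:
        return summarize_sessions(json.loads(resp.read().decode("utf-8")))
```

Calling `list_sessions()` while the controller service starts should show the same "starting", "dead" and "idle" transitions. When a session reports a YARN `appId`, that ID can be passed to `yarn logs -applicationId` to look for the reason the application master failed to launch.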

Are you seeing anything on the `nifi-app.log` file when you start the processor? — Sivaprasanna Sethuraman, Jul 18 '18 at 04:47

score -1 · Answer 1 · answered Jun 22 '19 at 20:32

A few things here:

Livy session controller: Make sure you see 2 sessions per node when you enable the controller service, and that both sessions are in the running state in the Spark UI (they should not perform any operation until Python code is run from NiFi). If you see unusual behavior, focus on getting that fixed first. Possible action: add a StandardSSLContextService controller service, set up its keystore and truststore, and use it in LivySessionController (under the property SSL Context Service).

Within the Python code: I think you don't have to import SparkConf and SparkContext, and you don't need to create conf and sc either. You only need to import SparkSession, as below: from pyspark.sql import SparkSession

You can then simply use spark (it is available by default as the Spark session variable), e.g. spark.sql(""" ...sql-statement... """), or spark.sparkContext in place of sc.
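As a rough illustration of that advice, the Pi example from the question could be reduced to something like the following. This is only a sketch: it assumes Livy has already bound `spark` in the session, and the function names `inside_unit_circle` and `estimate_pi` are invented here.

```python
import random
from operator import add


def inside_unit_circle(_):
    """Return 1 if a random point in the unit square falls inside the quarter circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0


def estimate_pi(sc, num_samples):
    """Monte Carlo estimate of pi using an existing SparkContext `sc`."""
    hits = sc.parallelize(range(num_samples)).map(inside_unit_circle).reduce(add)
    return 4.0 * hits / num_samples


# Assumption: the Livy session already provides `spark`, so no SparkConf or
# SparkContext is created here and no master is hard-coded. The guard only
# keeps the snippet harmless when it is run outside a Livy session.
if "spark" in globals():
    print("Pi is roughly %f" % estimate_pi(spark.sparkContext, 100000))
```

Because it avoids `xrange` and the `print` statement, the same text should run under both Python 2 and Python 3 interpreters.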

Lastly, you mentioned that "Livy session manager shows nothing, and there are no errors visible in ExecuteSparkInteractive processor." For this, you can add a dummy processor such as UpdateAttribute after the ExecuteSparkInteractive processor and keep it disabled. You also have to route the output of ExecuteSparkInteractive to UpdateAttribute for all 3 relationships (success, failure, wait). The flow files then queue up in those connections, so you can see the outcome after the PySpark code runs within NiFi. Refer to the diagram below for a sample.

I hope this will help you fix your issues.

Up Vote if you like the answer

Sample NiFi template to test PySpark code
