
I am attempting to use the Cerner Bunsen package for FHIR processing in PySpark on an AWS EMR cluster, specifically the Bundles class and its methods. I am creating the Spark session through the Apache Livy API:

import json
import logging
import requests

def create_spark_session(master_dns, kind, jars):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind, 'jars': jars}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    logging.info(response.json())
    return response.headers

Here, kind = pyspark3 and jars is an S3 location that houses the jar (bunsen-shaded-1.4.7.jar).
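
For illustration, a call could look something like this (the master DNS and S3 path are placeholders, not the actual values; per the Livy REST API, jars is a list of strings):

session_headers = create_spark_session(
    master_dns='ip-10-0-0-1.ec2.internal',                 # placeholder EMR master DNS
    kind='pyspark3',
    jars=['s3://<bucket>/jars/bunsen-shaded-1.4.7.jar'])   # placeholder S3 path to the shaded jar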

The data transformation script attempts to import the class from the jar and call its methods via:

from pyspark import SparkContext
from py4j.java_gateway import java_import

# Setting the Spark Session and pulling the existing SparkContext
sc = SparkContext.getOrCreate()

# Cerner Bunsen
java_import(sc._gateway.jvm, "com.cerner.bunsen.Bundles")
func = sc._gateway.jvm.Bundles()

The error I am receiving is:

"py4j.protocol.Py4JError: An error occurred while calling None.com.cerner.bunsen.Bundles. Trace:\npy4j.Py4JException: Constructor com.cerner.bunsen.Bundles([]) does not exist"

This is the first time I have attempted to use java_import so any help would be appreciated.
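
One thing the trace does say: py4j resolved the class itself but could not find a zero-argument constructor on it. A quick way to see what the class actually exposes is plain Java reflection through the existing gateway (a sketch; this only works if the jar is already on the driver classpath):

# Sketch: list the constructors and methods the class actually exposes.
bundles_class = sc._gateway.jvm.java.lang.Class.forName("com.cerner.bunsen.Bundles")
for ctor in bundles_class.getConstructors():
    print(ctor.toString())
for method in bundles_class.getMethods():
    print(method.toString())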

EDIT: I changed the transformation script slightly and am now seeing a different error. I can see the jar being added in the logs, so I am certain it is there and that the jars parameter in the Livy call is working as intended. The new transformation is:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import

# Setting the Spark Session and pulling the existing SparkContext
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Manage logging
#sc.setLogLevel("INFO")

# Cerner Bunsen
java_import(sc._gateway.jvm, "com.cerner.bunsen")
func_main = sc._gateway.jvm.Bundles
func_deep = sc._gateway.jvm.Bundles.BundleContainer

fhir_data_frame = func_deep.loadFromDirectory(spark, "s3://<bucket>/source_database/Patient", 1)
fhir_data_frame_fromJson = func_deep.fromJson(fhir_data_frame)
fhir_data_frame_clean = func_main.extract_entry(spark, fhir_data_frame_fromJson, 'patient')
fhir_data_frame_clean.show(20, False)

and the new error is:

'JavaPackage' object is not callable

Searching for this error has been a bit futile, but again, if anyone has ideas I will gladly take them.
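
One diagnostic that seems worth running (a sketch, assuming the same sc as above): py4j returns a JavaClass when it can resolve a name on the driver classpath and silently falls back to a JavaPackage when it cannot, which is exactly what makes the object "not callable". Printing the type of the resolved name shows which case this is:

# Diagnostic sketch: JavaClass means the jar is visible on the driver classpath;
# JavaPackage means py4j could not resolve the class and treated the name as a package.
resolved = sc._gateway.jvm.com.cerner.bunsen.Bundles
print(type(resolved))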

ggeop
  • 1,230
  • 12
  • 24
user1983682
  • 208
  • 6
  • 20

1 Answer


If you want to use a Scala/Java class from PySpark, you also have to add the jar package to the classpath. You can do it in two different ways:

Option 1: Pass it to spark-submit with the --jars flag:

 spark-submit --jars /path/to/bunsen-shaded-1.4.7.jar example.py

Option 2: Set it in the spark-defaults.conf file.

Add the following to path/to/spark/conf/spark-defaults.conf:

# Comma-separated list of jars to include on the driver and executor classpaths.
spark.jars /path/to/bunsen-shaded-1.4.7.jar
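
Since the session in the question is created through Livy rather than spark-submit, the equivalent there is to put the jar in the session-creation request, either via the jars field or as spark.jars under conf (a sketch; the S3 path is a placeholder):

data = {
    'kind': 'pyspark3',
    # Either of these should put the jar on the driver and executor classpaths:
    'jars': ['s3://<bucket>/jars/bunsen-shaded-1.4.7.jar'],
    'conf': {'spark.jars': 's3://<bucket>/jars/bunsen-shaded-1.4.7.jar'}
}
response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)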
  • I believe the first option is carried over in the Livy REST API call via data = {'kind': kind, 'jars': jars}, where jars = bunsen-shaded-1.4.7.jar. Is that not the case? Reference: https://livy.incubator.apache.org/docs/latest/rest-api.html, the POST /sessions call – user1983682 Jan 22 '20 at 17:39
  • @user1983682 I haven't used the Livy API, but it looks like it does the same job. Your error is very clear: the jar has not been added to the classpath. Maybe you have a typo, a wrong path, or a packaging error, something like that. Do you have access to the Spark UI? If so, go to the Environment tab and check whether it was added correctly. – ggeop Jan 22 '20 at 19:09
  • It is definitely there, I can see it getting added in the logs. I changed the data transformation slightly and am now getting a different error. I will add an edit above. – user1983682 Jan 24 '20 at 13:20
  • @user1983682 There is a simple solution to your problem, have you looked at this documentation: https://engineering.cerner.com/bunsen/0.4.6/introduction.html – ggeop Jan 24 '20 at 16:40
  • Now the jar looks to be in the right place, but Spark can't find it. In my view, the import path looks suspicious. – ggeop Jan 24 '20 at 16:42
  • Yeah, I did try their simple solution, but unfortunately it didn't seem compatible with the bigger picture (Apache Airflow launching the EMR and using Livy to call Spark). I've made further changes (removing the jars parameter from Livy and instead using bootstrapping to get the file, then EMR Configurations to set the spark-defaults). The error is now that the constructor does not exist, so I think you are right about the import path or something about how the class is called. Working on that and will report back. Thanks again, @ggeop! – user1983682 Jan 25 '20 at 17:35
  • Yes, try to investigate the import path and you will find it, I think you are close now with the new error! Keep in touch and let me know when you find it. – ggeop Jan 25 '20 at 17:51