I am attempting to use the Cerner Bunsen package for FHIR processing in PySpark on AWS EMR, specifically the Bundles class and its methods. I am creating the Spark session through the Apache Livy API:
import json
import logging

import requests

def create_spark_session(master_dns, kind, jars):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind, 'jars': jars}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    logging.info(response.json())
    # Livy reports the new session's URL in the Location response header
    return response.headers
where kind = pyspark3 and jars is the S3 location of the jar (bunsen-shaded-1.4.7.jar).
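For reference, a call to that helper might look like the sketch below; the master DNS and bucket path are placeholders, and per the Livy REST documentation the jars field is a list of strings:

session_headers = create_spark_session(
    'ip-xx-xx-xx-xx.ec2.internal',
    'pyspark3',
    ['s3://<bucket>/jars/bunsen-shaded-1.4.7.jar'])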
The transformation script attempts to import the class from the jar and call its methods via:
# Setting the Spark Session and Pulling the Existing SparkContext
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway

java_import(sc._gateway.jvm, "com.cerner.bunsen.Bundles")
func = sc._gateway.jvm.Bundles()  # this line raises the error below
The error I am receiving is:
"py4j.protocol.Py4JError: An error occurred while calling None.com.cerner.bunsen.Bundles. Trace:\npy4j.Py4JException: Constructor com.cerner.bunsen.Bundles([]) does not exist"
This is the first time I have attempted to use java_import, so any help would be appreciated.
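From the py4j documentation, I gather this error means Bundles does not expose a public zero-argument constructor, so it presumably has to be obtained through a static factory method rather than called like a constructor. A sketch of what I understand that to look like; forStu3 here is a guess based on factory names in other Bunsen releases and may not match this jar version:

java_import(sc._gateway.jvm, "com.cerner.bunsen.Bundles")
# Assumption: the shaded jar exposes a static factory such as forStu3();
# the exact method name for bunsen-shaded-1.4.7 is unverified
bundles = sc._gateway.jvm.com.cerner.bunsen.Bundles.forStu3()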
EDIT: I changed the transformation script slightly and am now seeing a different error. I can see the jar being added in the logs, so I am certain it is there and that the jars parameter is working as intended. The new transformation is:
# Setting the Spark Session and Pulling the Existing SparkContext
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
# Manage logging
#sc.setLogLevel("INFO")
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm, "com.cerner.bunsen")
func_main = sc._gateway.jvm.Bundles
func_deep = sc._gateway.jvm.Bundles.BundleContainer
fhir_data_frame = func_deep.loadFromDirectory(spark, "s3://<bucket>/source_database/Patient", 1)
fhir_data_frame_fromJson = func_deep.fromJson(fhir_data_frame)
fhir_data_frame_clean = func_main.extract_entry(spark, fhir_data_frame_fromJson, 'patient')
fhir_data_frame_clean.show(20, False)
and the new error is:
'JavaPackage' object is not callable
Searching for this error has been a bit futile, but again, if anyone has ideas, I will gladly take them.
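For what it's worth, from what I have read, a 'JavaPackage' object is not callable error from py4j usually means the JVM never resolved the name to a class: either the jar is missing from the driver's classpath, or (as I now suspect) java_import was given a bare package name instead of a class or wildcard. The check I plan to run next uses only standard py4j and java.lang.Class, nothing Bunsen-specific:

# If this raises ClassNotFoundException, the jar never reached the driver's
# classpath; if it returns a Class object, the name resolves and the problem
# is in how the class is referenced through the gateway
sc._jvm.java.lang.Class.forName("com.cerner.bunsen.Bundles")

# java_import expects a fully qualified class name or a wildcard, not a bare
# package name, so "com.cerner.bunsen" above would leave jvm.Bundles unresolved
java_import(sc._gateway.jvm, "com.cerner.bunsen.*")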