I am trying to submit a PySpark job to Livy via the /batches endpoint, but I haven't found any good documentation. Life has been easy so far because we have been submitting Scala-compiled JAR files to Livy and specifying the job with className.
For the JAR file, we use:
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        'oof',
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
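
For context, we POST that payload to Livy's /batches endpoint roughly like this (the Livy host below is just a placeholder for our actual endpoint):

import json
import requests

# Placeholder Livy endpoint; in practice this points at our cluster's master node
LIVY_URL = 'http://livy-host:8998/batches'

# Send the batch payload as JSON; Livy replies with the batch id and its state
response = requests.post(
    LIVY_URL,
    data=json.dumps(data),
    headers={'Content-Type': 'application/json'},
)
print(response.json())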
I am unsure how to submit a PySpark job in a similar manner, where the package also has some relative imports...any thoughts?
For reference, the folder structure is below:
bar2
- __init__.py
- foo2.py
- bar3
  - __init__.py
  - foo3.py
I would then want to run:
from foo2 import ClassFoo
class_foo = ClassFoo(arg1, arg2)
class_foo.auto_run()
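
My best guess is that I would need to zip up the bar2 package, upload the zip to S3, point 'file' at a small driver script, and pass the zip through 'pyFiles', but I don't know if that is right or how the relative imports resolve. Something like the following is what I am imagining (driver.py and bar2.zip are hypothetical names, nothing I have working yet):

# driver.py -- hypothetical entry point uploaded next to the package zip
import sys

from foo2 import ClassFoo  # unsure whether this needs to be bar2.foo2 instead

if __name__ == '__main__':
    arg1, arg2 = sys.argv[1], sys.argv[2]
    class_foo = ClassFoo(arg1, arg2)
    class_foo.auto_run()

and then a /batches payload along the same lines as the JAR one:

data = {
    'file': 's3://foo-bucket/driver.py',
    'pyFiles': [
        's3://foo-bucket/bar2.zip',
    ],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        'oof',
        spark_master
    ],
    'conf': {
        'spark.submit.deployMode': 'cluster'
    }
}

Is that the right approach, or is there a better way to handle the package and its imports?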