
I am trying to submit a PySpark job to Livy using the /batches endpoint, but I haven't found any good documentation. Life has been easy so far because we submit Scala-compiled JAR files to Livy and specify the job with className.

For the JAR file, we use:

data={
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        "oof",
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
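For context, a payload like this is POSTed to Livy's /batches endpoint as JSON. A minimal sketch of that call, assuming a hypothetical Livy host (`livy-host:8998`) and a trimmed-down payload:

```python
# Sketch: POSTing a batch payload to Livy's /batches endpoint.
# The host name and payload values below are placeholders, not real endpoints.
import json
import urllib.request

LIVY_URL = "http://livy-host:8998/batches"  # hypothetical Livy server

payload = {
    "file": "s3://foo-bucket/bar.jar",
    "className": "com.foo.bar",
    "conf": {"spark.submit.deployMode": "cluster"},
}

def submit_batch(url: str, data: dict) -> urllib.request.Request:
    """Build the HTTP request Livy expects: a JSON body with the right header."""
    body = json.dumps(data).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

req = submit_batch(LIVY_URL, payload)
# urllib.request.urlopen(req) would actually submit the batch;
# Livy responds with the new batch's id and state.
```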

I am unsure how to submit a PySpark job in a similar manner, especially since the package has some relative imports. Any thoughts?

For reference, the folder structure is below:

  • bar2

    • __init__.py
    • foo2.py
    • bar3
      • __init__.py
      • foo3.py

I would want to then run:

from foo2 import ClassFoo
class_foo = ClassFoo(arg1, arg2)
class_foo.auto_run()
Eric Meadows
  • Nobody answered? I am also trying to find out how to submit pyspark job with python dependencies via Livy. – Z.Wei May 16 '19 at 02:44

2 Answers


You can try passing pyFiles:

data={
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'pyFiles': ['s3://<bucket>/<folder>/foo2.py', 's3://<bucket>/<folder>/foo3.py'],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        "oof",
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}

The key addition in the above example is:

'pyFiles': ['s3://<bucket>/<folder>/foo2.py', 's3://<bucket>/<folder>/foo3.py']

I have tried saving the files on the master node via bootstrapping, but noticed that Livy would send the request randomly to slave nodes where the files might not be present.

You may also pass the files as a .zip, although I haven't tried it.
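To illustrate the .zip route: the whole bar2 package can be bundled into one archive (keeping the __init__.py files so the package imports keep working) and that single archive passed via pyFiles. A sketch that recreates the question's layout in a temp directory and zips it; the archive name is illustrative:

```python
# Bundle a Python package directory into a .zip suitable for pyFiles.
# Paths are kept relative to the package's parent, so the archive contains
# bar2/__init__.py, bar2/foo2.py, bar2/bar3/__init__.py, bar2/bar3/foo3.py.
import os
import tempfile
import zipfile

def zip_package(package_dir: str, zip_path: str) -> None:
    """Zip a package directory, preserving its internal structure."""
    root = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for dirpath, _dirnames, filenames in os.walk(package_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, root))

# Recreate the folder structure from the question in a temp dir, then zip it.
tmp = tempfile.mkdtemp()
for rel in ("bar2/__init__.py", "bar2/foo2.py",
            "bar2/bar3/__init__.py", "bar2/bar3/foo3.py"):
    path = os.path.join(tmp, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

archive = os.path.join(tmp, "bar2.zip")
zip_package(os.path.join(tmp, "bar2"), archive)
print(sorted(zipfile.ZipFile(archive).namelist()))
```

The resulting bar2.zip would then be uploaded to S3 and referenced as a single pyFiles entry instead of listing every .py file.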

Neha Jirafe

You need to submit with file being the main Python executable and pyFiles being the additional internal libraries that are used. My advice would be to provision the server with a bootstrap action that copies your own libraries over and installs the pip-installable libraries on the master and nodes.
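A sketch of what that payload might look like, assuming the package is zipped as bar2.zip and a driver script named main.py exists (both names are hypothetical, as are the S3 paths):

```python
# Hypothetical /batches payload for a PySpark job: 'file' is the driver
# entry-point script, and 'pyFiles' carries the zipped bar2 package so that
# `from bar2.foo2 import ClassFoo` resolves on the cluster.
data = {
    'file': 's3://foo-bucket/main.py',       # main driver script (hypothetical)
    'pyFiles': ['s3://foo-bucket/bar2.zip'], # zipped package with __init__.py files
    'args': ['oof'],
    'name': 'foo-oof bar pyspark',
    'conf': {
        'spark.submit.deployMode': 'cluster'
    }
}
```

The driver script itself would hold the import shown in the question (`from bar2.foo2 import ClassFoo`, adjusted for the package prefix) and call `class_foo.auto_run()`.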

Eric Meadows