49

I'm trying to use spark-submit to execute my Python code on a Spark cluster.

Generally we run spark-submit with Python code like below:

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  my_python_code.py \
  1000

But I want to run my_python_code.py by passing several arguments. Is there a smart way to pass arguments?

Aniket Kulkarni
Jinho Yoo

5 Answers

59

Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", help="some useful description.")
args = parser.parse_args()
if args.ngrams:
    ngrams = args.ngrams

This way, you can launch your job as follows:

spark-submit job.py --ngrams 3

More information about the argparse module can be found in the Argparse Tutorial.
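For context, here is a minimal sketch of how the parsed value might flow into the job itself; the SparkSession setup, the app name, and the default value are my own assumptions, not part of the original answer:

import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", type=int, default=2,
                    help="n-gram size to use (the default here is an assumption)")
args = parser.parse_args()

# Build the session and use the parsed argument wherever the job needs it.
spark = SparkSession.builder.appName("ngrams-job").getOrCreate()
print("Running with ngrams =", args.ngrams)
spark.stop()

Launched the same way as above, e.g. spark-submit job.py --ngrams 3.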

Marco
noleto
  • Not working! Results says " [TerminalIPythonApp] CRITICAL | Unrecognized flag: '--ngrams' " – Andre Carneiro Aug 08 '18 at 13:19
  • If you have configs you want to send with your spark submit job, make sure to run with config info right after spark-submit, like: `spark-submit --master somemasterurl job.py --ngrams 3` – Will Aug 09 '18 at 22:25
  • Haven't tried this solution but this sounds a better one because it can remove the dependency on argument sequence. – Z.Wei Jun 03 '19 at 15:26
  • Has anybody figured how to use Pyspark with argparse? I'm continually getting an error `Unrecognized flag --arg1` and it's driving me insane! (Spark 2.4.4 and Python 3.6) – prrao Apr 27 '20 at 20:51
  • i have a whole json of parameters that I need to pass. Is there a simple way of getting that embedded in the spark submit command? – eljusticiero67 Nov 15 '22 at 23:40
46

Yes: Put this in a file called args.py

import sys
print(sys.argv)

If you run

spark-submit args.py a b c d e 

You will see:

['/spark/args.py', 'a', 'b', 'c', 'd', 'e']
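Note that sys.argv[0] is the script path itself, so your own values start at index 1. A minimal sketch of unpacking them (the variable names are just for illustration):

import sys

# sys.argv[0] is the script path; everything after it is what you passed
# on the spark-submit command line.
script_path, *user_args = sys.argv
print(script_path)  # e.g. /spark/args.py
print(user_args)    # e.g. ['a', 'b', 'c', 'd', 'e']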
Paul
3

You can pass arguments from the spark-submit command and then access them in your code in the following way:

sys.argv[1] gives you the first argument, sys.argv[2] the second, and so on. Refer to the example below.

Create code like the following to read the arguments you pass on the spark-submit command line:

import sys

# First argument: how many table names follow on the command line.
n = int(sys.argv[1])

# The next n arguments are the table names themselves.
tables = sys.argv[2:2 + n]
print(tables)

Save the above file as PysparkArg.py and execute the spark-submit command below:

spark-submit PysparkArg.py 3 table1 table2 table3

Output:

['table1', 'table2', 'table3']

This piece of code is useful in PySpark jobs that need to fetch multiple tables from a database, where the number of tables and the table names are supplied by the user in the spark-submit command.
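As a rough illustration of that use case, the collected table names could be fed straight into Spark reads; the SparkSession and the my_database catalog name below are assumptions for the sketch, not part of the original answer:

import sys

from pyspark.sql import SparkSession

n = int(sys.argv[1])
table_names = sys.argv[2:2 + n]

spark = SparkSession.builder.appName("multi-table-load").getOrCreate()

# Read each table named on the command line (the database name is hypothetical).
dataframes = {name: spark.read.table("my_database." + name) for name in table_names}
for name, df in dataframes.items():
    print(name, df.count())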

Vivarsh
1

Ah, it's possible. http://caen.github.io/hadoop/user-spark.html

# Run as a Hadoop (YARN) job on <your_queue>, with 10 executors,
# 12 GB of memory and 2 CPU cores per executor
spark-submit \
    --master yarn-client \
    --queue <your_queue> \
    --num-executors 10 \
    --executor-memory 12g \
    --executor-cores 2 \
    job.py ngrams/input ngrams/output
Jinho Yoo
  • I think the question is not how to pass them in but rather how to access the arguments once they were passed in – joh-mue Mar 05 '19 at 14:05
1

Aniket Kulkarni's spark-submit args.py a b c d e seems to suffice, but it's worth mentioning that we had issues with optional/named args (e.g. --param1).

It appears that a double dash -- will help signal that Python optional args follow:

spark-submit --sparkarg xxx yourscript.py -- --scriptarg 1 arg1 arg2
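If you are unsure how the arguments actually arrive in the script, a quick sanity check (just a sketch) is to print them at the top of yourscript.py before handing them to argparse or sys.argv indexing:

import sys

# Shows exactly what spark-submit forwarded to the application, so you can
# see whether the -- separator and the named args arrived as expected.
print(sys.argv)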
trevorgrayson