Intro
I have a Docker container configured with the Glue ETL PySpark environment, thanks to this AWS Glue tutorial. I used the "helloworld.py" from it:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # without this, `spark` is only predefined inside the pyspark shell
medicare = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load('s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()
I cannot run it with spark-submit helloworld.py
because I run into the well-known error:
ModuleNotFoundError: No module named 'dynamicframe'
I found a hack using the redirection operator: pyspark < helloworld.py
and it works like a charm.
My problem
However, I now need to pass some arguments to my script.
Before trying Glue ETL, I used to run: spark-submit myScript.py arg1 arg2 arg3
When I naively tried pyspark < myScript.py arg1 arg2 arg3
I got the following error:
Error: pyspark does not support any application options.
Minimal myScript.py to reproduce
import sys
from pyspark import SparkContext
from awsglue.context import GlueContext
# Hello world: echo the three positional arguments passed to the script
glueContext = GlueContext(SparkContext.getOrCreate())
print(sys.argv[1] + " " + sys.argv[2] + " " + sys.argv[3])
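For what it's worth, the tutorial's helloworld.py imports getResolvedOptions, which as far as I understand is Glue's helper for parsing named --key value job parameters out of sys.argv. Here is a rough sketch of how I imagine it would be used (the argument names arg1 to arg3 are made up by me), though I have not confirmed whether this works with the pyspark redirection trick:

import sys
from awsglue.utils import getResolvedOptions

# Hypothetical named arguments; they would be passed on the command line as:
#   ... myScript.py --arg1 foo --arg2 bar --arg3 baz
args = getResolvedOptions(sys.argv, ['arg1', 'arg2', 'arg3'])
print(args['arg1'] + " " + args['arg2'] + " " + args['arg3'])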
Is there any way to keep using pyspark instead of spark-submit while still passing arguments?
Or am I completely off track, and is there a way to make spark-submit work with Glue?