Intro
I have a Docker container configured with the Glue ETL PySpark environment, thanks to this AWS Glue tutorial. I used the "helloworld.py" from it:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # without this, `spark` is only predefined inside the pyspark shell
medicare = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load('s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()
I cannot run it with spark-submit helloworld.py
because I run into the well-known error:
ModuleNotFoundError: No module named 'dynamicframe'
I found a hack using the redirection operator: pyspark < helloworld.py
and it works like a charm.
My problem
However, I now need to pass some arguments to my script.
Before trying Glue ETL, I used to run: spark-submit myScript.py arg1 arg2 arg3
When I naively tried pyspark < myScript.py arg1 arg2 arg3
I got the following error:
Error: pyspark does not support any application options.
Minimal myScript.py to reproduce
import sys
from pyspark import SparkContext
from awsglue.context import GlueContext
# Hello world: echo the three positional arguments passed to the script
glueContext = GlueContext(SparkContext.getOrCreate())
print(sys.argv[1] + " " + sys.argv[2] + " " + sys.argv[3])
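For what it's worth, the tutorial's helloworld.py imports getResolvedOptions, which as far as I understand is Glue's helper for parsing named --key value job parameters out of sys.argv. Here is a rough sketch of how I imagine it would be used (the argument names arg1 to arg3 are made up by me), though I have not confirmed whether this works with the pyspark redirection trick:

import sys
from awsglue.utils import getResolvedOptions

# Hypothetical named arguments; they would be passed on the command line as:
#   ... myScript.py --arg1 foo --arg2 bar --arg3 baz
args = getResolvedOptions(sys.argv, ['arg1', 'arg2', 'arg3'])
print(args['arg1'] + " " + args['arg2'] + " " + args['arg3'])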
Is there any way to keep using pyspark instead of spark-submit while still passing arguments?
Or am I completely off track, and is there a way to make spark-submit work with Glue?