I run my PySpark code on Dataproc (serverless batches) using the command below:
gcloud dataproc batches submit pyspark src/features/spark_script.py \
--project=$PROJECT_ID \
--region=$REGION \
--deps-bucket=$DEPS_BUCKET \
--container-image=$CONTAINER_IMAGE \
--service-account=$SERVICE_ACCOUNT \
--subnet=$SUBNETWORK_URI \
--properties deployMode=cluster \
--py-files dist/src.zip \
--files pyproject.toml \
--archives dist/resources.zip
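One variant I still want to try: Spark's spark.archives setting accepts an optional #alias suffix that pins the directory an archive is extracted into (file.zip#directory). Assuming gcloud forwards --archives to spark.archives unchanged, only the last flag would change (the #resources alias is my addition, not something I have confirmed works on Dataproc batches):

gcloud dataproc batches submit pyspark src/features/spark_script.py \
    ... same flags as above ... \
    --archives dist/resources.zip#resources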
Within my code, I use the following to read a YAML file that is packaged inside resources.zip:
import yaml

with open("./resources/config/configuration.yaml", "r") as f:
    YAML_CONFIG = yaml.load(f, yaml.SafeLoader)
Still, it gives me
FileNotFoundError: [Errno 2] No such file or directory: './resources/config/configuration.yaml'
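To narrow this down, my plan is to log the driver's working directory from inside the job and see where (or whether) the archive actually gets extracted. A minimal debugging sketch:

import os

# Print the working directory and its contents so the driver log
# shows where resources.zip ends up.
print("cwd:", os.getcwd())
for root, dirs, files in os.walk("."):
    if root.count(os.sep) >= 2:
        dirs[:] = []  # prune below two levels to keep the log readable
    for name in files:
        print(os.path.join(root, name))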
I also tried setting deployMode=cluster as suggested here, but even that didn't help.
It works only when I pass the file through the --files argument, but I have many config files like this to load in the code, so --archives is the better option for me.
How can I effectively debug and run this?
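For context, this is the fallback lookup I'm considering until I understand the extraction layout. The candidate paths are guesses based on common Spark behaviour (unpacking either into the working directory or into a directory named after the archive); I have not confirmed them for Dataproc batches:

import os
import yaml
from pyspark import SparkFiles  # getRootDirectory() needs an active SparkContext

# Candidate locations, depending on how the archive is extracted:
#   ./resources/...      if the zip root is the resources/ directory itself
#   ./resources.zip/...  if Spark unpacks into a directory named after the archive
#   SparkFiles root      for single files shipped with --files
candidates = [
    "./resources/config/configuration.yaml",
    "./resources.zip/config/configuration.yaml",
    os.path.join(SparkFiles.getRootDirectory(), "configuration.yaml"),
]

# Use the first path that actually exists on this node.
config_path = next((p for p in candidates if os.path.exists(p)), None)
if config_path is None:
    raise FileNotFoundError(f"configuration.yaml not found; tried: {candidates}")

with open(config_path, "r") as f:
    YAML_CONFIG = yaml.load(f, yaml.SafeLoader)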