
I'm trying to submit a .NET for Apache Spark job to Dataproc.

The command line looks like:

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find

This command should run find to show the files in the current directory.

But I see only 2 files:

././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc

It appears that GCP does not unpack the file from Storage specified via --archives. The specified file exists and the path was copied from the GCP UI. I also tried to run the exact assembly file from the archive (which exists), but it understandably fails with File does not exist.

2 Answers


I think the problem is that your command ran in the Spark driver, which runs on the master node, because Dataproc submits jobs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
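For example, applied to the command from the question this would look something like the following (a sketch only; cluster, region and bucket names are the placeholders from the question):

# same submission as in the question, plus cluster deploy mode so the
# driver also runs in a YARN container instead of on the master node
gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    --properties=spark.submit.deployMode=cluster \
    -- find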

According to the usage help of the --archives flag:

 --archives=[ARCHIVE,...]
   Comma separated list of archives to be extracted into the working
   directory of each executor. Must be one of the following file formats:
   .zip, .tar, .tar.gz, or .tgz.

The archive will be copied to both the driver and executor directories, but it will only be extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes 2 files, foo.txt and deps.txt, and I could then find the extracted files on the worker nodes:

my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt
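In other words, code running on an executor can reference the archive contents relative to its working directory, e.g. (a sketch assuming the foo.zip example above, where YARN exposes the archive as a directory named after it):

# from inside an executor container's working directory
cat ./foo.zip/deps.txt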
Dagang
  • Want to check this parameter. It's weird that the archive is not unpacked on the master node if it is... It looks like the CLI uses different credentials to process --archives and --jars files. – dr11 Jul 03 '20 at 23:38
  • 1
    Just checked. This is True... This is sooooo probelmatic IMO. How can one ever ship python dependencies in client mode then? main would run on driver as well and break at the import statement since the working directory won't have them .. – figs_and_nuts Jan 11 '22 at 17:01
  • @MiloMinderbinder Dataproc recently made a change to unpack archives in the driver's working dir. It should have been included in the latest images. Can you give it a try? – Dagang Jan 11 '22 at 18:57
  • I have the latest image only. Here is what I am doing: since Dataproc does not yet have a pyspark 3.2 image available and I needed that for my work, I am creating a Dataproc cluster with an env that has pyspark 3.2 in it. Then I am manually modifying spark-env.sh to point SPARK_HOME to the pyspark inside my environment. The original cluster is created with pyspark 3.1 and Ubuntu 18.04. This is as recent as 2 days ago. --archives does not unpack at the driver node, --py-files does not unpack at all but the contents are available everywhere to import. I have not yet checked --files – figs_and_nuts Jan 11 '22 at 19:04
  • @Dagang I am seeing that the latest image available was released on 22/1/2021. This is in the console under 'versioning' on the 'set up a cluster' page. Am I missing something? – figs_and_nuts Jan 11 '22 at 19:10
  • How do you submit Spark job? through spark-submit or Dataproc CLI / API? The feature (extracting archives in driver work dir) is only available if you use Dataproc CLI/API. – Dagang Jan 11 '22 at 19:23
  • CLI ```gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1 --archives=nitin.zip``` – figs_and_nuts Jan 11 '22 at 20:05
  • @Dagang Do you know if this has been resolved? I am finding that GCP Spark does not seem to unpack files provided by the --archives flag. Thus it is impossible to ship dependencies for me. Thanks for your insights! – alta May 18 '22 at 07:54
  • @figs_and_nuts - Can you try adding --archives=gs://bucket/nitin.zip#nitin to get this extracted automatically. – Tom J Muthirenthi May 11 '23 at 13:16
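Applied to the original command, the fragment-alias form suggested in the last comment would look something like this (untested sketch; the name after the # is an arbitrary directory the archive should be extracted into):

# the part after '#' names the directory the archive is extracted into
gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip#dotnet-build-output \
    -- find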

As @dagang mentioned, the --archives and --files parameters will not extract the zip file on the driver instance, so that was the wrong direction.

I used this approach:

gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"
dr11
  • ```--files``` actually do get copied all the same to drivers and workers. ```--files``` is just a comma-separated list of files and you can send a zip or a txt or anything; it will be placed on both workers and drivers and no extraction takes place. For ```--archives``` as well, they get placed on the driver but they are not extracted. You just get a .zip, which is the compressed file that you sent. On executors that .zip is actually a directory containing what you had zipped – figs_and_nuts Jan 19 '22 at 22:39
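Building on that observation (the zip is staged into the driver's working directory by --files, just not extracted), a variant of the workaround above that skips the explicit gsutil cp might look like this (untested sketch; bucket and file names are the placeholders used above):

# test.zip is copied (not extracted) into the driver's working directory by --files
gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
    --files=gs://<bucket>/builds/test.zip \
    -- /bin/sh -c "unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"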