
This is what the command help says:

--archives=[ARCHIVE,...] Comma separated list of archives to be extracted into the working directory of each executor. Must be one of the following file formats: .zip, .tar, .tar.gz, or .tgz.

and this answer tells me that --archives will only be extracted on the worker nodes

I am testing the --archives behavior the following way (tl;dr):

1. I create an archive directory and zip it.
2. I create a simple RDD and map its elements to os.walk('./').
3. archive.zip gets listed as a directory on the workers, but os.walk does not traverse down that branch.

My archive directory:

.
├── archive
│   ├── a1.py
│   ├── a1.txt
│   └── archive1
│       ├── a1_in.py
│       └── a1_in.txt
├── archive.zip
└── main.py

2 directories, 6 files
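For reference, archive.zip can be produced from this tree with something like the following (a sketch, not the exact command I ran; shutil here is just one equivalent of zip -r archive.zip archive):

import shutil

# Sketch (not from the original post): build archive.zip from the ./archive
# directory shown above, equivalent to `zip -r archive.zip archive`.
shutil.make_archive('archive', 'zip', root_dir='.', base_dir='archive')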

Testing code:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Walk the working directory on an executor...
rdd = sc.parallelize(range(1))
walk_worker = rdd.map(lambda x: str(list(os.walk('./')))).distinct().collect()
# ...and on the driver, for comparison.
walk_driver = list(os.walk('./'))
print('driver walk:', walk_driver)
print('worker walk:', walk_worker)

Dataproc run command:

gcloud dataproc jobs submit pyspark main.py --cluster pyspark-monsoon31 --region us-central1 --archives archive.zip

output:

driver walk: [('./', [], ['.main.py.crc', 'archive.zip', 'main.py', '.archive.zip.crc'])]
worker walk: ["[('./', ['archive.zip', '__spark_conf__', 'tmp'], ['pyspark.zip', '.default_container_executor.sh.crc', '.container_tokens.crc', 'default_container_executor.sh', 'launch_container.sh', '.launch_container.sh.crc', 'default_container_executor_session.sh', '.default_container_executor_session.sh.crc', 'py4j-0.10.9-src.zip', 'container_tokens']), ('./tmp', [], ['liblz4-java-5701923559211144129.so.lck', 'liblz4-java-5701923559211144129.so'])]"]

The output for the driver node: archive.zip is available but not extracted - EXPECTED

The output for the worker node: os.walk lists archive.zip as an extracted directory. The three directories available are ['archive.zip', '__spark_conf__', 'tmp']. But, to my surprise, only ./tmp is traversed further, and that is it.

I have checked, using os.listdir, that archive.zip actually is a directory and not a zip file (a sketch of that check follows the tree below). Its structure is:

└── archive.zip
    └── archive
        ├── a1.py
        ├── a1.txt
        └── archive1
            ├── a1_in.py
            └── a1_in.txt
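A minimal sketch of that kind of check (the helper below is illustrative and reuses the rdd from the testing code; it is not the exact code I ran):

import os

# Illustrative check, run on an executor: archive.zip behaves like a directory.
def inspect(_):
    return (os.path.isdir('./archive.zip'),   # True
            os.listdir('./archive.zip'))      # ['archive']

print(rdd.map(inspect).collect())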

So, why is os.walk not walking down the archive.zip directory?

figs_and_nuts
  • Does this answer your question? [Dataproc does not unpack files passed as Archive](https://stackoverflow.com/questions/62645635/dataproc-does-not-unpack-files-passed-as-archive) – Igor Dvorzhak Jul 10 '22 at 21:05

1 Answer


archive.zip is added as a symlink on the worker nodes, and os.walk does not follow symlinks by default.
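This is standard os.walk behavior and can be reproduced locally without Spark (a self-contained sketch; the directory names are made up for illustration):

import os
import tempfile

# Reproduce locally: os.walk lists a symlinked directory in dirnames
# but does not descend into it unless followlinks=True.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'real_dir'))
open(os.path.join(root, 'real_dir', 'f.txt'), 'w').close()
os.symlink(os.path.join(root, 'real_dir'), os.path.join(root, 'link_dir'))

print(list(os.walk(root)))                    # link_dir listed, never entered
print(list(os.walk(root, followlinks=True)))  # link_dir traversed as well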

If you change it to walk_worker = rdd.map(lambda x: str(list(os.walk('./', followlinks=True)))).distinct().collect(), you will get the output you are looking for:

worker walk: ["[('./', ['__spark_conf__', 'tmp', 'archive.zip'], ...
 ('./archive.zip', ['archive'], []), ('./archive.zip/archive', ['archive1'], ['a1.txt', 'a1.py']), ...."]
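If you want to confirm the symlink directly, a check along these lines should work (a sketch reusing the rdd from the question; the expected values are assumptions based on how YARN localizes archives):

# Sketch: confirm on an executor that archive.zip is a symlink.
check = rdd.map(lambda x: (os.path.islink('./archive.zip'),
                           os.path.realpath('./archive.zip'))).collect()
print(check)  # expect True plus the real path of the extracted archive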
ntr