This is what the command help says:
--archives=[ARCHIVE,...] Comma separated list of archives to be extracted into the working directory of each executor. Must be one of the following file formats: .zip, .tar, .tar.gz, or .tgz.
And this answer here tells me that --archives will only be extracted on worker nodes.
I am testing the --archives behavior the following way:
tl;dr - 1. I create an archive and zip it. 2. I create a simple RDD and map its element to os.walk('./'). 3. The archive.zip gets listed as a directory, but os.walk does not traverse down this branch.
My archive directory:
.
├── archive
│   ├── a1.py
│   ├── a1.txt
│   └── archive1
│       ├── a1_in.py
│       └── a1_in.txt
├── archive.zip
└── main.py
2 directories, 6 files
Testing code:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Walk the working directory on an executor (single-element RDD, so one task).
rdd = sc.parallelize(range(1))
walk_worker = rdd.map(lambda x: str(list(os.walk('./')))).distinct().collect()

# Walk the working directory on the driver for comparison.
walk_driver = list(os.walk('./'))

print('driver walk:', walk_driver)
print('worker walk:', walk_worker)
Dataproc run command:
gcloud dataproc jobs submit pyspark main.py --cluster pyspark-monsoon31 --region us-central1 --archives archive.zip
output:
driver walk: [('./', [], ['.main.py.crc', 'archive.zip', 'main.py', '.archive.zip.crc'])]
worker walk: ["[('./', ['archive.zip', '__spark_conf__', 'tmp'], ['pyspark.zip', '.default_container_executor.sh.crc', '.container_tokens.crc', 'default_container_executor.sh', 'launch_container.sh', '.launch_container.sh.crc', 'default_container_executor_session.sh', '.default_container_executor_session.sh.crc', 'py4j-0.10.9-src.zip', 'container_tokens']), ('./tmp', [], ['liblz4-java-5701923559211144129.so.lck', 'liblz4-java-5701923559211144129.so'])]"]
The output for the driver node: archive.zip is available but not extracted - EXPECTED.
The output for the worker node: os.walk is listing archive.zip as an extracted directory. The 3 directories available are ['archive.zip', '__spark_conf__', 'tmp']. But, to my surprise, only ./tmp is traversed further, and that is it.
I have checked using os.listdir that archive.zip actually is a directory and not a zip. Its structure is:
└── archive.zip
    └── archive
        ├── a1.py
        ├── a1.txt
        └── archive1
            ├── a1_in.py
            └── a1_in.txt
So, why is os.walk not walking down the archive.zip directory?
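One thing that might be worth checking, as an assumption rather than something the output above confirms: if archive.zip in the executor's working directory is a symlink to a directory extracted elsewhere (for example in a node-local cache), then os.walk would list it among the directories but not descend into it, because followlinks defaults to False. The sketch below reproduces that situation locally; the helper names walk_dirs and build_demo_tree and the cache/workdir layout are hypothetical and only imitate the executor layout.

```python
import os
import tempfile


def walk_dirs(root, followlinks=False):
    """Return the sorted directory paths that os.walk visits, relative to root."""
    visited = []
    for dirpath, _dirnames, _filenames in os.walk(root, followlinks=followlinks):
        visited.append(os.path.relpath(dirpath, root))
    return sorted(visited)


def build_demo_tree(base):
    """Create base/workdir with a real 'tmp' directory and a symlink
    'archive.zip' pointing at base/cache/archive.zip (hypothetical layout)."""
    extracted = os.path.join(base, "cache", "archive.zip")
    os.makedirs(os.path.join(extracted, "archive"))
    open(os.path.join(extracted, "archive", "a1.txt"), "w").close()

    workdir = os.path.join(base, "workdir")
    os.makedirs(os.path.join(workdir, "tmp"))
    os.symlink(extracted, os.path.join(workdir, "archive.zip"),
               target_is_directory=True)
    return workdir


with tempfile.TemporaryDirectory() as base:
    workdir = build_demo_tree(base)
    # True when the entry is a symlink, even though os.path.isdir is also True.
    is_link = os.path.islink(os.path.join(workdir, "archive.zip"))
    default_walk = walk_dirs(workdir)                   # symlinked dir is not entered
    follow_walk = walk_dirs(workdir, followlinks=True)  # symlinked dir is entered

print("is symlink:", is_link)
print("default walk:", default_walk)
print("followlinks walk:", follow_walk)
```

On the cluster, the equivalent check would be to map os.path.islink('./archive.zip') and list(os.walk('./', followlinks=True)) over the same RDD and compare the results with the worker walk shown above.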