I had some problems running the s3-dist-cp command in my PySpark script, as I needed to move data from S3 to HDFS for better performance. So I am sharing the solution here.
2 Answers
import os
os.system(
    "/usr/bin/s3-dist-cp "
    "--src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/ "
    "--dest=/de_pulse/ "
    "--groupBy='.*(additional).*' "
    "--targetSize=64 "
    "--outputCodec=none"
)
Note: please make sure that you give the full path to s3-dist-cp, e.g. /usr/bin/s3-dist-cp.
Also, I think we can use subprocess instead of os.system.
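A minimal sketch of the subprocess variant, assuming the same paths as the answer above (adjust the bucket and destination to your environment). Passing the command as an argument list avoids shell quoting issues with the --groupBy pattern:

```python
import shutil
import subprocess

# Build the s3-dist-cp invocation as an argument list so no shell
# quoting is needed; the --groupBy regex stays a single argument.
cmd = [
    "/usr/bin/s3-dist-cp",
    "--src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/",
    "--dest=/de_pulse/",
    "--groupBy=.*(additional).*",
    "--targetSize=64",
    "--outputCodec=none",
]

# Only run when s3-dist-cp is actually installed (e.g. on an EMR master node).
if shutil.which(cmd[0]):
    # check=True raises CalledProcessError on a non-zero exit code,
    # unlike os.system, which only returns the status.
    subprocess.run(cmd, check=True)
```

Unlike os.system, subprocess.run with check=True makes a failed copy raise an exception instead of failing silently.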
If you're running a PySpark application, you'll have to stop the Spark application first. The s3-dist-cp job will hang because the running PySpark application is still blocking the cluster's resources.
spark.stop()  # stop the SparkSession (and its SparkContext) first
os.system("/usr/bin/s3-dist-cp ...")

lababidi