I had some problems running the s3-dist-cp command in my PySpark script, as I needed to move data from S3 to HDFS for better performance. So I am sharing the solution here.
2 Answers
import os
os.system(
    "/usr/bin/s3-dist-cp "
    "--src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/ "
    "--dest=/de_pulse/ "
    "--groupBy='.*(additional).*' "
    "--targetSize=64 "
    "--outputCodec=none"
)
Note: please make sure that you give the full path to s3-dist-cp, e.g. /usr/bin/s3-dist-cp.
Also, I think we can use subprocess instead of os.system.
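A minimal sketch of the subprocess variant, assuming the same paths as the answer above (adjust the bucket and destination to your environment). Passing the command as an argument list avoids shell quoting issues with the --groupBy pattern:

```python
import shutil
import subprocess

# Build the s3-dist-cp invocation as an argument list so no shell
# quoting is needed; the --groupBy regex stays a single argument.
cmd = [
    "/usr/bin/s3-dist-cp",
    "--src=s3://aiqdatabucket/aiq-inputfiles/de_pulse_ip/latest/",
    "--dest=/de_pulse/",
    "--groupBy=.*(additional).*",
    "--targetSize=64",
    "--outputCodec=none",
]

# Only run when s3-dist-cp is actually installed (e.g. on an EMR master node).
if shutil.which(cmd[0]):
    # check=True raises CalledProcessError on a non-zero exit code,
    # unlike os.system, which only returns the status.
    subprocess.run(cmd, check=True)
```

Unlike os.system, subprocess.run with check=True makes a failed copy raise an exception instead of failing silently.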
If you're running a PySpark application, you'll have to stop the Spark application first. The s3-dist-cp job will hang because the running PySpark application is still blocking the cluster's resources.
spark.stop()  # stop the SparkSession (and its SparkContext) first
os.system("/usr/bin/s3-dist-cp ...")

lababidi