On EMR, I am using s3-dist-cp --groupBy
to take a file that has a randomly generated name
in an HDFS folder and give it a name of my choosing when it is copied to S3:
s3-dist-cp --groupBy='.*(folder_in_hdfs).*' --src=hdfs:///user/testUser/tmp-location/folder_in_hdfs --dest=s3://testLocation/folder_in_s3
Example:
hadoop fs -ls hdfs:///user/testUser/tmp-location/folder_in_hdfs
Found 2 items
-rw-r--r-- 1 hadoop hadoop 0 2019-04-05 14:54 hdfs:///user/testUser/tmp-location/folder_in_hdfs/file.csv/_SUCCESS
-rw-r--r-- 1 hadoop hadoop 493077 2019-04-05 14:54 hdfs:///user/testUser/tmp-location/folder_in_hdfs/file.csv/part-00000-12db8851-31be-4b08-8a93-1887e534941d-c000.csv
After running s3-dist-cp, the bucket contains a single file with the name I want:
aws s3 ls s3://testLocation/folder_in_s3/
s3://testLocation/folder_in_s3/file.csv
However, I would like to achieve the same result on Dataproc using hadoop distcp
commands, so that the file ends up in GCS as gs://testLocation/folder_in_gs/file.csv
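For illustration, one possible direction is a two-step copy and rename, since hadoop distcp does not appear to offer an equivalent of --groupBy. The sketch below is untested against a real cluster and rests on several assumptions: the Cloud Storage connector is installed so that gs:// paths work with hadoop commands, the source directory holds exactly one part-*.csv file, and the staging path, the RUN variable, and the function names run and copy_and_rename are hypothetical names introduced only for this example. By default it only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: copy a single part file from HDFS to GCS under a fixed name.
# Assumes the Cloud Storage connector is available so hadoop can address gs:// paths.
set -euo pipefail

SRC_DIR="hdfs:///user/testUser/tmp-location/folder_in_hdfs/file.csv"
DEST_FILE="gs://testLocation/folder_in_gs/file.csv"
STAGING_DIR="${DEST_FILE}.staging"   # hypothetical temporary location

# Print each command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

copy_and_rename() {
  # 1. Bulk copy the output directory; distcp keeps the original part-* file name.
  run hadoop distcp "$SRC_DIR" "$STAGING_DIR"
  # 2. Rename the part file; the glob stays quoted so hadoop fs expands it, not the shell.
  run hadoop fs -mv "${STAGING_DIR}/part-*.csv" "$DEST_FILE"
  # 3. Remove what is left in the staging directory (for example the _SUCCESS marker).
  run hadoop fs -rm -r "$STAGING_DIR"
}

copy_and_rename
```

Running the script as-is prints the three commands; running it with RUN=1 would execute them. Unlike --groupBy, this does not concatenate multiple part files, so it only covers the single-file case shown above.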
Any help is appreciated.