After using the coalesce() function in PySpark (Databricks), the output was saved as a single CSV with an auto-generated name that starts with part-00000 and ends with the .csv extension. I would like to rename it to a more user-friendly name inside a function.
I tried the approach suggested here: https://medium.com/plusteam/move-and-rename-objects-within-an-s3-bucket-using-boto-3-58b164790b78
import boto3

s3_resource = boto3.resource('s3')

# Copy object A as object B (note: CopySource must include the bucket name)
s3_resource.Object('bucket_name', 'newpath/to/object_B.txt').copy_from(
    CopySource='bucket_name/path/to/your/object_A.txt')

# Delete the former object A
s3_resource.Object('bucket_name', 'path/to/your/object_A.txt').delete()
The above code copies the object under the new name and then deletes the original file. However, after several tries, it only works when I put the entire auto-generated name in CopySource.
Since there is only one oddly-named file, what I would like is to match it with *.csv, the way wildcards work with pandas. I tried the endswith() function, but I could not get it to work.
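This is roughly how I imagined endswith() fitting in: since S3 itself has no wildcard matching as far as I understand, list every key under the prefix and filter them in Python. A minimal sketch, where the bucket and prefix names are just placeholders:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')  # placeholder bucket name

prefix = 'path/to/your/'  # placeholder folder that coalesce() wrote into
# list all keys under the prefix and keep only the CSV part file
csv_keys = [obj.key for obj in bucket.objects.filter(Prefix=prefix)
            if obj.key.endswith('.csv')]
print(csv_keys)  # should contain exactly one part-00000-... key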
The answer from this question, Rename Pyspark output files in s3, renames each partition, so there is an obvious pattern:
import datetime
import boto3

s3 = boto3.resource('s3')

for i in range(5):
    date = datetime.datetime(2019, 4, 29)
    date += datetime.timedelta(days=i)
    date = date.strftime("%Y-%m-%d")
    print(date)
    old_date = 'file_path/FLORIDA/DATE={}/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv'.format(date)
    print(old_date)
    date = date.replace('-', '')
    new_date = 'file_path/FLORIDA/allocation_FLORIDA_{}.csv'.format(date)
    print(new_date)
    s3.Object('my_bucket', new_date).copy_from(CopySource='my_bucket/' + old_date)
    s3.Object('my_bucket', old_date).delete()
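To avoid hard-coding the UUID part of the file name, I imagine the same loop could first look up the single part file under each DATE= prefix. A sketch, reusing the bucket and path names from the example above:

import datetime
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my_bucket')

for i in range(5):
    date = (datetime.datetime(2019, 4, 29) + datetime.timedelta(days=i)).strftime("%Y-%m-%d")
    prefix = 'file_path/FLORIDA/DATE={}/'.format(date)
    # find the lone part-00000-*.csv file without knowing its UUID
    part_keys = [obj.key for obj in bucket.objects.filter(Prefix=prefix)
                 if obj.key.endswith('.csv')]
    if len(part_keys) != 1:
        continue  # skip dates whose folder is empty or has several part files
    new_key = 'file_path/FLORIDA/allocation_FLORIDA_{}.csv'.format(date.replace('-', ''))
    s3.Object('my_bucket', new_key).copy_from(CopySource='my_bucket/' + part_keys[0])
    s3.Object('my_bucket', part_keys[0]).delete()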
I think with pandas it would have been something like this (note the use of *):
import boto3

s3_resource = boto3.resource('s3')

# Copy object A as object B
s3_resource.Object('bucket_name', 'newpath/to/object_B.csv').copy_from(
    CopySource='path/to/your/*.csv')

# Delete the former object A
s3_resource.Object('bucket_name', 'path/to/your/*.csv').delete()
but when I run this in Databricks, it returns nothing.
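For clarity, what I am ultimately after is a helper along these lines, which would emulate the * by listing the prefix (the function name and arguments here are just my invention):

import boto3

def rename_single_csv(bucket_name, prefix, new_key):
    # hypothetical helper: find the lone part-*.csv under prefix,
    # copy it to new_key, then delete the original
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    csv_keys = [obj.key for obj in bucket.objects.filter(Prefix=prefix)
                if obj.key.endswith('.csv')]
    if len(csv_keys) != 1:
        raise ValueError('expected one csv under {}, found {}'.format(prefix, len(csv_keys)))
    s3.Object(bucket_name, new_key).copy_from(CopySource=bucket_name + '/' + csv_keys[0])
    s3.Object(bucket_name, csv_keys[0]).delete()

# e.g. rename_single_csv('my_bucket', 'path/to/your/', 'newpath/to/object_B.csv')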