There are no functions in the new (or old) filesystem APIs for transferring files between filesystems.
Of course, moving chunks of data between file handles yourself is a solution.
I'm not sure if this is what you were thinking of, but here is a simple utility (and demo) showing how to do this from Python:
import filecmp

import pyarrow.fs as pafs

BATCH_SIZE = 1024 * 1024


def transfer_file(in_fs, in_path, out_fs, out_path):
    # Copy a file between two (possibly different) filesystems by streaming
    # it in fixed-size chunks.
    with in_fs.open_input_stream(in_path) as in_file:
        with out_fs.open_output_stream(out_path) as out_file:
            while True:
                buf = in_file.read(BATCH_SIZE)
                if buf:
                    out_file.write(buf)
                else:
                    break


local_fs = pafs.LocalFileSystem()
s3fs = pafs.S3FileSystem()

in_path = '/tmp/in.data'
out_path = 'mybucket/test.data'
back_out_path = '/tmp/in_copy.data'

# Round trip: local -> S3 -> local, then verify the copies are identical.
transfer_file(local_fs, in_path, s3fs, out_path)
transfer_file(s3fs, out_path, local_fs, back_out_path)

files_match = filecmp.cmp(in_path, back_out_path)
print(f'Files Match: {files_match}')
I would expect transfer_file to perform well. Some situations (e.g. reading from S3) could benefit from a parallel read using read_at, which would require a bit more complexity but should also be doable.
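For example, here is a minimal sketch of what that parallel read could look like. It assumes the source filesystem supports random access via open_input_file; the helper name parallel_transfer_file and the chunking scheme are mine, not part of pyarrow, and the concurrency benefit is something you would want to benchmark for your setup:

import concurrent.futures

import pyarrow.fs as pafs

BATCH_SIZE = 1024 * 1024


def parallel_transfer_file(in_fs, in_path, out_fs, out_path, max_workers=8):
    # Hypothetical sketch: issue positioned reads (read_at) concurrently,
    # then write the chunks back out in their original order.
    size = in_fs.get_file_info(in_path).size
    offsets = range(0, size, BATCH_SIZE)
    with in_fs.open_input_file(in_path) as in_file:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            # read_at(nbytes, offset) does a positioned read, so the reads do
            # not share a cursor. For simplicity this holds every chunk in
            # memory; a real implementation would bound the in-flight chunks.
            futures = [pool.submit(in_file.read_at, BATCH_SIZE, off)
                       for off in offsets]
            with out_fs.open_output_stream(out_path) as out_file:
                for fut in futures:
                    out_file.write(fut.result())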
But why does copy_file() exist then?
copy_file copies a file from one name on a filesystem to a different name on the same filesystem. It cannot be used to copy files between different filesystems.
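For example, reusing the local_fs instance from the snippet above (the destination path here is just illustrative):

# Duplicate a file within a single filesystem.
local_fs.copy_file('/tmp/in.data', '/tmp/in_duplicate.data')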