
Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)?

I have read the documentation back and forth and tried a few things (such as copy_file() with FS URIs), but none of them seems to work. The legacy HDFS API is straightforward to use, but it is deprecated, while the new API seems to be incomplete. Of course, moving chunks of data between file descriptors is a solution, but then why does copy_file() exist?

Andor

2 Answers


There are no functions in the new (or the old) filesystem API for transferring files between filesystems.

Of course, moving chunks of data between file descriptors is a solution

I'm not sure if this is what you were thinking, but here is a simple utility (and demo) showing how to do this from Python:

import filecmp
import pyarrow.fs as pafs

BATCH_SIZE = 1024 * 1024

def transfer_file(in_fs, in_path, out_fs, out_path):
    # Stream the source file into the destination in fixed-size chunks,
    # so the whole file never has to fit in memory at once.
    with in_fs.open_input_stream(in_path) as in_file:
        with out_fs.open_output_stream(out_path) as out_file:
            while True:
                buf = in_file.read(BATCH_SIZE)
                if buf:
                    out_file.write(buf)
                else:
                    break

local_fs = pafs.LocalFileSystem()
s3fs = pafs.S3FileSystem()
in_path = '/tmp/in.data'
out_path = 'mybucket/test.data'
back_out_path = '/tmp/in_copy.data'

transfer_file(local_fs, in_path, s3fs, out_path)
transfer_file(s3fs, out_path, local_fs, back_out_path)

files_match = filecmp.cmp(in_path, back_out_path)
print(f'Files Match: {files_match}')
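
Since the original question is about HDFS, the same utility should also work with the new HadoopFileSystem. A minimal sketch (the host, port and paths are placeholders, and it assumes libhdfs and the usual Hadoop client environment are already configured):

# Placeholder connection details; adjust for your cluster.
hdfs = pafs.HadoopFileSystem('namenode-host', port=8020)

# The "copyFromLocal" case from the question: local filesystem -> HDFS
transfer_file(local_fs, '/tmp/in.data', hdfs, '/user/someuser/in.data')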

I would expect transfer_file to get good performance. There may be some situations (e.g. reading from S3) that could benefit from a parallel read using read_at, which would require a bit more complexity but should also be doable.
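
For example, here is a rough sketch of such a parallel read (the helper name and chunking scheme are just illustrative, it reuses BATCH_SIZE from above, and it assumes the source filesystem can serve concurrent read_at calls on the same file handle):

from concurrent.futures import ThreadPoolExecutor

def parallel_transfer_file(in_fs, in_path, out_fs, out_path, max_workers=8):
    # Look up the file size so we can compute one offset per chunk.
    size = in_fs.get_file_info(in_path).size
    offsets = range(0, size, BATCH_SIZE)
    # open_input_file returns a random-access handle, which read_at needs.
    with in_fs.open_input_file(in_path) as in_file:
        with out_fs.open_output_stream(out_path) as out_file:
            def read_chunk(offset):
                return in_file.read_at(min(BATCH_SIZE, size - offset), offset)
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                # map() yields results in offset order, so the reads overlap
                # but the output file is still written sequentially.
                for buf in pool.map(read_chunk, offsets):
                    out_file.write(buf)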

but then why does copy_file() exist?

copy_file copies a file from one name on a filesystem to a different name on the same filesystem. It cannot be used to copy files between different filesystems.
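
For illustration (the paths here are just placeholders), a same-filesystem copy looks like this:

import pyarrow.fs as pafs

local_fs = pafs.LocalFileSystem()
# Source and destination are both paths on the same filesystem instance.
local_fs.copy_file('/tmp/in.data', '/tmp/in.data.bak')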

Pace

To add to @Pace's answer (too long for a comment): I was copying gzip files (*.gz), which PyArrow was, by default, decompressing on each read() call and then re-compressing on each write() call. I verified this with the print statement shown below in Pace's version of the code.

So, to get significantly faster transfers/copies, turn off compression:

def transfer_file(in_fs, in_path, out_fs, out_path):
    with in_fs.open_input_stream(in_path, compression=None) as in_file:
        with out_fs.open_output_stream(out_path, compression=None) as out_file:
            while True:
                buf = in_file.read(BATCH_SIZE)
                if buf:
                    print(f'buf size: {len(buf)}')
                    out_file.write(buf)
                else:
                    break

Mark Rajcok