
I am currently working with the s3a adapter of Hadoop/HDFS to allow me to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not directories) for copying via distcp.

I have set up my program to collect an array of filepaths using a function, inject them all into a distcp command, and then run the command:

files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

# Build fully qualified HDFS URIs for each source file.
full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
# Single distcp invocation listing every source path before the s3a destination.
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"

logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)

This basically just creates one long distcp command with 15-20 different filepaths. Will this work? Should I be using the -cp or -put commands instead of distcp?

(It doesn't make sense to me to copy all these files to their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)

tprebenda
  • If you want to export data from Hive to S3, why not use an EXTERNAL table and `INSERT INTO SELECT FROM` query? Or use PySpark instead of your subprocess script? – OneCricketeer Feb 23 '22 at 23:37
  • @OneCricketeer I personally haven't heard anything about the first option... I suppose we could use pyhive or something like that. For PySpark, I think we wanted to use this option instead just because it's a faster, more direct method of copying. Do you have an answer to my question though? I unfortunately don't have the time to investigate these alternatives and implement them – tprebenda Feb 24 '22 at 15:48
  • I have never used `distcp` with more than one file, actually. If everything is in the same folder, you should be able to copy whole directories at once – OneCricketeer Feb 24 '22 at 16:47
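
For reference, here is a minimal PySpark sketch of the alternative route mentioned in the comments above, writing a Hive table's rows straight to S3. The database, table, and output path are hypothetical placeholders, not taken from the question:

from pyspark.sql import SparkSession

# Spark session with Hive support so spark.table() can read from the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-to-s3-export")
         .enableHiveSupport()
         .getOrCreate())

# Read the Hive table and write it directly to the s3a destination,
# letting Spark do the copy instead of shelling out to distcp.
df = spark.table("my_db.my_table")  # hypothetical database and table
df.write.mode("overwrite").parquet("s3a://path/to/bucket/my_table")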

1 Answer

-cp and -put would require you to download the HDFS files and then upload them to S3 through a single client, which would be a lot slower than distcp's distributed copy.
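
For contrast, a single-client copy with `hadoop fs -cp` would look roughly like the sketch below (the source path is a placeholder, and the bucket path is reused from the question); the data streams through the one machine running the command instead of being copied by a distributed job:

import subprocess

# Single-client copy: this process reads each HDFS file and writes it to S3 itself.
subprocess.run(
    ["hadoop", "fs", "-cp",
     "hdfs://nameservice1/path/to/file",  # placeholder source path
     "s3a://path/to/bucket/"],
    check=True,
)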

I see no immediate reason why this wouldn't work. However, after reading over the documentation, I would recommend using the -f flag instead, which reads the list of source paths from a file.

E.g.

files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

# Write the fully qualified source URIs to a file, one per line,
# so distcp can read them via -f instead of a long argument list.
src_file = 'to_copy.txt'
with open(src_file, 'w') as f:
    for file in files:
        f.write(f'hdfs://nameservice1{file}\n')

s3_dest = "path/to/bucket"
# Pass the command as a list and drop shell=True; with shell=True, only the
# first element of a list argument is actually executed on POSIX systems.
result = subprocess.run(['hadoop', 'distcp', '-f', src_file, f's3a://{s3_dest}'], check=True)

If all the files were already in their own directory, then you should just copy the directory, like you said.
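
A quick sketch of that directory-level copy, assuming the files all lived under a single (hypothetical) HDFS directory:

import subprocess

# One distcp of the whole directory; -update mirrors the flag from the question.
subprocess.run(
    ["hadoop", "distcp", "-update",
     "hdfs://nameservice1/path/to/source_dir",  # hypothetical source directory
     "s3a://path/to/bucket"],
    check=True,
)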

OneCricketeer
  • Are you able to use distcp without specifying an absolute path? I don't think we can just use `src_file`, I think I need some way of finding the `hdfs://path_to/src_file`, which is harder for me due to permission issues – tprebenda Feb 25 '22 at 16:14
  • `src_file` is a local file that contains the list of URIs to copy. https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html – OneCricketeer Feb 25 '22 at 16:56
  • Yeah if you look at that page, every single filepath given has the "hdfs://" prefix. I've run it with simply `src_file`, and it fails saying the file does not exist. – tprebenda Feb 25 '22 at 21:21
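
Per the DistCp docs linked above, the argument to -f is itself a URI, which is consistent with the "file does not exist" error when a bare local path is passed. A minimal workaround sketch, reusing `files` from the snippets above and assuming a writable staging location such as /tmp on HDFS (the exact path is an assumption):

import subprocess

# Write the list of fully qualified source URIs locally...
src_file = "to_copy.txt"
with open(src_file, "w") as f:
    for file in files:
        f.write(f"hdfs://nameservice1{file}\n")

# ...then stage the list on HDFS so distcp can resolve it by URI.
hdfs_list = "hdfs://nameservice1/tmp/to_copy.txt"  # hypothetical staging path
subprocess.run(["hdfs", "dfs", "-put", "-f", src_file, hdfs_list], check=True)

s3_dest = "path/to/bucket"
subprocess.run(["hadoop", "distcp", "-f", hdfs_list, f"s3a://{s3_dest}"], check=True)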