Currently we import files into HDFS by invoking the org.apache.hadoop.fs.FileSystem.moveFromLocalFile()
method of the Hadoop FileSystem API. We are now seeing large heap usage on our NameNode because the number of small files being imported is too high, and we want to reduce it. Is there an easier way to import the files into HDFS as a HAR, without having to import all the small files first? In short: I import the small files, but in HDFS there is one HAR file containing my imported files.
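For reference, our current import is roughly the loop below (a minimal sketch; the directory paths are placeholders):

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileImport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        File localDir = new File("/data/incoming");   // placeholder local source
        Path hdfsDir = new Path("/ingest/incoming");  // placeholder HDFS target

        // One moveFromLocalFile() call per small file: every call creates a
        // separate HDFS file, and each file's metadata lives in NameNode heap.
        for (File f : localDir.listFiles()) {
            fs.moveFromLocalFile(new Path(f.getAbsolutePath()),
                                 new Path(hdfsDir, f.getName()));
        }
    }
}
```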

James Carl Necio
1 Answer
It is not possible to ingest files into HDFS directly as a HAR (Hadoop Archive); the archive has to be built from files that already exist in HDFS.
The better approach is to copy the small files into HDFS first and then create a HAR file that merges them together.
You can use the hadoop archive command to create the HAR file
(usage: hadoop archive -archiveName {archive name}.har -p {input parent folder path} {output folder path}; note that the archive name must end in .har). After the HAR file has been created, you can delete your original files.
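For example, to archive everything under a hypothetical hdfs:///tmp/partition1/ directory into partition1.har under hdfs:///tmp/archive1/:

```
hadoop archive -archiveName partition1.har -p /tmp partition1 /tmp/archive1
```

The archived files remain readable in place through the har:// URI scheme, e.g. hadoop fs -ls har:///tmp/archive1/partition1.har.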
If there are millions of small files, you can copy them in chunks.
For example, assume you have 100,000 small files. One possible approach (sketched in code after the list):
- Copy 10,000 files into a temporary location in HDFS, e.g. hdfs:///tmp/partition1/
- Create a HAR file from these 10,000 files, writing it to e.g. hdfs:///tmp/archive1/
- After creating the archive, delete the files from hdfs:///tmp/partition1/
- Repeat steps 1 to 3 until all 100,000 files are ingested.
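If you drive the ingestion from Java rather than the shell, the archive step can also be run programmatically via ToolRunner with the HadoopArchives tool (from the hadoop-archives artifact). This is a rough sketch assuming that artifact is on the classpath; the paths and batch layout are the hypothetical ones from the steps above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.HadoopArchives;
import org.apache.hadoop.util.ToolRunner;

public class BatchHarIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        int batches = 10;  // 100,000 files in batches of 10,000
        for (int i = 1; i <= batches; i++) {
            // Step 1 (not shown): copy the next 10,000 small files
            // into /tmp/partition<i> before archiving.
            Path partition = new Path("/tmp/partition" + i);

            // Step 2: equivalent to
            //   hadoop archive -archiveName partition<i>.har -p /tmp partition<i> /tmp/archive<i>
            String[] harArgs = {
                    "-archiveName", "partition" + i + ".har",
                    "-p", "/tmp",
                    "partition" + i,
                    "/tmp/archive" + i
            };
            int rc = ToolRunner.run(new HadoopArchives(conf), harArgs);
            if (rc != 0) {
                throw new IllegalStateException("Archiving failed for batch " + i);
            }

            // Step 3: the originals are redundant once the archive exists.
            fs.delete(partition, true);
        }
    }
}
```

Deleting each partition directory only after the archive step returns success keeps the loop restartable: a failed batch leaves its original files in place.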

Manjunath Ballur