I want to save image files (JPEG, PNG, etc.) on HDFS (the Hadoop Distributed File System). I tried two approaches:

  1. Saved the image files as they are (i.e. in their original format) into HDFS using the put command. The full command was: hadoop fs -put /home/a.jpeg /user/hadoop/. The file was placed successfully.
  2. Converted the image files into Hadoop's Sequence File format and then saved them in HDFS using the put command.

I want to know which format should be used for storing them in HDFS.
Also, what are the advantages of the Sequence File format? One advantage I know of is that it is splittable. Are there any others?

Abhilash Awasthi

1 Answer

Images are very small compared to the HDFS block size. The problem with small files is their impact on processing performance, which is why you should use Sequence Files, HAR, HBase, or a file-merging solution. See these two threads for more info:

effective way to store image files

How many files is too many on a modern HDP cluster?
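To make the "pack many small files into one container" idea concrete, here is a minimal sketch in Python. It is not Hadoop's actual Sequence File on-disk format (which also has a header, sync markers, and optional compression); it only illustrates the key/value layout, with the file name as key and the raw image bytes as value. The function names pack_records and unpack_records are hypothetical.

```python
import io
import struct

# Hypothetical toy container: each record is
# [4-byte key length][4-byte value length][key bytes][value bytes].
# This only mimics the key/value idea; it is NOT the real SequenceFile format.
HEADER = ">II"  # two big-endian unsigned 32-bit integers


def pack_records(files):
    """Pack (name, data) pairs into a single bytes blob."""
    out = io.BytesIO()
    for name, data in files:
        key = name.encode("utf-8")
        out.write(struct.pack(HEADER, len(key), len(data)))
        out.write(key)
        out.write(data)
    return out.getvalue()


def unpack_records(blob):
    """Yield (name, data) pairs back from a packed blob."""
    view = io.BytesIO(blob)
    header_size = struct.calcsize(HEADER)
    while True:
        header = view.read(header_size)
        if not header:
            break  # reached the end of the container
        key_len, value_len = struct.unpack(HEADER, header)
        name = view.read(key_len).decode("utf-8")
        data = view.read(value_len)
        yield name, data


# Usage: two fake "images" end up in one blob, i.e. one file on HDFS
# instead of two.
images = [("a.jpeg", b"\xff\xd8fake-jpeg"), ("b.png", b"\x89PNGfake-png")]
blob = pack_records(images)
restored = list(unpack_records(blob))
```

In a real cluster you would write the container with Hadoop's own Sequence File writer rather than a hand-rolled format; the sketch only shows why one large file can hold many images while keeping each one addressable by name.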

Processing a 1 MB file carries a fixed overhead. Processing 128 files of 1 MB each therefore costs 128 times the "administrative" overhead of processing a single 128 MB file. In a plain-text format, that 1 MB file might contain 1,000 records, while the 128 MB file might contain 128,000 records.
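The arithmetic above can be sketched in a few lines of Python. This is a simplified model, assuming a 128 MB block size and that each file yields at least one input split (and so at least one task); it ignores details such as split slop and combining input formats. The name count_splits is hypothetical.

```python
import math


def count_splits(file_sizes_mb, block_size_mb=128):
    """Rough number of input splits: at least one per file,
    plus one more for every additional block the file spans.
    Simplified assumption, not the exact Hadoop split logic."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


# 128 files of 1 MB each versus one 128 MB file (same total data).
many_small = count_splits([1] * 128)
one_large = count_splits([128])
print(many_small, one_large)
```

Under these assumptions the same 128 MB of data needs 128 splits when stored as small files, but only one when stored as a single file.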

Ronak Patel