
I have lots of image files that I need to store in HDFS. To avoid the small files problem, I am planning to store them in SequenceFiles.
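Roughly, this is how I plan to write them, with the file name as the key and the raw bytes as the value (the paths and the input list here are just placeholders):

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/images/batch-0001.seq"), Text.class, BytesWritable.class);

List<Path> imagePaths = Arrays.asList(new Path("/staging/img_000123.jpg"));  // placeholder input list
for (Path img : imagePaths) {
    byte[] bytes = new byte[(int) fs.getFileStatus(img).getLen()];
    FSDataInputStream in = fs.open(img);
    in.readFully(bytes);                 // the files are small, so one read is fine
    in.close();
    writer.append(new Text(img.getName()), new BytesWritable(bytes));
}
writer.close();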

My problem is that I need to create a MapReduce program that processes only a selection of those files. I don't think it is a good idea to read the content of every image out of a SequenceFile when I only plan to process a few of them. Also, more images can be added over time; if I create a new SequenceFile for each batch of images, how would I know which SequenceFile contains the images I need to process? And even if I knew, it would be overwhelming to filter the images manually before feeding them to the mapper.

Please advise. Thanks!

zaz

2 Answers


If you can store your files in a MapFile, which is a SequenceFile with an index, you can use MapFile.Reader to look up a file by its key. For example:

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// fs, dirName and conf are assumed to be set up elsewhere
MapFile.Reader reader = new MapFile.Reader(fs, dirName, conf);

public byte[] get(String filename) {
    Text key = new Text(filename);          // the key is the image's file name
    BytesWritable value = new BytesWritable();
    if (reader.get(key, value) != null) {   // get() returns null when the key is absent
        return value.copyBytes();
    } else {
        return null;
    }
}
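Looking up a single image is then a point query against the MapFile's index, e.g. (the file name is made up):

byte[] imageBytes = get("img_000123.jpg");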

If your files are generated by a MapReduce application, you can use MapFileOutputFormat to write MapFiles directly.
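For example, the job setup might look like this (the driver class and paths are made up; note that the shuffle already delivers reducer keys in sorted order, which MapFile requires):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "images-to-mapfile");
job.setJarByClass(ImagesToMapFile.class);            // hypothetical driver class
job.setOutputKeyClass(Text.class);                   // file name
job.setOutputValueClass(BytesWritable.class);        // image bytes
job.setOutputFormatClass(MapFileOutputFormat.class); // writes an indexed MapFile per reducer
FileInputFormat.addInputPath(job, new Path("/images/in"));         // made-up paths
FileOutputFormat.setOutputPath(job, new Path("/images/mapfiles"));
job.waitForCompletion(true);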

In addition, since you only need to process a few files, I don't think you need MapReduce for this at all.

zsxwing
  • Thanks a lot for your answer, just one more question: after the query I need to send these images to the mappers for processing. Since I queried for the files I need, I guess the values would be loaded into memory? If that is the case, the images would be pulled out of their nodes, so would Hadoop still be able to run the mapper on the node where the data was originally stored? Thanks again!! – zaz Feb 27 '14 at 16:53
  • Why MapReduce? If you only need to process 2-3 files, manipulating them directly would perform better. If you insist on running it in MapReduce, you need to write a custom InputFormat; the default one will scan all of the data. – zsxwing Feb 28 '14 at 08:04
  • Of course, if you use `MapFile.Reader` directly, the data usually needs to be sent over the network. But since you mentioned there are only a few files, I think that's fine. – zsxwing Feb 28 '14 at 08:06
  • Thanks a lot for your comments. I think I will create my own custom InputFormat and RecordReader to filter the images before they are sent to the mappers and avoid the network cost (a rough sketch of this idea follows below). When I say filter, I'm talking about selecting maybe 10,000 files out of 1,000,000. Thanks!! – zaz Feb 28 '14 at 16:31
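A rough sketch of such a filtering InputFormat (the class name and config key are made up; note it still reads every record in each split and only discards unwanted ones before they reach the map function):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Wraps the stock SequenceFile reader and drops records whose key (file
// name) is not in a whitelist, so unwanted images never reach the mapper.
public class FilteredImageInputFormat extends SequenceFileInputFormat<Text, BytesWritable> {

    public static final String WANTED = "images.wanted";   // made-up config key, comma-separated file names

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        final RecordReader<Text, BytesWritable> inner = super.createRecordReader(split, context);
        return new RecordReader<Text, BytesWritable>() {
            private Set<String> wanted;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx)
                    throws IOException, InterruptedException {
                inner.initialize(s, ctx);
                wanted = new HashSet<String>(
                        Arrays.asList(ctx.getConfiguration().get(WANTED, "").split(",")));
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                while (inner.nextKeyValue()) {   // skip forward to the next wanted key
                    if (wanted.contains(inner.getCurrentKey().toString())) return true;
                }
                return false;
            }

            @Override public Text getCurrentKey() throws IOException, InterruptedException { return inner.getCurrentKey(); }
            @Override public BytesWritable getCurrentValue() throws IOException, InterruptedException { return inner.getCurrentValue(); }
            @Override public float getProgress() throws IOException, InterruptedException { return inner.getProgress(); }
            @Override public void close() throws IOException { inner.close(); }
        };
    }
}

To avoid the reads themselves, not just the mapper work, you would need to seek via a MapFile index instead of scanning plain SequenceFiles.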

You could store the image files in HBase along with any other attributes of the images that you may want to filter or query on. This allows you to query for images selectively.
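For example, something like this, using the old HTable client API (the table, column family, and attribute names are all made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "images");                         // hypothetical table

// One row per image: row key = file name, raw bytes in one column
// family, filterable attributes in another.
Put put = new Put(Bytes.toBytes("img_000123.jpg"));
put.add(Bytes.toBytes("data"), Bytes.toBytes("raw"), imageBytes);  // imageBytes loaded elsewhere
put.add(Bytes.toBytes("attr"), Bytes.toBytes("category"), Bytes.toBytes("xray"));
table.put(put);

// Later, fetch exactly the image you need by its row key.
Get get = new Get(Bytes.toBytes("img_000123.jpg"));
byte[] raw = table.get(get).getValue(Bytes.toBytes("data"), Bytes.toBytes("raw"));

Attribute-based selection would then be a Scan with a filter on the attribute column rather than a full pass over a SequenceFile.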

See this:
http://apache-hbase.679495.n3.nabble.com/Storing-images-in-Hbase-td4036184.html
http://www.slideshare.net/jacque74/hug-hbase-presentation

Jasper