I have a large number of image files that I need to store in HDFS. To avoid the small files problem, I am planning to pack them into SequenceFiles.
My problem is that I need to write a MapReduce program that processes only a selection of those images, and I don't think it is a good idea to read the contents of every image from a SequenceFile if I only plan to process a few of them. More images can also be added over time. If I create a new SequenceFile for each batch of images, how would I know which SequenceFile contains the images I need to process? And even if I knew, manually filtering the images before passing them as input to the mapper would be overwhelming.
Please advise. Thanks!