
I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS, essentially because the programs that use the data cannot read it from HDFS and there is too much of it to copy in (at least 1TB per node).

So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths to these 4 local directories and read them using something like file:///var/mydata/..., with one mapper working on each directory, i.e. 16 mappers in total.

However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and that they are exactly the 4 mappers assigned the paths local to that machine. These paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that a given split ends up on a given node with a known hostname? If the data were in HDFS I could use a variant of FileInputFormat with isSplitable returning false and Hadoop would take care of it, but as all the data is local this causes issues.

Basically all I want is to crawl the local directory structures on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (in the mappers), then reduce the results (in the reduce step) into HDFS for further bulk processing.

I noticed that InputSplit provides a getLocations() method, but I believe this does not guarantee locality of execution, only optimises for it, and clearly if I try to use file:///some_path in each mapper I need to ensure exact locality, otherwise I may end up reading some directories repeatedly and others not at all.
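
For reference, this is roughly the kind of InputFormat I have in mind (a minimal sketch only: the class names, hostnames and paths are made up for illustration, and the RecordReader that actually walks the SSTables is omitted). As far as I can tell, the locations returned by the splits are treated purely as a hint:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // One split per hard-coded local directory, tagged with the host that owns it.
    public class LocalDirInputFormat extends InputFormat<Text, Text> {

        // illustrative hostnames/paths; in reality 4 directories on each of 4 nodes
        private static final String[][] NODE_DIRS = {
            { "node1.example.com", "file:///var/mydata/dir1" },
            { "node1.example.com", "file:///var/mydata/dir2" },
            // ...16 entries in total
        };

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (String[] nodeDir : NODE_DIRS) {
                splits.add(new LocalDirSplit(nodeDir[0], nodeDir[1]));
            }
            return splits;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            // my RecordReader over the SSTables under the split's directory would go here
            throw new UnsupportedOperationException("omitted from this sketch");
        }

        public static class LocalDirSplit extends InputSplit implements Writable {
            private String host;
            private String dir;

            public LocalDirSplit() { }                     // needed for deserialisation
            public LocalDirSplit(String host, String dir) { this.host = host; this.dir = dir; }

            @Override public long getLength() { return 1; }
            @Override public String[] getLocations() { return new String[] { host }; } // only a hint!

            @Override public void write(DataOutput out) throws IOException {
                out.writeUTF(host);
                out.writeUTF(dir);
            }
            @Override public void readFields(DataInput in) throws IOException {
                host = in.readUTF();
                dir = in.readUTF();
            }
        }
    }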

Any help would be greatly appreciated.

feldoh

1 Answer


I see there are three ways you can do it.

1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying, as it will be useful for future processing.

2.) You can make use of NLineInputFormat. Create four files, one for each of your nodes, listing the URLs of the input files on that node, e.g.

file://192.168.2.3/usr/rags/data/DFile1.xyz
.......

You load these files into HDFS and write your program to access the data using these URLs and process it. If you use NLineInputFormat with 1 line per split, you will get 16 mappers, each map processing one exclusive file. The only issue here is that there is a high possibility that the data on one node will be processed on another node; however, there will not be any duplicate processing.
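
A minimal driver sketch of this option, assuming the Hadoop 2 (org.apache.hadoop.mapreduce) API; the class names and HDFS paths are placeholders, and the mapper body is only a stub where the actual SSTable processing would go:

    import java.io.IOException;
    import java.net.InetAddress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalUrlDriver {

        // Each mapper receives one line: a file:// URL for one local data directory.
        public static class UrlMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text urlLine, Context context)
                    throws IOException, InterruptedException {
                // real SSTable crawling and row emission would go here; this stub just
                // records which host ended up processing which URL
                String host = InetAddress.getLocalHost().getHostName();
                context.write(urlLine, new Text(host));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "process-local-sstable-dirs");
            job.setJarByClass(LocalUrlDriver.class);

            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path("/user/rags/url-lists")); // the four URL files
            NLineInputFormat.setNumLinesPerSplit(job, 1);                         // 16 lines -> 16 mappers

            job.setMapperClass(UrlMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // the identity reducer runs by default; plug the real reduce step in here
            FileOutputFormat.setOutputPath(job, new Path("/user/rags/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }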

3.) You can further optimize the above method by loading the four URL files separately. While loading any one of these files you can remove the other three nodes, to ensure that the file goes exactly to the node where the data files it lists are locally present. While loading, choose a replication factor of 1 so that the blocks are not replicated. This process increases, to a very high degree, the probability that the maps launched will process the local files.
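
A sketch of the loading step with replication 1 (placeholder paths again); the key point is to run it on the node whose URL list is being loaded, since HDFS writes the first, and here only, replica to the local DataNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadUrlList {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "1");   // single replica for the URL file

            // default filesystem assumed to be HDFS on the cluster
            FileSystem fs = FileSystem.get(conf);

            // run on the node whose URLs are listed so the only replica is stored locally
            fs.copyFromLocalFile(new Path("/tmp/urls-for-this-node.txt"),
                                 new Path("/user/rags/url-lists/urls-for-this-node.txt"));
            fs.close();
        }
    }

If the files are already in HDFS, FileSystem.setReplication(path, (short) 1) can be used to drop the extra replicas instead.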

Cheers, Rags

  • Thanks for the advice. 1 is no good: this needs to be kept updated, so the data would only be useful for a day or so, and as the files I am reading are part of an active database the data would be far more stale. 2 is interesting; NLineInputFormat sounds like part of the answer but, as you point out, does not guarantee locality. 3 sounds plausible, but what do you mean by remove the other 3 nodes? I can't take parts of the cluster offline, if that's what you mean; the cluster needs to be able to respond to other jobs as well, so this has to be a background task considering the probable runtime. – feldoh Mar 27 '13 at 14:06
  • Hi Feldoh, you got me right: taking part of the cluster offline. Now I understand that it's not possible, as it is a production cluster. – Rags Mar 28 '13 at 12:07
  • While using the above method, a way to make sure that a map runs locally is to check the file location in the map and, if it is remote, throw an exception to fail the task. The job tracker will retry until the task succeeds, which means it finds the file locally (a rough sketch of this check follows these comments). But this is a crude method. To let the job continue, you may need to increase the max attempts for map tasks to a high number. – Rags Mar 28 '13 at 12:15
  • I had considered that approach; as you say it's a bit crude, and theoretically it is still possible that some nodes may never be chosen, but if that's all I can do then that's that. Thanks for the advice, I really appreciate you taking the time. – feldoh Mar 28 '13 at 12:30
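
A rough sketch of the locality check described in the comments above; the input line format (hostname followed by a file:// path) is an assumption for illustration:

    import java.io.IOException;
    import java.net.InetAddress;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LocalityCheckingMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // assumed line format: "<hostname> <file:///local/path>"
            String[] parts = line.toString().split("\\s+");
            String expectedHost = parts[0];
            String localDir = parts[1];

            String actualHost = InetAddress.getLocalHost().getHostName();
            if (!actualHost.equals(expectedHost)) {
                // crude: fail this attempt so the task is rescheduled elsewhere;
                // requires raising mapreduce.map.maxattempts (mapred.map.max.attempts on 1.x)
                throw new IOException("Not local: expected " + expectedHost
                        + " but running on " + actualHost);
            }

            // ...crawl localDir and emit rows here...
            context.write(new Text(localDir), new Text(actualHost));
        }
    }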