
My job flow is as follows:

I am processing a huge amount of data. I have a MapFile that needs to be cached. The size of this file is 1 GB now but I expect it to grow eventually.

The content of the MapFile would be something like this:

12345,45464       192.34.23.1
33214,45321       123.45.32.1
  • In the map phase, I process each record from the input files, which are in TextInputFormat. I parse each line (splitting it into tokens) and retrieve the first two tokens, token1 and token2.

If the pair (token1, token2) is not in the cached file, I make an API call, get the information, persist it in the cache (if possible), and proceed with processing.

    private Parser parser = new customParser();

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        parser.parse(value);
        Pair pair = new Pair();
        pair.setFirst(parser.getFirst());
        pair.setSecond(parser.getSecond());
        IP ip = null;

        // here is the catch:
        // check whether the pair exists in the cache
        if (cache.contains(pair)) {
            ip = cache.get(pair);
        } else {
            ip = getFromAPI(pair); // this makes an API call outside the network
            cache.put(pair, ip);
        }
        context.write(pair, ip);
    }

The main problems I am seeing here are:

  1. How to get the big file into a cache across all nodes. DistributedCache works by copying the file to each local node, but since this file is large, there is significant network traffic involved, and for my routine jobs I don't want to keep redistributing it.

  2. How to efficiently look up the MapFile (the cache), since the entire MapFile will not fit in memory.

  3. How to write to this MapFile, which is my cache.

Thanks

brain storm

1 Answer


As I see it there are three ways to handle this, and the best one depends on how your cache file will grow.

  1. If you don't expect the cache file to grow very much, and it will always fit into memory without hindering other applications or MapReduce jobs, you can put it into the HDFS cache. This feature has been available since Hadoop 2.3.0:

    HDFS caching lets users explicitly cache certain files or directories in HDFS. DataNodes will then cache the corresponding blocks in off-heap memory through the use of mmap and mlock. Once cached, Hadoop applications can query the locations of cached blocks and place their tasks for memory-locality. Finally, when memory-local, applications can use the new zero-copy read API to read cached data with no additional overhead.
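Below is a minimal sketch of pinning the file programmatically, assuming Hadoop 2.3.0+; the path /cache/ip-mapfile and the pool name lookup-cache are made up, so substitute your own. The hdfs cacheadmin -addPool and -addDirective shell commands achieve the same thing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class PinCacheFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at the cluster, so this cast is safe.
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(conf);

            // Create a pool to hold the directive (one-time setup).
            dfs.addCachePool(new CachePoolInfo("lookup-cache"));

            // Pin the MapFile directory; DataNodes mmap/mlock its blocks off-heap.
            dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
                    .setPath(new Path("/cache/ip-mapfile"))
                    .setPool("lookup-cache")
                    .build());
        }
    }

Keep in mind the file still has to fit within the DataNodes' configured cache capacity (dfs.datanode.max.locked.memory).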

The last two options are more suitable if you cannot safely keep the file in memory as it grows:

  2. This answer by Thomas Jungblut proposes putting your cache file into HDFS, increasing the replication count, and reading it with the FileSystem API. This would still involve network traffic for non-local replicas, but hopefully less than transferring it to all nodes via the DistributedCache. The FileSystem API also allows you to append to an existing file, letting you update the cache. (A sketch of this follows the list.)

  3. If your cache file is going to grow so much that you could have issues storing the extra replicas, you might instead want to consider retrieving it as part of the first mapping step.

    You could, for instance, take both the cache file and the file to be processed as input to the mapper, and for both inputs map on the token pair. In the reduce step, the new cache file is built by outputting nothing if a token pair has a line from both the cache file and the processed file, and outputting the corresponding cache line in the two other possible cases. (A second sketch after the list illustrates this.)
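For option 2, here is a rough sketch of the lookup side, under a few assumptions of mine: the keys are Text (your Pair class would have to be a WritableComparable for a MapFile to sort and look it up), the cache directory path is supplied by you, and the replication factor of 10 is an arbitrary example. A MapFile.Reader keeps only the small index file in memory and seeks into the data file per lookup, so the whole file never has to fit in memory:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileCache {
        private MapFile.Reader reader;

        // Call once per task, e.g. from Mapper#setup.
        public void open(Configuration conf, Path cacheDir) throws IOException {
            FileSystem fs = cacheDir.getFileSystem(conf);
            // Raise the replication of the MapFile's two parts so that
            // most tasks find a local replica.
            fs.setReplication(new Path(cacheDir, MapFile.DATA_FILE_NAME), (short) 10);
            fs.setReplication(new Path(cacheDir, MapFile.INDEX_FILE_NAME), (short) 10);
            reader = new MapFile.Reader(cacheDir, conf);
        }

        // Binary-search via the in-memory index, then seek into the data file.
        public Text lookup(Text pairKey) throws IOException {
            Text ip = new Text();
            return (Text) reader.get(pairKey, ip); // null if the key is absent
        }
    }

And for option 3, a sketch of the reduce step exactly as described above, assuming each mapper tags its output values with an origin prefix of my invention ("C:" for cache-file lines, "P:" for processed-file lines):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CacheRebuildReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pair, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            boolean fromCache = false;
            boolean fromProcessed = false;
            String line = null;
            for (Text value : values) {
                String tagged = value.toString();
                if (tagged.startsWith("C:")) {
                    fromCache = true;
                } else {
                    fromProcessed = true;
                }
                line = tagged.substring(2); // strip the origin tag
            }
            // As described above: emit nothing when the pair appears in both
            // inputs, and emit the cache line in the two remaining cases.
            if (fromCache != fromProcessed) {
                context.write(pair, new Text(line));
            }
        }
    }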

Alex A.
  • The method you propose for reading in the cache file during the mapping step is an interesting idea, but the downside here is that every mapper will do it, which is unnecessary overhead. – brain storm Nov 03 '14 at 21:24
  • I found a third option you might find suitable. Edited the post to include it. – Alex A. Nov 03 '14 at 23:06