My job flow is as follows:
I am processing a huge amount of data. I have a MapFile
that needs to be cached. The size of this file is 1 GB now but I expect it to grow eventually.
The content of the MapFile would be something like this:
12345,45464 192.34.23.1
33214,45321 123.45.32.1
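For concreteness, splitting one of these lines into its (token1, token2) pair and the IP only needs plain string handling, roughly like this (a minimal sketch; the class and method names here are made up, my real job uses a custom parser):

```java
public class LineParser {
    // Parses a line like "12345,45464 192.34.23.1" into its parts.
    // Returns {token1, token2, ip}; a real parser would validate the input.
    public static String[] parseLine(String line) {
        String[] halves = line.trim().split("\\s+"); // "12345,45464" and the IP
        String[] tokens = halves[0].split(",");      // "12345" and "45464"
        return new String[] { tokens[0], tokens[1], halves[1] };
    }

    public static void main(String[] args) {
        String[] parts = parseLine("12345,45464 192.34.23.1");
        System.out.println(parts[0] + " " + parts[1] + " " + parts[2]);
    }
}
```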
- In the map phase, I process each record from the input files, which are in TextInputFormat. I parse the line (split by tokens) and retrieve the first two tokens, token1 and token2.
If the pair (token1, token2) is not in the cached file, I make an API call, fetch the information, persist it in the cache (if possible), and proceed with processing.
private Parser parser = new customParser();

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    parser.parse(value);
    Pair pair = new Pair();
    pair.setFirst(parser.getFirst());
    pair.setSecond(parser.getSecond());

    IP ip;
    // here is the catch:
    // check if the pair exists in the cache
    if (cache.contains(pair)) {
        ip = cache.get(pair);
    } else {
        ip = getFromAPI(pair); // API call outside the network
        cache.put(pair, ip);
    }
    context.write(pair, ip);
}
}
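Since the whole MapFile cannot live in memory, one idea I had is to front the on-disk lookup with a small in-memory LRU layer, so `cache.get(pair)` above only falls through to disk (and then to the API) on a miss. A sketch using `LinkedHashMap` in access-order mode (the class name and capacity are my own choices):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: an LRU layer that keeps only the hottest entries in memory,
// falling back to the on-disk lookup (or the API) on a miss.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // access-order: get() refreshes an entry
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict least-recently-used beyond the cap
    }
}
```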
The main problems I am seeing here are:
How to get the big file into the cache across all nodes. DistributedCache works by copying the file to each local node, but since this file is large, there is significant network traffic involved, and for my routine jobs I don't want to keep distributing it.
How to efficiently look up the MapFile (the cache), given that the entire MapFile will not be in memory.
How to write to this MapFile, which is my cache.
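For the last two problems, what I have found so far is the `MapFile.Reader.get()` / `MapFile.Writer.append()` API. An untested sketch of how I imagine using it (the paths are placeholders, `Text` keys/values are assumptions, and as I understand it a MapFile keeps only a sparse index in memory and seeks on disk for lookups):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Lookup: MapFile keeps a sparse in-memory index over the sorted
        // data file, so get() seeks on disk instead of loading everything.
        MapFile.Reader reader = new MapFile.Reader(fs, "/cache/pairs", conf);
        Text value = new Text();
        if (reader.get(new Text("12345,45464"), value) != null) {
            System.out.println("hit: " + value);
        }
        reader.close();

        // Writing: keys must be appended in sorted order, so I suspect I
        // cannot update the MapFile in place from the mappers; I would have
        // to collect new entries and merge/rewrite the file afterwards.
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "/cache/pairs-new", Text.class, Text.class);
        writer.append(new Text("33214,45321"), new Text("123.45.32.1"));
        writer.close();
    }
}
```

Is this the right direction, or is a MapFile simply the wrong structure for a cache that is written to during the job?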
Thanks