
I have been trying a replicated join using the DistributedCache, on both a cluster and the Karmasphere interface. I have pasted the code below. My program is unable to find the file in the cache.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Hashtable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// A demonstration of Hadoop's DistributedCache tool

public class MapperSideJoinWithDistributedCache extends Configured implements Tool {
        private static final String inputa = "C:/Users/LopezGG/workspace/Second_join/input1_1";
public static class MapClass extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

  private Hashtable<String, String> joinData = new Hashtable<String, String>();

  @Override
  public void configure(JobConf conf) {
    try {
      Path [] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
          System.out.println("ds"+DistributedCache.getLocalCacheFiles(conf));
      if (cacheFiles != null && cacheFiles.length > 0) {
        String line;
        String[] tokens;
        BufferedReader joinReader = new BufferedReader(new FileReader(cacheFiles[0].toString()));

        try {
          while ((line = joinReader.readLine()) != null) {
            tokens = line.split(",", 2);
            joinData.put(tokens[0], tokens[1]);
          }
        } finally {
          joinReader.close();
        }
      }
      else
          System.out.println("joinreader not set" );
    } catch(IOException e) {
      System.err.println("Exception reading DistributedCache: " + e);
    }
  }

  public void map(Text key, Text value, OutputCollector<Text, Text> output,  Reporter reporter) throws IOException {
    String joinValue = joinData.get(key.toString());
    if (joinValue != null) {
    output.collect(key,new Text(value.toString() + "," + joinValue));
    }
  }
}


public int run(String[] args) throws Exception {
  Configuration conf = getConf();
  JobConf job = new JobConf(conf, MapperSideJoinWithDistributedCache.class);

  DistributedCache.addCacheFile(new Path(args[0]).toUri(), job); 
  //System.out.println( DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf));
    Path in = new Path(args[1]);
  Path out = new Path(args[2]);
  FileInputFormat.setInputPaths(job, in);
  FileOutputFormat.setOutputPath(job, out);
  job.setJobName("DataJoin with DistributedCache");
  job.setMapperClass(MapClass.class);
  job.setNumReduceTasks(0);
  job.setInputFormat( KeyValueTextInputFormat.class);
  job.setOutputFormat(TextOutputFormat.class);
  job.set("key.value.separator.in.input.line", ",");
  JobClient.runJob(job);
  return 0;
}

  public static void main(String[] args) throws Exception {
    long time1 = System.currentTimeMillis();
    System.out.println(time1);
    int res = ToolRunner.run(new Configuration(),
        new MapperSideJoinWithDistributedCache(), args);
    long time2 = System.currentTimeMillis();
    System.out.println(time2);
    System.out.println("millisecs elapsed: " + (time2 - time1));
    System.exit(res);

  }
}

The error I get is

O mapred.MapTask: numReduceTasks: 0
Exception reading DistributedCache: java.io.FileNotFoundException: \tmp\hadoop-LopezGG\mapred\local\archive\-2564469513526622450_-1173562614_1653082827\file\C\Users\LopezGG\workspace\Second_join\input1_1 (The system cannot find the file specified)
ds[Lorg.apache.hadoop.fs.Path;@366a88bb
12/04/24 23:15:01 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/04/24 23:15:01 INFO mapred.LocalJobRunner: 

But the task executes to completion. Could someone please help me? I have looked at the other posts and made all the suggested modifications, but it still does not work.

LGG

1 Answer


I must confess that I never use the DistributedCache class (rather I use the -files option via the GenericOptionsParser), but I'm not sure the DistributedCache automatically copies local files into HDFS prior to running your job.
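
For what it's worth, the -files route would look roughly like this (a sketch only: the jar name and local path are placeholders, and it assumes the driver keeps using ToolRunner and no longer takes the cache file as args[0]):

// Hypothetical invocation; generic options such as -files must come
// before the job's own arguments, and the jar name/paths are made up:
//
//   hadoop jar second_join.jar MapperSideJoinWithDistributedCache \
//       -files /home/LopezGG/input1_1 input_dir output_dir
//
// The file is then made available in each task's working directory under
// its base name, so configure() can simply open it by name:
BufferedReader joinReader = new BufferedReader(new FileReader("input1_1"));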

While I can't find any evidence of this in the Hadoop docs, there is a passage in the Pro Hadoop book that mentions something to this effect.

In your case, copy the file to HDFS first, then when you call DistributedCache.addCacheFile, pass the URI of the file in HDFS, and see if that works for you.
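
Something like the following in your run() method, for instance (a rough sketch, not tested: the HDFS target path is made up, and it needs an extra import of org.apache.hadoop.fs.FileSystem):

// Copy the local join file into HDFS first, then register the HDFS URI
// with the DistributedCache (the target path below is illustrative only).
FileSystem fs = FileSystem.get(conf);
Path hdfsCachePath = new Path("/user/LopezGG/cache/input1_1");
fs.copyFromLocalFile(new Path(args[0]), hdfsCachePath);
DistributedCache.addCacheFile(hdfsCachePath.toUri(), job);

With that in place, getLocalCacheFiles() in your configure() should hand back a genuinely local path that FileReader can open.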

Chris White