
I've written an MR algorithm over some data to build a data structure. After it is created I need to answer some queries against it. To answer these queries faster I built metadata (several MBs) from the result.

Now my question is this:

Is it possible to keep this metadata in the memory of the master node, so that queries can be answered without file I/O and therefore faster?

AKJ88
  • What do you mean by "create a data structure"? When you say queries, do you mean you're going to run an MR job for each query? Please explain the scenario. – Jo Kachikaran Apr 17 '16 at 21:48
  • Imagine it like this: you have a B-Tree in memory that points to data files on HDFS. For a query you consult the B-Tree to locate some data files and then run an MR job on them. – AKJ88 Apr 18 '16 at 08:42

1 Answer


Assuming, based on the OP's comment on the other answer, that the metadata will be needed by another MR job, using the Distributed Cache in this case is rather easy:

In the driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverClass extends Configured {

  public static void main(String[] args) throws Exception {

    /* ...some init code... */

    /*
     * Instantiate a Job object for your job's configuration.
     */
    Configuration job_conf = new Configuration();

    /* Register the metadata file so it is copied to every task node. */
    DistributedCache.addCacheFile(new Path("path/to/your/data.txt").toUri(), job_conf);
    Job job = new Job(job_conf);

    /* ... configure and start the job... */

  }
}

In the mapper class you can read the data in the setup stage and make it available to the map method:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.google.common.io.Files;

public class YourMapper extends Mapper<LongWritable, Text, Text, Text> {

  private List<String> lines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException,
      InterruptedException {

    /* Get the local paths of the cached files */
    Path[] cached_files =
        DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached_files == null || cached_files.length == 0) {
      throw new IOException("No files found in the distributed cache");
    }

    File f = new File(cached_files[0].toString());

    /* Read the data, e.g. with Guava's Files.readLines: */
    lines = Files.readLines(f, Charset.forName("UTF-8"));
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

      /*
      * In the mapper - use the data as needed
      */

  }
}
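The "several MBs of metadata in memory" part of the question then reduces to loading the cached file into a lookup structure during setup(). A minimal, Hadoop-free sketch of that idea (the class name, the tab-separated key/value format, and the sample paths are all hypothetical, not from the question):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MetadataLookup {

    /* In-memory index: logical block name -> HDFS file it lives in. */
    private final Map<String, String> index = new HashMap<>();

    /* Load "key<TAB>value" lines into memory, as setup() would. */
    public void load(Path metadataFile) throws IOException {
        for (String line : Files.readAllLines(metadataFile, StandardCharsets.UTF_8)) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                index.put(parts[0], parts[1]);
            }
        }
    }

    /* Constant-time lookup, no file I/O per query. */
    public String lookup(String key) {
        return index.get(key);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("metadata", ".txt");
        Files.write(tmp, Arrays.asList(
                "block1\t/data/part-00000",
                "block2\t/data/part-00001"));
        MetadataLookup m = new MetadataLookup();
        m.load(tmp);
        System.out.println(m.lookup("block2")); // prints /data/part-00001
    }
}
```

Once loaded, every map() call can consult the index without touching the file system again.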

Note that the Distributed Cache can hold more than plain text files. You can also distribute archives (zip, tar, ...) and even full Java classes (jar files).

Also note that in newer Hadoop versions the Distributed Cache API has moved onto the Job class itself, and the DistributedCache class is deprecated there. Refer to the Job API.
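A sketch of the same flow with the newer (Hadoop 2.x mapreduce) API, where the cache methods live on Job and the mapper reads them back from its context; this is a fragment to splice into a driver and mapper, not a complete job:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class NewApiDriver {
  public static void main(String[] args) throws Exception {
    /* Driver side: register the cache file directly on the Job. */
    Job job = Job.getInstance(new Configuration());
    job.addCacheFile(new Path("path/to/your/data.txt").toUri());
    /* ... configure mapper/reducer and submit as usual ... */
  }
}

/* Mapper side, inside setup(Context context):
 *
 *   URI[] cacheFiles = context.getCacheFiles();
 *   // cacheFiles[0] is the URI registered above; open and read it
 *   // the same way as in the DistributedCache example.
 */
```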

It-Z