Assuming, based on the OP's response to the other answer, that the metadata will be needed by another MR job, using the Distributed Cache in this case is rather easy:
In the driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverClass extends Configured {

    public static void main(String[] args) throws Exception {
        /* ...some init code... */

        /* Instantiate a Configuration object for your job. */
        Configuration job_conf = new Configuration();

        /* Register the metadata file with the Distributed Cache; the path
         * should point to a file on HDFS (or any other filesystem visible
         * to every node). */
        DistributedCache.addCacheFile(new Path("path/to/your/data.txt").toUri(), job_conf);

        Job job = new Job(job_conf);
        /* ... configure and start the job... */
    }
}
In the mapper class you can read the data in the setup stage and make it available to the map method:
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.google.common.base.Charsets;
import com.google.common.io.Files;

public class YourMapper extends Mapper<LongWritable, Text, Text, Text> {

    private List<String> lines;

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        /* Get the local paths of the cached files. setup() already declares
         * IOException, so there is no need to catch and swallow it here. */
        Path[] cached_files = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        /* Read the data once per task, e.g. with Guava's Files utility: */
        File f = new File(cached_files[0].toString());
        lines = Files.readLines(f, Charsets.UTF_8);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /*
         * In the mapper - use the data as needed
         */
    }
}
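For example, the map method could filter the input records against the cached metadata. This is just a minimal sketch; the tab-separated format and the join logic are assumptions, not part of the original question:

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /* Hypothetical example: keep only records whose first
         * tab-separated field appears in the cached metadata. */
        String[] fields = value.toString().split("\t");
        if (lines.contains(fields[0])) {
            context.write(new Text(fields[0]), value);
        }
    }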
Note that the Distributed Cache can hold more than plain text files. You can also distribute archives (zip, tar, ...) and even jars with full Java classes.
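For example, an archive can be registered in the driver with the same old-style API (the path here is just a placeholder); it will be unpacked on each task node:

    /* Ships the archive to every task node and unpacks it locally. */
    DistributedCache.addCacheArchive(new Path("path/to/your/metadata.zip").toUri(), job_conf);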
Also note that in newer Hadoop versions the Distributed Cache API has moved into the Job class itself. Refer to this API and this answer.
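As a rough sketch of that newer style (Hadoop 2.x; check the Job javadoc for your exact version):

    // In the driver: register the file directly on the Job.
    Job job = Job.getInstance(new Configuration());
    job.addCacheFile(new Path("path/to/your/data.txt").toUri());

    // In the mapper's setup(): the cached files (java.net.URI[]) come from the context.
    URI[] cacheFiles = context.getCacheFiles();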