We are trying to design a simple program, where the goal is to read the patent data from a file, and check if other countries have cited that patent or not, this is from the text book 'Hadoop in Action'
by 'chuck Lam'
, where we are trying to learn about advanced map/reduce programming
.
The hadoop distribution which we have setup is Local Node
, and we are executing the program in the Windows environment
, using cygwin
.
This is the URL http://www.nber.org/patents/
from which we downloaded files : apat63_99.txt
and cite75_99.txt
.
We are using 'apat63_99.txt'
as the distributed cache files, and the 'cite75_99.txt'
is in the input
folder, which we are passing from the command line parameters.
The problem is that the program is not generating output, the output files which we are seeing has no data in it.
We have tried with the mapper phase as well as the reducer phase output and both are blank.
Here is the code which we have developed for this task:
package com.sample.patent;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Hashtable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class country_cite {
private static Hashtable<String, String> joinData
= new Hashtable<String, String>();
public static class Country_Citation_Class extends
Mapper<Text, Text, Text, Text> {
Path[] cacheFiles;
public void configure(JobConf conf) {
try {
cacheFiles = DistributedCache.getLocalCacheArchives(conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
if (cacheFiles != null && cacheFiles.length > 0) {
String line;
String[] tokens;
BufferedReader joinReader = new BufferedReader(new FileReader(
cacheFiles[0].toString()));
try {
while ((line = joinReader.readLine()) != null) {
tokens = line.split(",");
joinData.put(tokens[0], tokens[4]);
}
} finally {
joinReader.close();
}
}
if (joinData.get(key) != null)
context.write(key, new Text(joinData.get(key)));
}
}
public static class MyReduceClass extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String patent_country = joinData.get(key);
if (patent_country != null) {
for (Text val : values) {
String cited_country = joinData.get(val);
if (cited_country != null
&& !cited_country.equals(patent_country)) {
context.write(key, new Text(cited_country));
}
}
}
}
}
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new Path(args[0]).toUri(),
conf);
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 3) {
System.err.println("Usage: country_cite <in> <out>");
System.exit(2);
}
Job job = new Job(conf,"country_cite");
job.setJarByClass(country_cite.class);
job.setMapperClass(Country_Citation_Class.class);
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
// job.setReducerClass(MyReduceClass.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The tool is Eclipse
and Hadoop's version
which we are using is 1.2.1
.
These are the command line parameters to run the job:
/cygdrive/c/cygwin64/usr/local/hadoop
$ bin/hadoop jar PatentCitation.jar country_cite apat63_99.txt input output
This is the trace which gets generated while the program executes:
/cygdrive/c/cygwin64/usr/local/hadoop
$ bin/hadoop jar PatentCitation.jar country_cite apat63_99.txt input output
Patch for HADOOP-7682: Instantiating workaround file system
14/06/22 12:39:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Patch for HADOOP-7682: Ignoring IOException setting persmission for path "file:/tmp/hadoop-RaoSa/mapred/staging/RaoSa1277400315/.staging": Failed to set permissions of path: \tmp\hadoop-RaoSa\mapred\staging\RaoSa1277400315\.staging to 0700
Patch for HADOOP-7682: Ignoring IOException setting persmission for path "file:/tmp/hadoop-RaoSa/mapred/staging/RaoSa1277400315/.staging/job_local1277400315_0001": Failed to set permissions of path: \tmp\hadoop-RaoSa\mapred\staging\RaoSa1277400315\.staging\job_local1277400315_0001 to 0700
14/06/22 12:39:21 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/06/22 12:39:21 INFO input.FileInputFormat: Total input paths to process : 1
14/06/22 12:39:21 WARN snappy.LoadSnappy: Snappy native library not loaded
Patch for HADOOP-7682: Ignoring IOException setting persmission for path "file:/tmp/hadoop-RaoSa/mapred/staging/RaoSa1277400315/.staging/job_local1277400315_0001/job.split": Failed to set permissions of path: \tmp\hadoop-RaoSa\mapred\staging\RaoSa1277400315\.staging\job_local1277400315_0001\job.split to 0644
Patch for HADOOP-7682: Ignoring IOException setting persmission for path "file:/tmp/hadoop-RaoSa/mapred/staging/RaoSa1277400315/.staging/job_local1277400315_0001/job.splitmetainfo": Failed to set permissions of path: \tmp\hadoop-RaoSa\mapred\staging\RaoSa1277400315\.staging\job_local1277400315_0001\job.splitmetainfo to 0644
Patch for HADOOP-7682: Ignoring IOException setting persmission for path "file:/tmp/hadoop-RaoSa/mapred/staging/RaoSa1277400315/.staging/job_local1277400315_0001/job.xml": Failed to set permissions of path: \tmp\hadoop-RaoSa\mapred\staging\RaoSa1277400315\.staging\job_local1277400315_0001\job.xml to 0644
14/06/22 12:39:23 INFO filecache.TrackerDistributedCacheManager: Creating fileapat63_99.txt in /tmp/hadoop-RaoSa/mapred/local/archive/7067728792316735217_-679065598_1881640498-work-5016028422992714806 with rwxr-xr-x
Patch for HADOOP-7682: Ignoring IOException setting persmission for path "/tmp/hadoop-RaoSa/mapred/local/archive/7067728792316735217_-679065598_1881640498-work-5016028422992714806": Failed to set permissions of path: \tmp\hadoop-RaoSa\mapred\local\archive\7067728792316735217_-679065598_1881640498-work-5016028422992714806 to 0755
14/06/22 12:40:06 INFO filecache.TrackerDistributedCacheManager: Cached apat63_99.txt as /tmp/hadoop-RaoSa/mapred/local/archive/7067728792316735217_-679065598_1881640498/fileapat63_99.txt
14/06/22 12:40:08 INFO filecache.TrackerDistributedCacheManager: Cached apat63_99.txt as /tmp/hadoop-RaoSa/mapred/local/archive/7067728792316735217_-679065598_1881640498/fileapat63_99.txt
14/06/22 12:40:09 INFO mapred.JobClient: Running job: job_local1277400315_0001
14/06/22 12:40:10 INFO mapred.LocalJobRunner: Waiting for map tasks
14/06/22 12:40:10 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000000_0
14/06/22 12:40:10 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:10 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:0+33554432
14/06/22 12:40:10 INFO mapred.JobClient: map 0% reduce 0%
14/06/22 12:40:15 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000000_0 is done. And is in the process of commiting
14/06/22 12:40:15 INFO mapred.LocalJobRunner:
14/06/22 12:40:15 INFO mapred.Task: Task attempt_local1277400315_0001_m_000000_0 is allowed to commit now
14/06/22 12:40:15 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000000_0' to output
14/06/22 12:40:15 INFO mapred.LocalJobRunner:
14/06/22 12:40:15 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000000_0' done.
14/06/22 12:40:15 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000000_0
14/06/22 12:40:15 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000001_0
14/06/22 12:40:15 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:15 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:33554432+33554432
14/06/22 12:40:16 INFO mapred.JobClient: map 12% reduce 0%
14/06/22 12:40:21 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000001_0 is done. And is in the process of commiting
14/06/22 12:40:21 INFO mapred.LocalJobRunner:
14/06/22 12:40:21 INFO mapred.Task: Task attempt_local1277400315_0001_m_000001_0 is allowed to commit now
14/06/22 12:40:21 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000001_0' to output
14/06/22 12:40:21 INFO mapred.LocalJobRunner:
14/06/22 12:40:21 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000001_0' done.
14/06/22 12:40:21 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000001_0
14/06/22 12:40:21 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000002_0
14/06/22 12:40:21 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:21 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:67108864+33554432
14/06/22 12:40:21 INFO mapred.JobClient: map 25% reduce 0%
14/06/22 12:40:26 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000002_0 is done. And is in the process of commiting
14/06/22 12:40:26 INFO mapred.LocalJobRunner:
14/06/22 12:40:26 INFO mapred.Task: Task attempt_local1277400315_0001_m_000002_0 is allowed to commit now
14/06/22 12:40:26 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000002_0' to output
14/06/22 12:40:26 INFO mapred.LocalJobRunner:
14/06/22 12:40:26 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000002_0' done.
14/06/22 12:40:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000002_0
14/06/22 12:40:26 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000003_0
14/06/22 12:40:26 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:26 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:100663296+33554432
14/06/22 12:40:26 INFO mapred.JobClient: map 37% reduce 0%
14/06/22 12:40:29 INFO mapred.LocalJobRunner:
14/06/22 12:40:29 INFO mapred.JobClient: map 42% reduce 0%
14/06/22 12:40:29 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000003_0 is done. And is in the process of commiting
14/06/22 12:40:29 INFO mapred.LocalJobRunner:
14/06/22 12:40:29 INFO mapred.Task: Task attempt_local1277400315_0001_m_000003_0 is allowed to commit now
14/06/22 12:40:29 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000003_0' to output
14/06/22 12:40:29 INFO mapred.LocalJobRunner:
14/06/22 12:40:29 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000003_0' done.
14/06/22 12:40:29 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000003_0
14/06/22 12:40:29 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000004_0
14/06/22 12:40:29 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:29 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:134217728+33554432
14/06/22 12:40:30 INFO mapred.JobClient: map 50% reduce 0%
14/06/22 12:40:30 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000004_0 is done. And is in the process of commiting
14/06/22 12:40:30 INFO mapred.LocalJobRunner:
14/06/22 12:40:30 INFO mapred.Task: Task attempt_local1277400315_0001_m_000004_0 is allowed to commit now
14/06/22 12:40:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000004_0' to output
14/06/22 12:40:30 INFO mapred.LocalJobRunner:
14/06/22 12:40:30 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000004_0' done.
14/06/22 12:40:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000004_0
14/06/22 12:40:30 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000005_0
14/06/22 12:40:30 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:30 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:167772160+33554432
14/06/22 12:40:31 INFO mapred.JobClient: map 62% reduce 0%
14/06/22 12:40:31 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000005_0 is done. And is in the process of commiting
14/06/22 12:40:31 INFO mapred.LocalJobRunner:
14/06/22 12:40:31 INFO mapred.Task: Task attempt_local1277400315_0001_m_000005_0 is allowed to commit now
14/06/22 12:40:31 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000005_0' to output
14/06/22 12:40:31 INFO mapred.LocalJobRunner:
14/06/22 12:40:31 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000005_0' done.
14/06/22 12:40:31 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000005_0
14/06/22 12:40:31 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000006_0
14/06/22 12:40:31 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:31 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:201326592+33554432
14/06/22 12:40:32 INFO mapred.JobClient: map 75% reduce 0%
14/06/22 12:40:32 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000006_0 is done. And is in the process of commiting
14/06/22 12:40:32 INFO mapred.LocalJobRunner:
14/06/22 12:40:32 INFO mapred.Task: Task attempt_local1277400315_0001_m_000006_0 is allowed to commit now
14/06/22 12:40:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000006_0' to output
14/06/22 12:40:32 INFO mapred.LocalJobRunner:
14/06/22 12:40:32 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000006_0' done.
14/06/22 12:40:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000006_0
14/06/22 12:40:32 INFO mapred.LocalJobRunner: Starting task: attempt_local1277400315_0001_m_000007_0
14/06/22 12:40:32 INFO mapred.Task: Using ResourceCalculatorPlugin : null
14/06/22 12:40:33 INFO mapred.MapTask: Processing split: file:/C:/cygwin64/usr/local/hadoop/input/cite75_99.txt:234881024+29194407
14/06/22 12:40:33 INFO mapred.JobClient: map 87% reduce 0%
14/06/22 12:40:35 INFO mapred.Task: Task:attempt_local1277400315_0001_m_000007_0 is done. And is in the process of commiting
14/06/22 12:40:35 INFO mapred.LocalJobRunner:
14/06/22 12:40:35 INFO mapred.Task: Task attempt_local1277400315_0001_m_000007_0 is allowed to commit now
14/06/22 12:40:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1277400315_0001_m_000007_0' to output
14/06/22 12:40:35 INFO mapred.LocalJobRunner:
14/06/22 12:40:35 INFO mapred.Task: Task 'attempt_local1277400315_0001_m_000007_0' done.
14/06/22 12:40:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local1277400315_0001_m_000007_0
14/06/22 12:40:35 INFO mapred.LocalJobRunner: Map task executor complete.
14/06/22 12:40:35 INFO mapred.JobClient: map 100% reduce 0%
14/06/22 12:40:35 INFO mapred.JobClient: Job complete: job_local1277400315_0001
14/06/22 12:40:35 INFO mapred.JobClient: Counters: 9
14/06/22 12:40:35 INFO mapred.JobClient: File Output Format Counters
14/06/22 12:40:35 INFO mapred.JobClient: Bytes Written=64
14/06/22 12:40:35 INFO mapred.JobClient: FileSystemCounters
14/06/22 12:40:35 INFO mapred.JobClient: FILE_BYTES_READ=5009033659
14/06/22 12:40:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3820489832
14/06/22 12:40:35 INFO mapred.JobClient: File Input Format Counters
14/06/22 12:40:35 INFO mapred.JobClient: Bytes Read=264104103
14/06/22 12:40:35 INFO mapred.JobClient: Map-Reduce Framework
14/06/22 12:40:35 INFO mapred.JobClient: Map input records=16522439
14/06/22 12:40:35 INFO mapred.JobClient: Spilled Records=0
14/06/22 12:40:35 INFO mapred.JobClient: Total committed heap usage (bytes)=708313088
14/06/22 12:40:35 INFO mapred.JobClient: Map output records=0
14/06/22 12:40:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=952
Kindly let us know where are we going wrong, in case if I have missed any vital information, let me know.
Thanks and Regards