
I'm using Nutch 1.13 to crawl data and store it in Elasticsearch. I have also created some custom parse filter and index filter plugins. Everything was working fine.
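
For reference, a custom index filter in Nutch 1.x just implements org.apache.nutch.indexer.IndexingFilter. The sketch below is a minimal, hypothetical example (the package, class, and field names are illustrative, not the actual plugins I wrote) that copies one parse-metadata value onto the document before it is handed to the index writer:

    // Hypothetical minimal indexing filter for Nutch 1.x (names are illustrative only).
    package org.example.nutch;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class MyIndexingFilter implements IndexingFilter {

      private Configuration conf;

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // "myfield" is assumed to have been set earlier by a custom parse filter.
        String value = parse.getData().getParseMeta().get("myfield");
        if (value != null) {
          doc.add("myfield", value); // ends up as a field in the Elasticsearch document
        }
        return doc; // returning null would exclude the page from the index
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }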

I then updated Elasticsearch to version 5, and the indexer-elastic plugin stopped working because of the version mismatch. From some documentation I also learned that Elasticsearch 5 is only supported by Nutch 2.x versions.

Still, I wanted to stick with this Nutch version, and I found a plugin (from here) that indexes to Elasticsearch over REST. I made the changes in Nutch to include this plugin.

Crawling and indexing worked fine in Nutch's local mode. When I tried the same in deployed mode, the indexing phase failed with the following exception:

17/11/16 10:53:37 INFO mapreduce.Job: Running job: job_1510809462003_0010
17/11/16 10:53:44 INFO mapreduce.Job: Job job_1510809462003_0010 running in uber mode : false
17/11/16 10:53:44 INFO mapreduce.Job:  map 0% reduce 0%
17/11/16 10:53:48 INFO mapreduce.Job:  map 20% reduce 0%
17/11/16 10:53:52 INFO mapreduce.Job:  map 40% reduce 0%
17/11/16 10:53:56 INFO mapreduce.Job:  map 60% reduce 0%
17/11/16 10:53:59 INFO mapreduce.Job:  map 80% reduce 20%
17/11/16 10:54:02 INFO mapreduce.Job:  map 100% reduce 100%
17/11/16 10:54:02 INFO mapreduce.Job: Task Id : attempt_1510809462003_0010_r_000000_0, Status : FAILED
Error: INSTANCE
17/11/16 10:54:03 INFO mapreduce.Job:  map 100% reduce 0%
17/11/16 10:54:06 INFO mapreduce.Job: Task Id : attempt_1510809462003_0010_r_000000_1, Status : FAILED
Error: INSTANCE
17/11/16 10:54:10 INFO mapreduce.Job: Task Id : attempt_1510809462003_0010_r_000000_2, Status : FAILED
Error: INSTANCE
17/11/16 10:54:15 INFO mapreduce.Job:  map 100% reduce 100%
17/11/16 10:54:15 INFO mapreduce.Job: Job job_1510809462003_0010 failed with state FAILED due to: Task failed task_1510809462003_0010_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1

17/11/16 10:54:15 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=804602
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=44204
HDFS: Number of bytes written=0
HDFS: Number of read operations=20
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters 
Failed reduce tasks=4
Killed map tasks=1
Launched map tasks=5
Launched reduce tasks=4
Data-local map tasks=5
Total time spent by all maps in occupied slots (ms)=39484
Total time spent by all reduces in occupied slots (ms)=16866
Total time spent by all map tasks (ms)=9871
Total time spent by all reduce tasks (ms)=16866
Total vcore-milliseconds taken by all map tasks=9871
Total vcore-milliseconds taken by all reduce tasks=16866
Total megabyte-milliseconds taken by all map tasks=40431616
Total megabyte-milliseconds taken by all reduce tasks=17270784
Map-Reduce Framework
Map input records=436
Map output records=436
Map output bytes=55396
Map output materialized bytes=56302
Input split bytes=698
Combine input records=0
Spilled Records=436
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=246
CPU time spent (ms)=3840
Physical memory (bytes) snapshot=1559916544
Virtual memory (bytes) snapshot=25255698432
Total committed heap usage (bytes)=1503657984
File Input Format Counters 
Bytes Read=43506
17/11/16 10:54:15 ERROR impl.JobWorker: Cannot run job worker!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:94)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:87)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:352)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:71)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The Hadoop task log shows:

2017-11-16 10:54:13,731 INFO [main] org.apache.nutch.indexer.IndexWriters: Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
2017-11-16 10:54:13,801 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: INSTANCE
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
    at org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter.open(ElasticRestIndexWriter.java:133)
    at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:75)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

After searching for this error, I learned that it is caused by a version conflict in the Apache HttpComponents (httpclient/httpcore) jars. The Hadoop version I used is 2.7.2; I also tried Hadoop 2.8.2 and got the same result.
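
To see which jars are actually winning on the task classpath, a small diagnostic like the one below can help. This is my own hedged sketch, not part of Nutch; the class names are the ones from the stack trace plus one class that lives in httpcore. Running it in the failing environment should point at the conflicting jar locations:

    // Hypothetical diagnostic: print which jar each HttpComponents class is loaded from.
    public class HttpJarCheck {

      public static void main(String[] args) {
        printSource("org.apache.http.conn.ssl.SSLConnectionSocketFactory"); // httpclient
        printSource("org.apache.http.conn.ssl.AllowAllHostnameVerifier");   // httpclient
        printSource("org.apache.http.protocol.HttpRequestExecutor");        // httpcore
      }

      private static void printSource(String className) {
        try {
          Class<?> c = Class.forName(className);
          // The code source is the jar the class was actually resolved from at runtime.
          java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
          System.out.println(className + " -> "
              + (src == null ? "unknown/bootstrap" : src.getLocation()));
        } catch (ClassNotFoundException e) {
          System.out.println(className + " -> not on the classpath");
        }
      }
    }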

I'm looking for a solution.

SOLVED: The issue was an older httpcore jar bundled with Hadoop 2.7.2. Removing those jars fixed the problem.
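
For completeness, an alternative I did not try (so treat it as an assumption): instead of deleting Hadoop's bundled jars, MapReduce can be told to prefer the jars shipped inside the job's own job jar via the standard mapreduce.job.user.classpath.first property, either in mapred-site.xml or programmatically:

    // Hypothetical alternative: let the newer httpclient/httpcore inside the Nutch job jar
    // shadow the older copies bundled with the Hadoop installation.
    import org.apache.hadoop.conf.Configuration;

    public class ClasspathFirst {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
        // ... pass this Configuration to the job submission (or set the same
        // property in mapred-site.xml, or on the command line with -D).
      }
    }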
