I am crawling with Apache Nutch 1.13. During the parse step I get the error below. I have not been able to identify the URL that triggers it.

java.lang.Exception: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
        at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:120)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:134)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:107)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:109)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I went through the logs but could not work out the cause. Any help will be appreciated!

Vibhor Verma
  • Looks like you have an old version of Commons Compress on your classpath. How did you get the jars, manually or with something like maven? – Gagravarr Aug 25 '18 at 13:51
  • Nutch 1.14 fixed an issue ([NUTCH-2378](https://issues.apache.org/jira/browse/NUTCH-2378)): core library dependencies were prioritized over plugin dependencies. – Sebastian Nagel Aug 26 '18 at 16:36
  • @Gagravarr I went to the Maven website and grabbed it. – Vibhor Verma Aug 26 '18 at 21:11
  • @SebastianNagel I patched the issue myself. I found what looked like redundant code in Apache Tika at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:120), so I removed the boolean argument, since it seemed to behave the same without it, then recompiled and used that build. – Vibhor Verma Aug 26 '18 at 21:15
  • Thank you @SebastianNagel, I missed that issue when I was browsing the Nutch issue tracker. I will add this patch to Nutch :D – Vibhor Verma Aug 26 '18 at 21:16
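
The comments point to an older Commons Compress jar taking precedence over the one Tika expects. One way to check this is a small reflection probe that reports where a class was loaded from and whether it has the constructor named in the trace (`<init>(Z)V` is a constructor taking a single boolean). This is only a rough sketch: `ClasspathProbe` and its methods are hypothetical names, not part of Nutch or Tika, and because Nutch loads plugins through their own class loaders, a standalone run may not see the same jars as the parse job.

```java
// Diagnostic sketch: ClasspathProbe is a hypothetical helper, not part of Nutch or Tika.
final class ClasspathProbe {

    /** Returns where a class was loaded from, or a note if the location is unknown. */
    static String locationOf(Class<?> cls) {
        java.security.CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes that ship with the JDK itself usually have no code source.
        return (src == null || src.getLocation() == null)
                ? "(bootstrap or unknown)"
                : src.getLocation().toString();
    }

    /** True if the named class is visible and has a public constructor with these parameter types. */
    static boolean hasConstructor(String className, Class<?>... paramTypes) {
        try {
            Class.forName(className).getConstructor(paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.compress.compressors.CompressorStreamFactory";
        // The stack trace complains about a missing constructor that takes one boolean.
        System.out.println("has (boolean) constructor: " + hasConstructor(name, boolean.class));
        try {
            System.out.println("loaded from: " + locationOf(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println("Commons Compress is not on this classpath");
        }
    }
}
```

If the probe is run with the same classpath as the crawl and reports that the constructor is missing, the printed jar location shows which copy of Commons Compress is being picked up first.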
