I'm using Apache Tika 1.0. Whenever I parse PDF files with ForkParser, I get the following NoClassDefFoundError:

java.lang.NoClassDefFoundError: org/apache/tika/fork/MemoryURLStreamHandler$Record
    at org.apache.tika.fork.MemoryURLStreamHandler.createURL(MemoryURLStreamHandler.java:46)
    at org.apache.tika.fork.ClassLoaderProxy.findResource(ClassLoaderProxy.java:73)
    at java.lang.ClassLoader.getResource(ClassLoader.java:977)
    at org.apache.log4j.helpers.Loader.getResource(Loader.java:96)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:105)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:289)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:109)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.commons.logging.impl.LogFactoryImpl.createLogFromClass(LogFactoryImpl.java:1116)
    at org.apache.commons.logging.impl.LogFactoryImpl.discoverLogImplementation(LogFactoryImpl.java:914)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:604)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:336)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:310)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
    at org.apache.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:58)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1087)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:80)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.tika.fork.ForkServer.call(ForkServer.java:136)
    at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:116)
    at org.apache.tika.fork.ForkServer.main(ForkServer.java:64)

Inspecting the jar shows that MemoryURLStreamHandler$Record exists in the tika-core jar file. When I use AutoDetectParser instead of ForkParser, I can extract metadata from the files without any problems, but I need to constrain Tika's memory usage, so I have to use ForkParser. How can I get PDF parsing to work with Tika's ForkParser?
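
The jar inspection mentioned above can also be done programmatically. What follows is only a minimal sketch using java.util.jar from the standard library; the class name JarEntryCheck and its two methods are hypothetical helpers, not part of Tika. It confirms only that the class file is packaged in the jar, and says nothing about whether the class loader in the forked JVM can actually load it.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Tika): checks whether a jar contains a given class file.
public class JarEntryCheck {

    // Converts a binary class name such as "a.b.Outer$Inner" to its jar entry path.
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar has an entry for the given class name.
    static boolean containsClass(File jar, String className) throws IOException {
        JarFile jarFile = new JarFile(jar);
        try {
            return jarFile.getEntry(entryNameFor(className)) != null;
        } finally {
            jarFile.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java JarEntryCheck <jar file> <class name>");
            return;
        }
        System.out.println(containsClass(new File(args[0]), args[1]));
    }
}
```

When passing a nested class name such as org.apache.tika.fork.MemoryURLStreamHandler$Record on a Unix shell, the argument should be wrapped in single quotes so that the shell does not expand the $ sign.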

Here's a snippet of code up to where I do the parse:

public static FileAnalysis analyze(File f) throws java.io.FileNotFoundException{

    FileInputStream     fis             = null;
    ToXMLContentHandler contentHandler  = new ToXMLContentHandler();
    Metadata            metadata        = new Metadata();
    ParseContext        context         = new ParseContext();
    ForkParser          parser          = new ForkParser();

    parser.setJavaCommand(props.getProperty("forkJavaCommand", "java") + " " +
                          props.getProperty("forkJavaMemory", "-Xmx64m"));

    parser.setPoolSize(1);

    fis = new FileInputStream(f);
    try {
        parser.parse(fis, contentHandler, metadata, context);
    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
        "CAUTION: metadata may still have useful content in it!\n" +
        "Exception:\n" + e, e);
    }

Edit #1

I tested both the Tika 1.0 and Tika 0.10 CLI apps with the “-f” option and got an IOException (Broken Pipe) when using the SoyLatte Java 6 port for Mac OS X. That port is only used on my development machine, so I ran the CLI app (both 1.0 and 0.10) on a Linux test machine with the “-f” switch as follows:

java -jar tika-app-1.0.jar -f /path/to/my/file.pdf

I was no longer receiving an exception, but I was also not getting any output. I found this odd, but thought it may still be working, just not producing any output (forever an optimist, I guess).

I unset all my environment variables in my Mac OS X terminal and ran the Tika CLI just as above with OS X’s built-in Java 6. I got the same result as on the Linux test machine: a few newlines are printed, but nothing else. I tried a jpg file instead of the pdf file, and the Tika app printed the XHTML document with the metadata as advertised! I tried a docx file next but, as with PDFs, nothing was printed.

Edit #2

I wrote a small test java program and placed it outside of the context of our application so that it’s running in a fresh environment.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public class ForkParserTest {

    public static void main(String[] args) {

        if (args.length != 1) {
            System.out.println("must be passed the file to be parsed as the first argument");
            return;
        }

        FileInputStream     fis             = null;
        ToXMLContentHandler contentHandler  = new ToXMLContentHandler();
        Metadata            metadata        = new Metadata();
        ParseContext        context         = new ParseContext();
        ForkParser          parser          = new ForkParser();

        try {
            fis = new FileInputStream(new File(args[0]));

            // Parse in a forked JVM using ForkParser's default settings
            parser.parse(fis, contentHandler, metadata, context);

            // Print the XHTML collected by the content handler
            System.out.println(contentHandler.toString());

        } catch (Exception e) {
            System.out.println("Exception caught in main");
            e.printStackTrace();
        } finally {
            // Always release the input stream, even if parsing failed
            if (fis != null) {
                try {
                    fis.close();
                } catch (IOException ignored) {
                    // nothing useful to do if close fails
                }
            }
        }
    }
}

I compiled it like so

javac -cp /path/to/tika-app-1.0.jar ForkParserTest.java

and ran like so

java -cp /path/to/tika-app-1.0.jar:${PWD} ForkParserTest /path/to/file.pdf

and tested it with a jpeg as well. It performs exactly like the Tika CLI app where it prints the XHTML document for the jpg but prints nothing for pdf or docx files.

If anyone knows how to resolve this problem, please let me know! Also, if you run this test on pdf files or docx files and actually get results to print, please also let me know how you did it.

Thanks!

I'm also fairly new to posting on Stack Overflow. If this is totally tl;dr, feedback is appreciated; please give me suggestions for how to make it more concise.

anchovie
  • Are you able to recreate this with the Tika CLI (using the --fork option), or does it only occur when using your custom code? (I've just had a quick try with the Tika CLI and can't hit your problem) – Gagravarr Dec 08 '11 at 05:31
  • @Gagravarr, I actually get a different problem when I use the TIKA CLI with the --fork (I used -f, they should be the same). I get a broken pipe IOException. What version of Tika are you using? I'm using Tika 1.0 with Java 6 on a Mac OS-X machine. When I try to extract metadata with the CLI without the -f option, the extraction works flawlessly. – anchovie Dec 09 '11 at 02:02
  • ok, just tried downloading the tika-app-1.0 binary from apache's site, tried the --fork option as well as -f, and still getting the same issue. I'm wondering if it's my environment, this is really frustrating. I'm a bit of a noob, so I'm not entirely sure what to try next. Anyway, thanks for the help so far. – anchovie Dec 09 '11 at 18:10
  • Just to confirm - do you get the text of your file when you run the TikaCLI with the --text option instead of the -f one? (Want to check it's the forking giving the issue, not just a problematic file) – Gagravarr Dec 11 '11 at 04:03
  • So, when I test this without any command line options, I get it to print out the XHTML doc like it's functioning correctly, so this really is an issue with the forkparser and specifically docx and PDF files (and maybe others). I tested the Tika cli with the -f option on a jpg and it works correctly, just not docx and PDFs. When you run a PDF through, does it work? What version of Tika are you using? – anchovie Dec 11 '11 at 16:27
  • It does look to be a bug. I've opened an issue for you for this - [TIKA-808](https://issues.apache.org/jira/browse/TIKA-808) - I'd suggest you watch that for progress – Gagravarr Dec 12 '11 at 02:26
