I'm using Apache Tika 1.0. Whenever I parse PDF files with ForkParser, I get the following NoClassDefFoundError:
java.lang.NoClassDefFoundError: org/apache/tika/fork/MemoryURLStreamHandler$Record
at org.apache.tika.fork.MemoryURLStreamHandler.createURL(MemoryURLStreamHandler.java:46)
at org.apache.tika.fork.ClassLoaderProxy.findResource(ClassLoaderProxy.java:73)
at java.lang.ClassLoader.getResource(ClassLoader.java:977)
at org.apache.log4j.helpers.Loader.getResource(Loader.java:96)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:105)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:289)
at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:109)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.commons.logging.impl.LogFactoryImpl.createLogFromClass(LogFactoryImpl.java:1116)
at org.apache.commons.logging.impl.LogFactoryImpl.discoverLogImplementation(LogFactoryImpl.java:914)
at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:604)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:336)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:310)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.apache.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:58)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1087)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:80)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tika.fork.ForkServer.call(ForkServer.java:136)
at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:116)
at org.apache.tika.fork.ForkServer.main(ForkServer.java:64)
Inspecting the jar shows that MemoryURLStreamHandler$Record exists in the tika-core jar file. When I use AutoDetectParser instead of ForkParser, I can extract metadata from the files without any problems, but I need to constrain Tika's memory usage, so I have to use ForkParser. How can I get PDF parsing to work with Tika's ForkParser?
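For anyone who wants to repeat the jar check mentioned above, here is a minimal sketch that looks up a class entry in a jar using only the standard java.util.jar API. The class name JarEntryCheck and the helper hasEntry are hypothetical names made up for this illustration, and the jar path is whatever you pass on the command line. Note that this only shows the class file is packaged in the jar; it says nothing about whether the forked JVM's class loader can see it.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for illustration; not part of Tika.
public class JarEntryCheck {

    /** Returns true if the jar at jarPath contains an entry with the given name. */
    public static boolean hasEntry(String jarPath, String entryName) throws IOException {
        JarFile jar = new JarFile(jarPath);
        try {
            // getEntry returns null when the entry is absent
            return jar.getEntry(entryName) != null;
        } finally {
            jar.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java JarEntryCheck /path/to/some.jar");
            return;
        }
        // Entry name derived from the class named in the stack trace
        String entry = "org/apache/tika/fork/MemoryURLStreamHandler$Record.class";
        System.out.println(entry + " present: " + hasEntry(args[0], entry));
    }
}
```

Run it with the path to the tika-core or tika-app jar as the only argument.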
Here's a snippet of code up to where I do the parse:
public static FileAnalysis analyze(File f) throws java.io.FileNotFoundException {
    FileInputStream fis = null;
    ToXMLContentHandler contentHandler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ForkParser parser = new ForkParser();
    parser.setJavaCommand(props.getProperty("forkJavaCommand", "java") + " " +
            props.getProperty("forkJavaMemory", "-Xmx64m"));
    parser.setPoolSize(1);
    fis = new FileInputStream(f);
    try {
        parser.parse(fis, contentHandler, metadata, context);
    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
                "CAUTION: metadata may still have useful content in it!\n" +
                "Exception:\n" + e, e);
    }
Edit #1
I tested the CLI app of both Tika 1.0 and Tika 0.10 with the “-f” option and got an IOException (Broken Pipe) when running on the SoyLatte Java 6 port for Mac OS X. That port only runs on my development machine, so I ran the CLI app (both 1.0 and 0.10) on a Linux test machine with the “-f” switch as follows:
java -jar tika-app-1.0.jar -f /path/to/my/file.pdf
I no longer got an exception, but I also got no output. I found this odd, but thought it might still be working and just not producing any output (forever an optimist, I guess).
I then unset all the environment variables in my Mac OS X terminal and ran the Tika CLI exactly as above with OS X’s built-in Java 6. I got the same result as on the Linux test machine: a few newlines are printed, but nothing else. I tried a jpg file instead of the pdf file, and the Tika app printed the XHTML document with the metadata as advertised! I tried a docx file next, but as with the pdf, nothing was printed.
Edit #2
I wrote a small test Java program and ran it outside the context of our application so that it runs in a fresh environment.
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.ToXMLContentHandler;
import org.apache.tika.Tika;

public class ForkParserTest {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("must be passed the file to be parsed as the first argument");
            return;
        }
        try {
            File f = new File(args[0]);
            FileInputStream fis = null;
            ToXMLContentHandler contentHandler = new ToXMLContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            ForkParser parser = new ForkParser();
            fis = new FileInputStream(f);
            parser.parse(fis, contentHandler, metadata, context);
            System.out.println(contentHandler.toString());
        } catch (Exception e) {
            System.out.println("Exception caught in main");
            e.printStackTrace();
        }
        return;
    }
}
I compiled it like so
javac -cp /path/to/tika-app-1.0.jar ForkParserTest.java
and ran like so
java -cp /path/to/tika-app-1.0.jar:${PWD} ForkParserTest /path/to/file.pdf
I also tested it with a jpeg. It behaves exactly like the Tika CLI app: it prints the XHTML document for the jpg but prints nothing for pdf or docx files.
If anyone knows how to resolve this problem, please let me know! Also, if you run this test on pdf or docx files and actually get output, please let me know how you did it.
Thanks!
I'm also fairly new to posting on Stack Overflow. If this is totally tl;dr, feedback is appreciated; please suggest how I could make this more concise.