Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Question

I am using Solr 4.0 and DIH (Data Import Handler) with TikaProcessor to extract text from PDF files stored in a database. When I run indexing, it fails to parse some PDF files with the stack trace shown below.

Since Solr 4.0 uses Tika 1.2, I wrote a unit test that parses the same PDF file with the Tika 1.2 API, and I got the same error.

The same problem occurs with the Tika 1.3 jars. With the Tika 1.1 jars, however, it works fine. Please let me know if any of you have seen this error and how to fix it.

(I have posted the same question on the Tika mailing list, but without much luck.)

When I open the PDF file, the viewer shows it in PDF/A mode. I am not sure whether this is related to the problem.

Here is the exception:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
      at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
      at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 3 more

Here is the Java code snippet:

 String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
 File file = new File(fileString);
 URL url = file.toURI().toURL();

 ParseContext context = new ParseContext();
 Detector detector = new DefaultDetector();
 Parser parser = new AutoDetectParser(detector);
 Metadata metadata = new Metadata();
 // Register the parser in the context so embedded documents are parsed as well
 context.set(Parser.class, parser);
 ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
 InputStream input = TikaInputStream.get(url, metadata);
 ContentHandler handler = new BodyContentHandler(outputstream);
 try {
     parser.parse(input, handler, metadata, context);
 } finally {
     // Close both streams even when parse() throws
     input.close();
     outputstream.close();
 }

Have you tried with Tika 1.3? (There have been bug fixes since 1.2) — Gagravarr, Feb 13 '13 at 10:12
Looks like a PDFBox bug. Can you try dropping in the latest PDFBox jar to Tika and see if that fixes it? Otherwise you'll need to report a bug to the PDFBox project — Gagravarr, Feb 13 '13 at 13:29
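
When swapping in a different PDFBox jar as suggested above, it helps to confirm which jar the classes are actually loaded from, since an older copy elsewhere on the classpath (for example inside the Solr webapp) can shadow the new one. The following is only a minimal sketch using plain JDK reflection. The class name `JarLocator` and the method `describe` are made up for illustration; the two class names passed in `main` are taken from the stack trace. The reported version comes from the jar manifest and may be `null` if the manifest does not declare one.

```java
import java.security.CodeSource;

class JarLocator {

    /** Describes where the named class was loaded from (hypothetical helper). */
    static String describe(String className) {
        try {
            // Load without initializing, so no static initializers run
            Class<?> cls = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            Package pkg = cls.getPackage();
            // Implementation-Version from the jar manifest, if the jar declares one
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            // Classes from the JDK itself usually have no code source
            String location = (source == null || source.getLocation() == null)
                    ? "(bootstrap/unknown)"
                    : source.getLocation().toString();
            return className + " -> " + location + " (version: " + version + ")";
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the question
        System.out.println(describe("org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink"));
        System.out.println(describe("org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

Run it with the same classpath as the failing unit test. If the printed PDFBox location still points at the old jar, the replacement has not taken effect.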

