Extract text with style and formatting from a PDF using Apache Tika

Asked Feb 16 '15 at 14:10

Active Feb 16 '15 at 14:10

Viewed 1,997 times

I have a PDF file that contains section headings and their details. Using Apache Tika, how do I extract the text together with its style and formatting?

Suresh Gorakala

How are you calling Apache Tika? And if it isn't a way that returns the XHTML output, what happens when you switch to one? – Gagravarr Feb 16 '15 at 23:30
See also the [Apache Tika examples on picking your output format](http://tika.apache.org/1.7/examples.html#Picking_different_output_formats) – Gagravarr Feb 16 '15 at 23:31
I used the "Parsing to XHTML" code snippet from the link above and got this exception: "Exception in thread "main" org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared" – Suresh Gorakala Feb 17 '15 at 11:09
Are you using the latest version of Apache Tika? And have you made sure you have all of the Tika dependencies available? (That example has unit tests which pass just fine, so the problem would seem to be at your end) – Gagravarr Feb 17 '15 at 12:38
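
To illustrate what "XHTML output" gives you to work with: Tika parsers report their XHTML as SAX events to an `org.xml.sax.ContentHandler`, so structure can be collected by a handler of your own. The following is a minimal sketch using only JDK classes. The class name `XhtmlStructureHandler`, the helper `parse`, and the sample XHTML string are all hypothetical, and the sample is not real Tika output; nothing in this thread confirms which elements or style attributes Tika emits for a given PDF.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each heading/paragraph element with its text.
public class XhtmlStructureHandler extends DefaultHandler {
    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String current; // local name of the block being read, or null

    // Assumption: only <p> and <h1>..<h6> are treated as blocks of interest.
    private static boolean isBlock(String name) {
        return name.equals("p") || name.matches("h[1-6]");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isBlock(localName)) {
            current = localName;
            text.setLength(0);
            // Any attributes on the element (e.g. class) are available via atts here.
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (localName.equals(current)) {
            blocks.add(current + ": " + text.toString().trim());
            current = null;
        }
    }

    public List<String> getBlocks() {
        return blocks;
    }

    // Feeds an XHTML string through the JDK SAX parser into the handler.
    public static List<String> parse(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is "p", "h1", ...
        XMLReader reader = factory.newSAXParser().getXMLReader();
        XhtmlStructureHandler handler = new XhtmlStructureHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.getBlocks();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input, not actual Tika output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Introduction</h1><p>First paragraph.</p></body></html>";
        for (String block : parse(sample)) {
            System.out.println(block);
        }
    }
}
```

Because the class is an ordinary `ContentHandler`, an instance could in principle be passed as the handler argument when calling a Tika parser instead of being fed from a string, though that has not been tried against the PDF in question.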

0 Answers