The requirement is to parse PDF and Word document files with Apache Tika. How can I parse only the required pages? For example, a DOC / PDF file has 10 pages, but the requirement is to parse only pages 1-3 and the last page.
-
How are you calling Apache Tika? With no code, and not even library vs cli vs server, there's very little we can do to help... – Gagravarr Sep 16 '19 at 13:29
-
I am calling it from Java to extract the content of PDF, DOC and DOCX files. – Santosh Singh Sep 20 '19 at 07:01
-
OK, and with what code? – Gagravarr Sep 20 '19 at 09:25
-
I am using Java. I am able to get the whole content, but the requirement is to parse only some pages, not all pages. – Santosh Singh Sep 23 '19 at 12:30
-
We still need to see your Java code! – Gagravarr Sep 23 '19 at 14:02
-
handler = new BodyContentHandler(-1); metadata = new Metadata(); pcontext = new ParseContext(); allParser = new AutoDetectParser(); inputstream = new FileInputStream(new File(filename)); allParser.parse(inputstream, handler, metadata, pcontext); – Santosh Singh Sep 27 '19 at 07:13
-
You need to edit this into your question. In general though, for the file formats which support it, you need to be getting the XHTML version rather than the plain text version to be able to detect page breaks – Gagravarr Sep 27 '19 at 08:31
-
Thank you for your reply. There is an option to limit the amount of data parsed, for example only 100 KB, but I am unable to parse only the first and last pages. – Santosh Singh Sep 30 '19 at 05:04
-
You can't ask for the first page. You can ask for the XHTML and then (for the file formats that support page information) split the XHTML by page – Gagravarr Sep 30 '19 at 07:49
-
My requirement is to parse DOC / DOCX and PDF files, not HTML files. – Santosh Singh Oct 01 '19 at 10:36
-
You ask Tika to convert your PDF / DOC / DOCX to XHTML, then split the pages from that – Gagravarr Oct 01 '19 at 11:10
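The approach described in the comments (take the XHTML that Tika produces, then split it by page) could be sketched roughly as below. This is only a sketch under assumptions: `PageTextCollector` and `firstPagesAndLast` are hypothetical names made up for illustration, and the handler assumes each page arrives wrapped in a `<div class="page">` element, which is what Tika's PDF parser emits. Other formats, DOC / DOCX in particular, may not carry page information at all, in which case no pages would be collected. The class itself uses only the JDK's SAX API, and `main` feeds it a hand-built XHTML string instead of a real document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of each <div class="page"> from SAX events. */
class PageTextCollector extends DefaultHandler {
    private final List<StringBuilder> pages = new ArrayList<>();
    private int depth = 0; // nesting depth inside the current page div; 0 means outside any page

    // Works whether or not the SAX source is namespace-aware.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside a page
        } else if ("div".equals(name(localName, qName)) && "page".equals(atts.getValue("class"))) {
            pages.add(new StringBuilder()); // a new page starts here
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return;
        }
        if ("p".equals(name(localName, qName))) {
            pages.get(pages.size() - 1).append('\n'); // keep paragraphs on separate lines
        }
        depth--; // reaches 0 when the page div itself closes
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            pages.get(pages.size() - 1).append(ch, start, length);
        }
    }

    /** Text of every page seen so far, in document order, trimmed. */
    public List<String> getPages() {
        List<String> result = new ArrayList<>();
        for (StringBuilder sb : pages) {
            result.add(sb.toString().trim());
        }
        return result;
    }

    /** The first firstCount pages plus the last page, without duplicates. */
    public List<String> firstPagesAndLast(int firstCount) {
        List<String> all = getPages();
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < all.size(); i++) {
            if (i < firstCount || i == all.size() - 1) {
                picked.add(all.get(i));
            }
        }
        return picked;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for Tika's XHTML output: ten pages, one paragraph each.
        StringBuilder xhtml = new StringBuilder("<html><body>");
        for (int i = 1; i <= 10; i++) {
            xhtml.append("<div class=\"page\"><p>Text of page ").append(i).append("</p></div>");
        }
        xhtml.append("</body></html>");

        PageTextCollector collector = new PageTextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.toString().getBytes(StandardCharsets.UTF_8)),
                collector);
        for (String page : collector.firstPagesAndLast(3)) {
            System.out.println(page);
        }
    }
}
```

With Tika on the classpath, the same handler could presumably replace the `BodyContentHandler` from the earlier comment, i.e. `allParser.parse(inputstream, collector, metadata, pcontext)` followed by `collector.firstPagesAndLast(3)`. Note that the whole document is still parsed; only the selection of pages happens afterwards.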