If I have the following code to read the number of paragraphs (Office.PARAGRAPH_COUNT) from a PDF:
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));
ContentHandler handler = new DefaultHandler(); // org.xml.sax.helpers.DefaultHandler
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();
Parser parser = TikaConfig.getDefaultConfig().getParser();
parser.parse(pdfStream, handler, pdfMeta, pc);
int pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);
Does Tika obtain pdfParagraphCount:
- Simply by querying the PDF's metadata for the count; or
- By applying some "paragraph counting" algorithm as the parser reads the entire PDF?
If the former is the case, is the metadata field holding the count writable? Meaning, could it be wrong? Could any joker with iText or PDFBox manipulate the field and make it incorrect?
Is there any way to get Tika to count the paragraphs (correctly, by applying some algorithm or strategy) as it's reading the PDF file?
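To make the second option concrete, here is a rough sketch of the kind of parse-time counting I have in mind. It assumes that Tika's PDF parser emits one XHTML p element per detected paragraph in the SAX events it sends to the handler; I have not confirmed how reliable that paragraph detection is. The class name ParagraphCountingHandler is my own invention, not part of Tika, and the handler uses only the JDK's org.xml.sax classes.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not a Tika class): counts <p> elements that
 * contain at least one non-whitespace character.
 * Assumption: the parser emits one <p> per paragraph it detects.
 */
class ParagraphCountingHandler extends DefaultHandler {
    private int paragraphCount = 0;
    private boolean insideParagraph = false;
    private boolean sawText = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            insideParagraph = true;
            sawText = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!insideParagraph || sawText) {
            return;
        }
        // Mark the paragraph as non-empty on the first non-whitespace character.
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                sawText = true;
                return;
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isParagraph(localName, qName)) {
            if (insideParagraph && sawText) {
                paragraphCount++;
            }
            insideParagraph = false;
        }
    }

    // Accept either the namespace-aware local name or the raw qualified name.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    public int getParagraphCount() {
        return paragraphCount;
    }
}
```

The idea would be to pass an instance of this as the handler argument to parser.parse(pdfStream, handler, pdfMeta, pc) and read getParagraphCount() afterwards, instead of relying on the metadata field. The result would only be as accurate as the parser's own paragraph detection.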
Essentially, I need the number of paragraphs in a PDF, and I need it to be dead accurate, with no chance of a corrupted or incorrect writable metadata field (as I do not produce the original PDF myself). Thanks in advance.