
If I have the following code to read the number of paragraphs (Office.PARAGRAPH_COUNT) from a PDF:

TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));

ContentHandler handler = new DefaultHandler(); // org.xml.sax.helpers.DefaultHandler: ignores body text, only the metadata is needed
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();

Parser parser = TikaConfig.getDefaultConfig().getParser();

parser.parse(pdfStream, handler, pdfMeta, pc);

Integer pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT); // Integer, not int: null if the field is absent

Does Tika obtain pdfParagraphCount:

  • Simply by querying the PDF's metadata for the count; or
  • By applying some "paragraph counting" algorithm in the parser as it reads the entire PDF?

If the former is the case, is the metadata field holding the count writeable? Meaning, could it be wrong? Could any joker with iText or PDFBox manipulate the field and make it incorrect?

Is there any way to get Tika to count the paragraphs (correctly, by applying some algorithm or strategy) as it's reading the PDF file?

Essentially, I need the number of paragraphs in a PDF, and I need it to be dead accurate, with no chance of a corrupted/incorrect, writeable metadata field (as I do not produce the original PDF myself). Thanks in advance.

  • If you really *need the number of paragraphs in a PDF, and it to be dead accurate,* you're out of luck. In general PDFs contain no such information explicitly, and trying to calculate that number from the content description can be done using heuristics only, and heuristics have a tendency to fail every once in a while, sometimes miserably so. – mkl Feb 21 '13 at 11:32
  • If your documents are generated by a given process, though, that process just *may* include such metadata in a proprietary way. In that case you may want to analyze your PDFs and search for such data. Obviously, though, that metadata can easily be manipulated. And even worse (as this does not require malevolent intent), that metadata may remain the same while the actual page content is changed. – mkl Feb 21 '13 at 11:44

1 Answer


Tika gives you back the metadata from the document itself. It doesn't compute any metadata; all you get is what is there. (Tika will sometimes do a bit of work to normalise things between file formats so that the metadata is consistent across different document types, but that's mostly just mapping onto standard metadata schemes.)

You're also a bit out of luck, though: "I need it to be dead accurate" is going to be an issue with a file format like PDF. PDF is not a line/paragraph-based file format. Sure, you can generate a PDF where everything is relatively positioned in lines and paragraphs, but you can also build one where each character is placed absolutely on the page, one at a time. Tika (and Apache PDFBox underneath) will do its best to turn that back into helpful blocks of text, but if someone really wanted to mess with you, they could generate a PDF that is largely impossible to classify into paragraphs automatically.
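To make the "do its best" point concrete, here is a minimal sketch of the kind of heuristic you would end up writing yourself. It is not something Tika or PDFBox provides: `ParagraphHeuristic` and `countParagraphs` are hypothetical names, and the code assumes you already have the extracted text as a `String` and that blank lines separate paragraphs, which many PDFs will not honour.

```java
import java.util.regex.Pattern;

/** Heuristic paragraph counter for text that has already been extracted from a PDF. */
final class ParagraphHeuristic {

    // Assumption: a paragraph boundary is one or more blank (whitespace-only) lines.
    private static final Pattern BLANK_LINES = Pattern.compile("\\n(?:[ \\t]*\\n)+");

    static int countParagraphs(String extractedText) {
        if (extractedText == null) {
            return 0;
        }
        // Normalise CRLF and lone CR to LF so a single Windows line break
        // is never mistaken for a blank line.
        String text = extractedText.replace("\r\n", "\n").replace('\r', '\n').trim();
        if (text.isEmpty()) {
            return 0;
        }
        // Every run of blank lines separates two paragraphs.
        return BLANK_LINES.split(text).length;
    }

    public static void main(String[] args) {
        String sample = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph.\n \nThird.";
        System.out.println(countParagraphs(sample)); // prints 3
    }
}
```

The weakness is easy to see: if the extracted text has a line break after every visual line and no blank lines, the whole document counts as one paragraph, and if the extractor emits a blank line between every line, each line counts as its own paragraph. Treat the result as an estimate, not a dead-accurate count.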

Gagravarr
  • As *Tika gives you back the metadata from the document itself* and seemingly (at least the question seems to indicate that) returns paragraph count data for PDFs even though PDFs do not generally have such metadata, do you have any idea how Tika derives this count? – mkl Feb 21 '13 at 14:11
  • Tika calls down to PDFBox, fetches the metadata, and maps it onto the standard metadata namespaces that Tika uses. The `extractMetadata` method in [PDFParser](http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java) is where the magic happens, if you want to take a look! – Gagravarr Feb 21 '13 at 15:51
  • @Gagravarr Can I also expect to find the `WORD_COUNT` metadata of a PDF, even if it is normally not part of PDF metadata? I.e., can I expect Tika to do the work and find the word count? (I checked in the [PDFParser](https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java) `extractMetadata` method and couldn't find any code that calculates PARAGRAPH_COUNT or WORD_COUNT). – theyuv Mar 26 '18 at 13:33
  • @theyuv On the whole, Tika doesn't calculate metadata, it just gives you the metadata that the file format contains. For word count specifically, you'll only get it if the file format has it – Gagravarr Mar 26 '18 at 16:37
  • @Gagravarr Thanks. I see, I thought from your description that Tika does some "work" in order to make sure that all files have some standard metadata. So if I want the word count of things like PDFs or image files, I would have to code that myself? – theyuv Mar 26 '18 at 17:48
  • @theyuv Tika does work to translate metadata to common namings and formats, not to calculate it – Gagravarr Mar 27 '18 at 07:30
  • @Gagravarr Is there something that we should use instead of `Metadata.WORD_COUNT` to obtain that property? Because I see that it's been [deprecated](https://tika.apache.org/1.17/api/) – theyuv Apr 02 '18 at 17:14
  • I think [Office.WORD_COUNT](https://tika.apache.org/1.17/api/org/apache/tika/metadata/Office.html#WORD_COUNT) is the one you'll want for 2.0 compatibility, though it isn't quite final yet what we are/aren't changing in 2 – Gagravarr Apr 02 '18 at 20:27
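
Since the comments above conclude that a word count has to be coded by hand when the file does not carry one, here is a minimal sketch of that step using only the JDK. `WordCountSketch` and `countWords` are hypothetical names, not Tika API, and the code assumes the document's text has already been extracted into a `String`. What counts as a "word" is itself a choice: this version counts every `BreakIterator` segment that contains at least one letter or digit.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Counts words in already-extracted text; not part of Tika. */
final class WordCountSketch {

    static int countWords(String text, Locale locale) {
        if (text == null || text.isEmpty()) {
            return 0;
        }
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            // Segments made only of spaces or punctuation are not words.
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

For example, `WordCountSketch.countWords("Hello, world! Two PDFs.", Locale.ROOT)` returns 4. The number is only as good as the text extraction feeding it, so it can differ from any word count stored in the file's own metadata.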