Getting paragraph count from Tika for both Word and PDF

Question

I have a scenario where I need to reconcile two documents: a Word (.docx) document and a PDF. The two are supposed to be "identical" to each other (the PDF is just a PDF version of the DOCX file), meaning they should contain the same text, content, etc.

Specifically, I need to make sure that both documents contain the same number of paragraphs. So I need to read the DOCX, get the paragraph count, then read the PDF and grab its paragraph count. If both numbers are the same, then I'm in business.

It looks like Apache Tika (I'm interested in 1.3) is the right tool for the job here. I see in this source file that Tika supports the notion of paragraph counting, but I'm trying to figure out how to get the count from both documents. Here's my best attempt, but I'm choking on connecting some of the final dots:

InputStream docxStream = new FileInputStream("some-doc.docx");
InputStream pdfStream = new FileInputStream("some-doc.pdf");

ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
Parser parser = new OfficeParser();
ParseContext pc = new ParseContext();

parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);

docxStream.close();
pdfStream.close();

int docxParagraphCount = docxMeta.getXXX(???);
int pdfParagraphCount = pdfMeta.getXXX(???);

if(docxParagraphCount == pdfParagraphCount)
    setInBusiness(myself, true);

So I ask: have I set this up correctly or am I way off base? If off-base, please lend me some help to get me back on track. And if I have set things up correctly, then how do I get the desired counts out of the two Metadata instances? Thanks in advance.

score 1 · Accepted Answer · answered Feb 20 '13 at 22:45

First up, Tika will only give you back the metadata contained within your documents. It won't compute anything for you. So, if one of your documents lacks the paragraph count metadata, you're out of luck. If one of your documents has duff data (i.e. the program that wrote the file out got it wrong), you'll be out of luck.

Otherwise, your code is nearly there, but not quite. You most likely want to use DefaultParser or AutoDetectParser - OfficeParser is for the Microsoft file formats only, while the others automatically load all the available parsers and pick the correct one.

The property you want is PARAGRAPH_COUNT, which comes from the Office metadata namespace. Your code would be something like:

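import java.io.File;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.Office;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

TikaInputStream docxStream = TikaInputStream.get(new File("some-doc.docx"));
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));

// No-op SAX handler: only the metadata is needed here, not the text
ContentHandler handler = new DefaultHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();

// AutoDetectParser detects each file's type and delegates to the matching parser
Parser parser = new AutoDetectParser();

parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();

// getInt returns null when the property is absent, so keep the boxed type
Integer docxParagraphCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
Integer pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);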
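
Since either document may lack the paragraph count metadata, the comparison is safer if a missing value is handled explicitly. The following is a minimal sketch, assuming the counts are held as boxed Integer values that may be null when the property is absent; ParagraphCountCheck and countsMatch are hypothetical names, not part of Tika:

```java
/** Hypothetical helper for comparing two paragraph counts that may be missing. */
final class ParagraphCountCheck {

    private ParagraphCountCheck() {
    }

    /**
     * Returns true only when both counts are present and equal.
     * A missing (null) count never matches, because nothing can be verified.
     */
    static boolean countsMatch(Integer docxCount, Integer pdfCount) {
        if (docxCount == null || pdfCount == null) {
            return false;
        }
        // Compare by value: == on two Integer objects compares references
        return docxCount.intValue() == pdfCount.intValue();
    }

    public static void main(String[] args) {
        System.out.println(countsMatch(12, 12));
        System.out.println(countsMatch(null, 12));
    }
}
```

Treating a missing count as "no match" rather than throwing is a design choice; a caller who needs to tell "different" apart from "unknown" would want a richer return type.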

If you don't care about the text at all, only the metadata, pass in a dummy content handler such as org.xml.sax.helpers.DefaultHandler, which ignores every content event.
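
Conversely, when a document carries no paragraph count metadata, the extracted text is the only thing left to work from. A rough fallback, which is not something Tika provides, is to count paragraphs yourself. The sketch below assumes the text has already been extracted into a String and that paragraphs are separated by blank lines; ParagraphCounter is a hypothetical helper name. PDF text extraction may not preserve paragraph breaks the way the DOCX does, so a mismatch from this approach is a hint rather than proof.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: counts paragraphs in plain text separated by blank lines. */
final class ParagraphCounter {

    // Two or more line breaks, each optionally followed by spaces or tabs,
    // i.e. at least one blank (or whitespace-only) line between paragraphs.
    private static final Pattern BLANK_LINES =
            Pattern.compile("(?:\\r?\\n[ \\t]*){2,}");

    private ParagraphCounter() {
    }

    /** Returns the number of blank-line-separated paragraphs; 0 for null or empty text. */
    static int count(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return BLANK_LINES.split(trimmed).length;
    }

    public static void main(String[] args) {
        System.out.println(count("One.\n\nTwo.\n\nThree."));
    }
}
```

A single line break inside a paragraph is not treated as a separator, so hard-wrapped text still counts as one paragraph per blank-line-delimited block.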

Awesome - thanks @Gagravarr (+1) - quick question - in your answer, you cite `Office.PARAGRAPH_COUNT`. Should the 2nd one be something like `Pdf.PARAGRAPH_COUNT`, or does `Office.PARAGRAPH_COUNT` work for both document types? If so, how/why? I guess I just envisioned the PDF type defining its own properties, instead of `Office.PARAGRAPH_COUNT` applying to all documents. For instance, how do I know which document types `Office.PARAGRAPH_COUNT` **doesn't** work for? Etc. Thanks again! — , Feb 20 '13 at 23:04
The metadata namespace covering paragraph counts is defined in the [Office class](http://tika.apache.org/1.3/api/org/apache/tika/metadata/Office.html). One of the things Tika does is to normalise metadata onto a common set of definitions, so you don't need to know the idiosyncrasies of each format — Gagravarr, Feb 21 '13 at 09:57
