
The response in this topic helped me understand why searching my PDF sometimes fails to find a word, and why I keep getting different word counts from different PDF word-count programs. I decided to use xpdf: I converted the PDF to text with the -layout flag and opened the resulting text file in Word 2003, noting the word count. Then I decided, unfortunately, to remove the -layout flag and convert again. This time, though, the word count was different.

Why did that flag affect the word count? Is there an accurate way to find the word count of a PDF file? I would even pay for such software if I had to, as long as it gives me the right number of words.

(I checked another topic but thought I'd ask whether the approach I just described would solve everything. There was also a topic where advancedpdf was recommended.)

  • imploring is not exactly the right approach :) – vulkanino Mar 01 '12 at 14:33
  • PDFs aren't designed to be machine-readable. Either go with some OCR solution with manual corrections or hire people to count the words for you, whatever's cheaper. – Kos Mar 01 '12 at 14:56
  • I thought you guys would tell me that the information the user posted in the old topic was correct and that I should stand by it. What I understood from that post was that the words were counted including the words that were split into pieces. Well, I think I'll stick with this one nonetheless. Thank you! – user1242840 Mar 01 '12 at 16:35

2 Answers


I'd like to argue that there is no reliable way to count words in a PDF. One could, for example, just to make your life harder, put each character of this lovely Stack Overflow answer into its own text object and position those objects so that only when rendered do they form a paragraph that is meaningful to humans. Like this:

<html><body><style>
div { float: left; }
</style>
<div><p>S</p></div><div><p>t</p></div><div><p>a</p></div>
<div><p>c</p></div><div><p>k</p></div>
</body></html>
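If a text extractor ends up emitting each of those glyphs as a separate token, a naive whitespace-based count gets inflated accordingly; a quick sketch of the effect, just for illustration:

    public class GlyphCountDemo {
        public static void main(String[] args) {
            String perGlyph = "S t a c k"; // how a per-glyph PDF might come out of extraction
            String joined = "Stack";       // what a human actually reads
            System.out.println(perGlyph.split("\\s+").length); // prints 5
            System.out.println(joined.split("\\s+").length);   // prints 1
        }
    }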
jørgensen
  • Thank you for replying. I wouldn't argue with you at this point :D – user1242840 Mar 01 '12 at 16:36
  • That's an easy one; I've seen PDFs where a few characters were placed at the top of the page, then a few more down the left margin, then the rest of the text at the top of the page was placed just after the earlier characters, then some characters in the second column, then a few more after the earlier characters in the first margin, then the top of the page is erased and different characters drawn there, then more characters appended to the second column, then a few graphics strokes, and so on. Madness? THIS! IS! ADOBE!! – Dour High Arch Mar 14 '12 at 19:49

I would suggest an open source solution using Java. First you would have to parse the PDF file and extract all the text using Apache Tika.
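For the extraction step, a minimal sketch could look like the following (assuming Apache Tika is on the classpath; the file name sample.pdf is just a placeholder):

    import java.io.File;
    import org.apache.tika.Tika;

    public class PdfTextExtractor {
        public static void main(String[] args) throws Exception {
            // Tika detects the file type and delegates to the matching parser (PDFBox for PDFs)
            Tika tika = new Tika();
            String text = tika.parseToString(new File("sample.pdf")); // placeholder file name
            System.out.println(text);
        }
    }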

Then I believe you can achieve this simply by scanning the extracted text and counting the words.

Sample code would look like this:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    File f = new File("extracted.txt"); // the text file produced by the extraction step (placeholder name)
    if (f.getName().endsWith(".txt")) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            String s;
            while ((s = in.readLine()) != null) {
                sb.append(s).append(' '); // keep a separator so line breaks don't glue two words together
            }
        }
        // strip punctuation (but not whitespace), then split on non-word characters to get individual terms
        String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+");
    }

The tokenizedTerms array will then hold all the terms (words) of the document, and you can count them by reading tokenizedTerms.length. Hope this was useful. :-)
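Putting the two steps together, a rough end-to-end sketch might look like this (again assuming Apache Tika on the classpath and a placeholder file name):

    import java.io.File;
    import org.apache.tika.Tika;

    public class PdfWordCount {
        public static void main(String[] args) throws Exception {
            String text = new Tika().parseToString(new File("sample.pdf")); // placeholder file name
            String trimmed = text.trim();
            // split on runs of whitespace; an empty document yields zero words
            int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            System.out.println("Approximate word count: " + count);
        }
    }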

yeaaaahhhh..hamf hamf