How would you get the count of a given word in a given PDF?

Question

Interview Question

I was asked this question in an interview. The answer doesn't have to be specific to any programming language, platform, or tool.

The question was phrased as follows:

How would you get the instance count of a given word in a PDF. The answer doesn't have to be programming, platform, or tool specific. Just let me know how would you do it in a memory and speed efficient way

I am posting this question for the following reasons:

To better understand the context - I still fail to understand the context of this question, what might the interviewer be looking for by asking this question?
To get diverse opinions - I tend to answer such questions based on my skills in one programming language (C#), but there might be other valid ways to get this done.

Thanks for your interest.

MK. · Accepted Answer · 2012-01-24T04:08:15.433

If I had to write a program to do it, I'd find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words. If this were a one-off task, or something that needed to be automated without production-quality requirements, I'd just feed the file into the pdftotext program and then parse the output file with Python: split it into words, put them in a dictionary, and count the number of occurrences.
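The one-off approach can be sketched roughly as follows, in Java to match the code further down this thread rather than Python. It assumes the text has already been extracted (for example by pdftotext) and is available through a `Reader`. The class name `StreamingWordCounter`, the case-insensitive matching, and the definition of a word as a run of letters or digits are all assumptions made for illustration. Because only one word is wanted, it keeps a single counter and the current token in memory instead of a dictionary of every word.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Locale;

final class StreamingWordCounter {

    // Counts case-insensitive occurrences of `target` in the text read from `in`.
    // Assumption: a "word" is a maximal run of letters or digits. Hyphens,
    // apostrophes and scripts written without spaces would need other rules.
    static long countOccurrences(Reader in, String target) {
        String wanted = target.toLowerCase(Locale.ROOT);
        StringBuilder current = new StringBuilder(); // the token being read
        long count = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else {
                    // A separator ends the current token; compare, then reset.
                    if (current.length() > 0 && wanted.contentEquals(current)) {
                        count++;
                    }
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The input may end in the middle of a token.
        if (current.length() > 0 && wanted.contentEquals(current)) {
            count++;
        }
        return count;
    }
}
```

To use it, run pdftotext on the PDF first and pass a reader over the resulting text file together with the word to look for. Memory use stays constant regardless of file size, at the cost of answering only one query per pass.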

If I was asking this interviewing question, I'd be looking for a couple of things:

understanding the difference between the settings for this task: a one-off script vs. production code
not attempting to implement a PDF renderer yourself, and trying to find a library instead.

Now, I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what PDF is and what a "word" is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Oftentimes a word will be split into a couple of completely separate strings, which are absolutely positioned in the document so that together they look like a single word. This is why you sometimes get strange-looking results when searching for words in a PDF document. So to implement word searching in a document you'd have to glue these strings back together (pdftotext takes care of that for you).
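To make the gluing step concrete, here is a deliberately simplified sketch. The `Fragment` type, its fields, and the `maxGap` threshold are hypothetical and do not come from any PDF library. It also assumes that fragments on the same line share exactly the same baseline `y`, and that `y` grows upward as in PDF page coordinates. Real extractors such as pdftotext deal with many more cases (rotation, columns, font metrics).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FragmentGluer {

    // Hypothetical positioned text run: its text, left edge, right edge, baseline.
    static final class Fragment {
        final String text;
        final double x, endX, y;

        Fragment(String text, double x, double endX, double y) {
            this.text = text;
            this.x = x;
            this.endX = endX;
            this.y = y;
        }
    }

    // Rebuilds lines of text from fragments. Fragments on the same baseline are
    // joined directly when the horizontal gap is at most maxGap, otherwise a
    // space is inserted between them. A change of baseline starts a new line.
    static List<String> glue(List<Fragment> fragments, double maxGap) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Reading order: top of the page first (larger y), then left to right.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));

        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : sorted) {
            if (prev != null && prev.y != f.y) {
                lines.add(line.toString()); // baseline changed: finish the line
                line.setLength(0);
            } else if (prev != null && f.x - prev.endX > maxGap) {
                line.append(' ');           // wide gap on the same line: word break
            }
            line.append(f.text);
            prev = f;
        }
        if (prev != null) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

With this, a word stored as the two strings "Hel" and "lo" placed right next to each other comes back as "Hello", and only then does counting words give sensible results.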

It's not a bad question at all.

I like this question because it goes beyond just assessing whether a candidate can code up a word-count algorithm. It makes the candidate demonstrate how he would go about getting real-world work done, and whether he's thoughtful enough to ask the interviewer smart clarifying questions. If I were the interviewer, I might drill in on the implementation of the dictionary (hash, trie, etc.), but also throw curve balls back at the candidate about some of his other decisions to see how he reacts (e.g. "the PDF file is a book written in Chinese; how does that impact your code?"). – selbie, Jan 24 '12 at 06:24
@selbie: Thanks for adding complexity! :) Diverse opinions are what I am looking for! – Manish Basantani, Jan 24 '12 at 16:11

score 2 · Answer 2 · edited Jan 24 '12 at 06:17

You can use a trie. Once the words of the document have been inserted into it, getting the count of a given word is very easy.
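A minimal sketch of that idea, assuming the words have already been extracted from the PDF. The class name `WordTrie` and its two methods are made up for this example. Each node stores how many inserted words end there, so a lookup walks one node per character of the query.

```java
import java.util.HashMap;
import java.util.Map;

final class WordTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // number of inserted words that end exactly at this node
    }

    private final Node root = new Node();

    // Inserts one occurrence of the word, creating nodes along the path as needed.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        node.count++;
    }

    // Returns how many times the word was added, or 0 if it never was.
    int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

A trie shares storage between words with common prefixes and suits many lookups against the same document. For a single word asked once, a plain counter during one pass over the text is likely simpler and lighter.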

answered Jan 24 '12 at 06:08 by Sandeep · edited Jan 24 '12 at 06:17 by akjoshi

You mean "Trie", not "Tire". The latter goes on a car. ;) – selbie Jan 24 '12 at 06:10

score 0 · Answer 3 · edited Nov 22 '15 at 16:24

I would suggest an open-source solution using Java. First you would have to parse the PDF file and extract all the text using Tika.

Then I believe the real question is how to find the TF (term frequency) of a word in a text. I will not trouble you with definitions, because you can achieve this simply by scanning the extracted text and counting how often each word occurs.

Sample code would look like this:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Scanner;

    public class WordCount {
        // Counts how often each whitespace-separated token occurs in the extracted text.
        static Map<String, Integer> countWords(String text) {
            Map<String, Integer> listOfWords = new HashMap<>();
            try (Scanner scan = new Scanner(text)) {
                while (scan.hasNext()) {
                    String word = scan.next();
                    // merge() stores 1 on the first occurrence of a word and
                    // otherwise adds 1 to its current count. No remove() is
                    // needed: the new value replaces the old one for that key.
                    listOfWords.merge(word, 1, Integer::sum);
                }
            }
            return listOfWords;
        }
    }
