
I have a couple of hundred newspapers in PDF format and a list of keywords. My ultimate goal is to get the number of articles mentioning a specific keyword, keeping in mind that one PDF might contain multiple articles mentioning the same keyword.

My problem is that when I converted the PDF files to plain text I lost the formatting, which makes it impossible to know when an article starts and when it ends.

What is the best way to approach this problem? Right now I'm thinking it is impossible.

I am currently using Python for this project with the PDF library pdfminer. Here is one of the PDFs: http://www.gulf-times.com/PDFLinks/streams/2011/2/27/2_418617_1_255.02.11.pdf
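For reference, a minimal sketch of the extraction step, assuming the pdfminer.six high-level API (older pdfminer releases only expose a lower-level interface, and the directory name here is just a placeholder):

import glob
import os
from pdfminer.high_level import extract_text

def pdf_to_text(pdf_path):
    # Returns the plain text of one newspaper PDF; layout information is lost.
    return extract_text(pdf_path)

# Dump every PDF to a .txt file for later processing.
for pdf_path in glob.glob("newspapers/*.pdf"):
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as fd:
        fd.write(pdf_to_text(pdf_path))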

Jiyda Moussa
There is no reasonable tool in the Python world for doing what you want. Most tools drop layout information and, in addition, boxes of text are not necessarily linked together in a way that tells you what belongs to what. There may be some expensive commercial tools, but no suitable tools are available as open source. –  Jan 12 '13 at 07:27

1 Answer


Depending on the format of the text you might be able to come up with some sort of heuristic which identifies a headline - say, it's a line on its own with fewer than 15 words and it doesn't contain a full stop/period character. This will get confused by things like the name of the newspaper, but hopefully they won't have significant amounts of "non-headline" text after them to mess up the results too badly.
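As a rough illustration of that heuristic (the 15-word threshold is arbitrary and looks_like_headline is just a name for this sketch):

def looks_like_headline(line, max_words=15):
    # A short line with no full stop is probably a headline.
    words = line.split()
    return bool(words) and len(words) < max_words and "." not in line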

This relies on the conversion to text having left every article contiguous (as opposed to just ripping raw columns and mixing the articles up). If they're mixed up, I'd say you have very little chance - even if you can find a PDF library which maintains formatting, it's not necessarily easy to tell what constitutes an article's "bounding box". For example, many papers include callouts and other features which could confuse even quite an advanced heuristic.

Actually doing the counting is simple enough. If the assumptions I've mentioned hold, your code would likely end up looking something like:

import re
import string

non_word_re = re.compile(r"[^-\w']+")

article = ""
for filename in list_of_text_files:
    with open(filename, "r") as fd:
        for line in fd:
            # Split line on non-word characters and lowercase them for matching.
            words = [i.lower() for i in non_word_re.split(line)
                     if i and i[0] in string.ascii_letters]
            if not words:
                continue
            # Check for headline as the start of a new article.
            if len(words) < 15 and "." not in line:
                if article:
                    # Process previous article
                    handle_article_word_counts(article, counts)
                article = line.strip()
                counts = {}
                continue
            # Only process body text within an article.
            if article:
                for word in words:
                    counts[word] = counts.get(word, 0) + 1
    # Flush the last article in this file and reset before the next file.
    if article:
        handle_article_word_counts(article, counts)
    article = ""

You'll need to define handle_article_word_counts() to do whatever indexing of the data you want, but each key in counts will be a potential keyword (including words like "and" and "the", so you may want to drop the most frequent words or filter against a stopword list).
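Purely as an illustration, a minimal handle_article_word_counts() that tallies how many articles mention each keyword might look like this (KEYWORDS and article_counts are assumed names for this sketch, not anything from the question):

KEYWORDS = {"qatar", "economy", "football"}  # hypothetical keyword list, lowercased
article_counts = {}                          # keyword -> number of articles mentioning it

def handle_article_word_counts(article, counts):
    # Bump the per-keyword tally once per article that mentions the keyword.
    for keyword in KEYWORDS:
        if keyword in counts:
            article_counts[keyword] = article_counts.get(keyword, 0) + 1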

Basically, it depends on how accurate you want the results to be. I think the above has some chance of giving you a fair approximation, but it has all the assumptions and caveats I've already mentioned - for example, if it turns out that headlines can span multiple lines then you'll need to modify the heuristic above. Hopefully it'll give you something to build on, at least.

Cartroo