At the moment my code extracts text from a PDF and counts word frequency. I've been trying for a while to print the words in order of frequency but haven't been able to. I've looked at several similar answers but can't find one that I can get to work. Can someone point out what I need to do?

import PyPDF2
import re


pdfFileObj = open('ch8.pdf', 'rb') #Open the File
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) #Read the file
frequency = {} #Create dict

print "Number of Pages %s " % pdfReader.numPages #Print Num Pages

pageObj = pdfReader.getPage(0) # Get the first page
match_pattern = re.findall(r'\b[a-z]{3,15}\b', pageObj.extractText()) #Find the text

for word in match_pattern: #Start counting the frequency
    word = word.lower()
    count = frequency.get(word,0)
    frequency[word] = count + 1


frequency_list = frequency.keys() 

for words in frequency_list:
    print words, frequency[words]

Thanks in Advance.

Trent
  • Have you tried using `Counter`? You can run a counter on the words and then sort with `most_common`. Here's some info on it: https://docs.python.org/2.7/library/collections.html#collections.Counter.most_common – serk Feb 17 '17 at 01:06
  • Lazy title (could be used for every question on SO!), lazy question. Basic troubleshooting: start with the simplest possible input, see what your code does with that. If you still can't figure out what's going on, provide your input, your output, what output you were expecting, what you've tried, and what happened when you tried it. – Jonathan March Feb 17 '17 at 01:10
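
The `Counter` suggestion in the first comment can be sketched as follows. This is a minimal sketch, not the asker's code: the `sample_text` string is an assumption that stands in for the output of `pageObj.extractText()`, and `count_words` is a hypothetical helper name. The text is lowercased before matching so that capitalized words are not skipped by the `[a-z]` pattern. The `print` calls are written so they behave the same on Python 2.7 and Python 3.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter of lowercase words that are 3 to 15 letters long."""
    # Lowercase first: the [a-z] pattern would otherwise skip capitalized words.
    words = re.findall(r'\b[a-z]{3,15}\b', text.lower())
    return Counter(words)


# Assumption: sample_text stands in for pageObj.extractText().
sample_text = "The cat sat. The cat ran. The dog sat."
word_counts = count_words(sample_text)

# most_common() yields (word, count) pairs, highest count first.
for word, count in word_counts.most_common():
    print("%s %d" % (word, count))
```

Passing a number, as in `word_counts.most_common(10)`, limits the output to that many of the most frequent words.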

1 Answer

Looking at your Python, everything looks fine both logically and syntactically. I'd assume something is going wrong with your extraction method, because I tried this code with a couple of minor changes on a PDF of 4 words and none were scraped. I have no experience with PyPDF2, so I can't offer much more advice than to try a different extraction method for the text if possible.
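
One possible reason a small test PDF yields no words, offered as an assumption rather than a diagnosis: the pattern `[a-z]{3,15}` is applied before `word.lower()` runs, so any word that contains a capital letter or is shorter than three letters never matches. The sketch below uses a made-up string in place of the extracted page text to show that effect, and then sorts a plain frequency dict by count with `sorted`, which is the ordering the question asks for. `sample_text` is illustrative only.

```python
import re

# Assumption: sample_text stands in for the text extracted from the PDF page.
sample_text = "Hello World Hello PDF Of"

pattern = r'\b[a-z]{3,15}\b'

# Matching the raw text finds nothing: every word here starts with a capital
# letter or is all uppercase, so [a-z] never covers a whole word.
raw_matches = re.findall(pattern, sample_text)

# Lowercasing before matching keeps the words of three or more letters.
lowered_matches = re.findall(pattern, sample_text.lower())

# Build the frequency dict the same way the question does.
frequency = {}
for word in lowered_matches:
    frequency[word] = frequency.get(word, 0) + 1

# Sort (word, count) pairs by count, highest first; ties break alphabetically.
by_count = sorted(frequency.items(), key=lambda pair: (-pair[1], pair[0]))

for word, count in by_count:
    print("%s %d" % (word, count))
```

If the extracted text really is empty, the cause lies in the PDF extraction itself and not in the counting or sorting code.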