I’m parsing PDF files that show info for multiple different shipments of items. The data includes addresses, commodity amounts, etc. I have successfully pulled the string of text that constitutes the substance of each file. The files are relatively consistent in their presentation, but they don’t offer the ease of locating data that HTML or XML does. Firstly, I’m trying to pull the number of items. In the text, there are multiple instances of the sub-string “TOTAL BOXES:”. After each one, there is an integer (so it looks something like this: “TOTAL BOXES: 3”).
My method, as seen in the code below (all the way at the bottom), has been:
- Locate instances of the key phrase “TOTAL BOXES:”
- Find the index of each instance of “TOTAL BOXES:”
- Use the index of the last character in this sub-string (“:” in this case) to “move forward” 2 character index positions to pull the data.
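A minimal sketch of these steps on a made-up string may make the idea concrete. The `sample` text and the names `begins` and `ends` are hypothetical stand-ins, not taken from the actual PDFs, and the 2-character step assumes a single space followed by a single-digit count:

```python
import re

# Hypothetical sample standing in for the text extracted from a PDF.
sample = "SHIP TO: A\nTOTAL BOXES: 3\nSHIP TO: B\nTOTAL BOXES: 7\n"
key_phrase = "TOTAL BOXES:"

# Index just past the ":" of every occurrence of the key phrase.
begins = [m.end() for m in re.finditer(re.escape(key_phrase), sample)]

# Move forward 2 characters: one space plus one digit (single-digit counts only).
ends = [b + 2 for b in begins]

# Each (begin, end) pair marks the slice that holds the count.
first_count = sample[begins[0]:ends[0]]
print(begins, ends, repr(first_count))
```

Note that a count of 10 or more would be cut off by the fixed 2-character step.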
I assume there are probably more elegant solutions, and I’d be thrilled to hear them. But right now my main stumbling block with my chosen approach is:
I’m able to return each index of the key phrase as an item in a list. Then I add 2 to that index to get the “back-end” index. I now know the exact index of each place in the text where the targeted data appears. Each starting index is stored as a list item under my variable instance_begin, and each ending index under instance_end.
This is where my code falls apart and my newbiness shines bright. In an effort to get data, I do this:
for boxes in instance_begin:
    box = raw_data[(instance_begin[box]):(instance_end[box])]
Which returns the exception:
TypeError: list indices must be integers, not list
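The message indicates that whatever is used inside the square brackets is a list rather than an integer. One way to sidestep indexing entirely is to pair each start index with its end index using zip. The following is only a sketch under assumptions: the `raw_data` string here is a hypothetical stand-in for the extracted PDF text, and `boxes` is an illustrative name for the collected results:

```python
import re

# Hypothetical stand-in for the text extracted from the PDF.
raw_data = "TOTAL BOXES: 3\nTOTAL BOXES: 7\n"

instance_begin = [m.end() for m in re.finditer("TOTAL BOXES:", raw_data)]
instance_end = [i + 2 for i in instance_begin]

boxes = []
# zip yields (begin, end) integer pairs, so no list is ever used as an index.
for begin, end in zip(instance_begin, instance_end):
    # int() ignores the leading space in slices such as " 3".
    boxes.append(int(raw_data[begin:end]))

print(boxes)
```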
Help is appreciated.
Code:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re
path = "/file.pdf"
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
raw_data = convert_pdf_to_txt(path)
key_phrase = "TOTAL BOXES:"
instance_begin = [i.end() for i in re.finditer(key_phrase, raw_data)]
instance_end = [(i + 2) for i in instance_begin]
box = raw_data[(instance_begin[box]):(instance_end[box])]