I’m parsing PDF files that list multiple shipments of items. The data includes addresses, commodity amounts, etc. I have successfully extracted the text of each file as a single string. The files are fairly consistent in their layout, but they don’t make locating data as easy as HTML or XML does. First, I’m trying to pull the number of items. The text contains multiple instances of the substring “TOTAL BOXES:”, each followed by an integer (so it looks something like “TOTAL BOXES: 3”).

My method, as seen in below code (all the way at the bottom), has been:

  1. Locate instances of the key phrase “TOTAL BOXES:”
  2. Find the index of each instance of “TOTAL BOXES:”
  3. Use the index of the last character in this substring (“:” in this case) to “move forward” 2 character positions and pull the data.

I assume there are probably more elegant solutions, and I’d be thrilled to hear them. But right now my main stumbling block with my chosen approach is:

I’m able to return each index of the key phrase as an item in a list. Then I add 2 to that index to get the “back-end” index. I now know the exact index of each place in the text where the targeted data appears. Each index is stored as a list item under my variable, instance_begin.

This is where my code falls apart and my newbiness shines bright. In an effort to get data, I do this:

for boxes in instance_begin:
    box = raw_data[(instance_begin[box]):(instance_end[box])]

Which returns the exception:

TypeError: list indices must be integers, not list

Help is appreciated.

Code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re

path = "/file.pdf"

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

raw_data = convert_pdf_to_txt(path)

key_phrase = "TOTAL BOXES:"

instance_begin = [i.end() for i in re.finditer(key_phrase, raw_data)]

instance_end = [(i + 2) for i in instance_begin]

box = raw_data[(instance_begin[box]):(instance_end[box])]
Murcielago
  • The line of code that you say is a problem isn't in your source listing, so it's of course impossible to figure out what you're doing wrong. The error message is telling you that the list index (which can only be the variable `box`) is a list, not an integer. Python is always right about such things. – Paul Cornelius Aug 08 '15 at 08:09
  • I edited my question to include the non-functional code. I understand that the list index must be an integer. My issue is using an item from a list (which is a set of integers) as the index. Any thoughts? – Murcielago Aug 08 '15 at 13:34

1 Answer

Let me summarize my understanding of your problem. You have a long string named raw_data. You want to slice certain 2-character sequences from this string. The indices where these slices begin are stored in a list instance_begin. If that is correct, here is a one-line solution:

box = [raw_data[i:i+2] for i in instance_begin]

At the end of this statement box is the desired list of two-character strings. The list instance_end is not necessary. Apologies if I still misunderstand your problem.
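
One caveat with a fixed two-character slice: given text like “TOTAL BOXES: 3”, the slice starts at the space after the colon, so it returns " 3" and would cut off a count with two or more digits. As a rough sketch of an alternative, a regular expression can capture the digits directly. The sample string below is made up purely for illustration and stands in for the text extracted from the PDF; the names sliced and counts are also just placeholders.

```python
import re

# Hypothetical sample standing in for the text extracted from the PDF.
raw_data = "SHIP TO: A\nTOTAL BOXES: 3\nSHIP TO: B\nTOTAL BOXES: 12\n"

key_phrase = "TOTAL BOXES:"

# Fixed-width approach: take the two characters after each match.
instance_begin = [m.end() for m in re.finditer(re.escape(key_phrase), raw_data)]
sliced = [raw_data[i:i + 2] for i in instance_begin]

# Alternative: capture the run of digits after the key phrase,
# skipping any whitespace, and convert each one to an int.
pattern = re.escape(key_phrase) + r"\s*(\d+)"
counts = [int(n) for n in re.findall(pattern, raw_data)]

print(sliced)
print(counts)
```

With this sample, the fixed-width slices keep the leading space and truncate "12", while the captured values come back as whole integers. Whether the real PDFs always put the number on the same line as the key phrase is an assumption worth checking against your data.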

Paul Cornelius