
I am trying to extract data from a PDF, but I keep getting a TypeError saying my object is not iterable, raised on the statement for line in text:. I don't understand why 'text' has no value: just above that statement I create the text object using text = page.extract_text(), and then I want to iterate through each line of the text to find matches for my regexes.

I'm afraid that my statement for line in text: is the problem; perhaps using 'line' isn't appropriate, but I don't know what else to do.

My code is below, thanks for looking!

import requests
import pdfplumber
import pandas as pd
import re
from collections import namedtuple

Line = namedtuple('Line', 'gbloc_name contact_type email')

gbloc_re = re.compile(r'^(?:a\.\s[A-Z]{5}\:\s[A-Z]{4})')

line_re = re.compile(r'^[^@\s]+@[^@\s]\.[^@\s]+$')

file = 'sampleReport.pdf'
  
lines=[]

with pdfplumber.open(file) as pdf:
    pages = pdf.pages 
    for page in pdf.pages: 
        text = page.extract_text() 
        for line in text: 
            gbloc = gbloc_re.search(line) 
            if gbloc:
                gbloc_name = gbloc

            elif line.startswith('Outbound'):
                contact_type = 'Outbound'
            
            elif line.startswith('Tracing'):
                contact_type = 'Tracing'
            
            elif line.startswith('Customer'):
                contact_type = 'Customer Service'

            elif line.startswith('QA'):
                contact_type = 'Quality Assurance'
            
            elif line.startswith('NTS'):
                contact_type = 'NTS'

            elif line.startswith('Inbound'):
                contact_type = 'Inbound'
            
            elif line_re.search(line):
                items = line.split()
                lines.append(Line(gbloc_name, contact_type, *items))

2 Answers


Try looping directly over the value returned by page.extract_text(). Like this:

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        for line in page.extract_text():
            ...  # match the regexes against line here
DapperDuck
  • Thank you, but the same error occurred. I changed the code to: `lines=[] with pdfplumber.open(file) as pdf: for page in pdf.pages: for line in page.extract_text(): gbloc = gbloc_re.search(line) if gbloc: gbloc_name = gbloc` – Don Carroll Jan 10 '21 at 01:57
  • I can probably update my answer with some testing. Please include the pdf file you are using – DapperDuck Jan 10 '21 at 01:59
  • Thank you! the file is here: [link](https://move.mil/sites/default/files/inline-files/CONUS%20PERSONAL%20PROPERTY%20CONSIGNMENT%20INSTRUCTION%20GUIDE%20%28January%202021%20v32%29.pdf) – Don Carroll Jan 10 '21 at 02:02
  • Where exactly does the error occur? Because I ran the code, and it worked fine for me – DapperDuck Jan 10 '21 at 02:59
  • Thank you @DapperDuck, I did need to 'if not text: continue' to get it to work - I'm not sure why it worked for you without that - but I very much appreciate your help! – Don Carroll Jan 10 '21 at 15:39
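
For later readers, the fix that came out of the comments above can be sketched as follows. This is a minimal sketch, not tested against the linked PDF: iter_text_lines is a hypothetical helper name, and it assumes that page.extract_text() can return None (or an empty string, depending on the pdfplumber version) for a page with no extractable text. It also splits the text into lines, since looping over a string directly yields single characters rather than lines.

```python
from typing import Iterator, Optional


def iter_text_lines(text: Optional[str]) -> Iterator[str]:
    """Yield the lines of one page's extracted text.

    Hypothetical helper: yields nothing when the page produced no text,
    which is the same effect as 'if not text: continue' in the page loop.
    """
    if not text:  # covers both None and an empty string
        return
    # Iterating over a str yields characters, so split into lines first.
    yield from text.splitlines()


# Stand-in values for what page.extract_text() might return (assumed data).
sample_pages = ["Outbound\nfoo@example.com", None, ""]

for page_text in sample_pages:
    for line in iter_text_lines(page_text):
        print(line)
```

In the question's code, this would mean replacing for line in text: with for line in iter_text_lines(text): and leaving the rest of the loop body unchanged.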

I used the PyPDF2 library to extract text from a PDF. Here is a simple example that extracts the content page by page.

import PyPDF2

# Uses the PdfReader API; PyPDF2 3.0 dropped support for the older
# PdfFileReader / numPages / getPage / extractText names.
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    print(len(reader.pages))  # total number of pages
    for i, page in enumerate(reader.pages):
        print("Page: ", i)
        print(page.extract_text())


Please try it and let me know if you run into any issues.