How to extract data from unstructured PDFs using PyMuPDF in Python?

Question

I am following this guide on how to extract data from Unstructured PDFs using PyMuPDF.

https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/

When I run the code from the guide, I get AttributeError: 'NoneType' object has no attribute 'rect'. I am new to Python and not sure what is going wrong.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-7f394b979351> in <module>
      1 first_annots=[]
      2 
----> 3 rec=page1.first_annot.rect
      4 
      5 rec

AttributeError: 'NoneType' object has no attribute 'rect'

Code

import fitz
import pandas as pd 
doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')
page1 = doc[0]
words = page1.get_text("words")
words[0]

first_annots=[]

rec=page1.first_annot.rect

rec

#Information of words in first object is stored in mywords

mywords = [w for w in words if fitz.Rect(w[:4]) in rec]

ann= make_text(mywords)

first_annots.append(ann)

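def make_text(words):
    # Group words by the bottom y-coordinate of their bounding box (one key per text line)
    line_dict = {}
    # Sort by x0 so the words of each line end up in left-to-right order
    words.sort(key=lambda w: w[0])
    for w in words:
        y1 = round(w[3], 1)
        word = w[4]
        line = line_dict.get(y1, [])
        line.append(word)
        line_dict[y1] = line
    lines = list(line_dict.items())
    lines.sort()  # top-to-bottom order
    return "\n".join([" ".join(line[1]) for line in lines])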

print(rec)
print(first_annots)

oh...I just ended up using another pdf because I was using it as a guide — shuynh84, May 09 '22 at 22:45
I think so...would it cause a code error AttributeError: 'NoneType' object has no attribute 'rect'? — shuynh84, May 10 '22 at 02:43
I am facing the same error too : AttributeError: 'NoneType' object has no attribute 'rect' — Mech_Saran, Oct 28 '22 at 00:38

david · Answer 1 · 2022-08-27T06:56:23.050


Right after this line:

doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')

add a check for whether the PDF contains any annotations. If it contains none, page1.first_annot is None, and accessing .rect on it raises the AttributeError you are seeing.

if doc.has_annots():
    print("has annots")
else:
    print("no annots")
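
Note that doc.has_annots() reports on the whole document, so an individual page can still have no annotations. The same guard can be applied per page before touching .rect. What follows is only a minimal sketch: it assumes a page object that exposes first_annot and get_text("words") the way PyMuPDF pages do, the helper name words_in_first_annot is made up for illustration, and the containment test is spelled out with plain coordinate comparisons rather than the `in` operator used in the question.

```python
def words_in_first_annot(page):
    """Return the word tuples lying inside the page's first annotation.

    Hypothetical helper: returns an empty list when the page has no
    annotations instead of raising AttributeError.
    """
    annot = page.first_annot  # None when the page has no annotations
    if annot is None:
        return []
    rec = annot.rect
    # Each word tuple starts with its bounding box: (x0, y0, x1, y1, text, ...)
    return [
        w for w in page.get_text("words")
        if rec.x0 <= w[0] and rec.y0 <= w[1] and w[2] <= rec.x1 and w[3] <= rec.y1
    ]
```

With the question's variables this would be called as mywords = words_in_first_annot(page1). An empty result means either that the page has no annotation or that no words fall inside it, so the PDF itself is worth checking in that case.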

edited Aug 27 '22 at 06:56

answered Aug 27 '22 at 06:53
