I am trying to iterate through each line of a PDF page using the PyMuPDF library and check the length of each line. If a line has fewer than 10 words, I would like to add a full stop to it. The pseudocode would be:

# loop through the lines of the PDF
# check number of words in line
# if line has fewer than 10 words
# add period

Real code below:

import fitz  # PyMuPDF

myfile = "my.pdf"
doc = fitz.open(myfile)
for page in doc:
    # getText is named get_text in newer PyMuPDF releases
    text = page.getText("text")
    print(text)

When I add another for loop, e.g. for line in page:

I get an error saying page is not iterable. Is there any other way I can do this?

Thanks

1 Answer

To iterate over the lines of a page, you can use getDisplayList:

page_display = page.getDisplayList()
dictionary_elements = page_display.getTextPage().extractDICT()
for block in dictionary_elements['blocks']:
    for line in block['lines']:
        # a line is made of one or more spans; join their text
        line_text = ''
        for span in line['spans']:
            line_text += ' ' + span['text']
        print(line_text)
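
Once you have the text of each line, the word-count check from the question could look roughly like the sketch below. It works on a plain string so it runs without a PDF. The helper name add_full_stops and the max_words parameter are made up for illustration, and skipping lines that already end in ".", "!" or "?" is my own assumption about what you want.

```python
def add_full_stops(text, max_words=10):
    """Append a period to every non-empty line with fewer than max_words words.

    Hypothetical helper, not part of PyMuPDF.
    """
    fixed = []
    for line in text.splitlines():
        stripped = line.rstrip()
        words = stripped.split()
        # Assumption: leave lines alone if they already end a sentence.
        already_ended = stripped.endswith((".", "!", "?"))
        if words and len(words) < max_words and not already_ended:
            stripped += "."
        fixed.append(stripped)
    return "\n".join(fixed)


sample = (
    "Introduction\n"
    "This line has many more than ten words in it so it stays unchanged\n"
    "Done."
)
print(add_full_stops(sample))
```

You could pass it the string returned by page.getText("text") from the question, or apply the same check to each line_text inside the loop above.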