I know there are many libraries for extracting text from PDFs. Specifically, I've been having some difficulty with PyMuPDF.
From the documentation here: https://pymupdf.readthedocs.io/en/latest/app4.html#sequencetypes
I was hoping to use select() to pick a range of pages, and then use getText() to extract their text.
This is the document I am using: linear_regression.pdf. Here is my code:
import fitz
s = [1, 2]
doc = fitz.open('linear_regression.pdf')
selection = doc.select(s)
text = selection.getText(s)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-c05917f260e7> in <module>()
6 # print(selection)
7 # text = doc.get_page_text(3, "text")
----> 8 text = selection.getText(s)
9 text
AttributeError: 'NoneType' object has no attribute 'getText'
So I'm assuming I'm not using select() correctly. What is the right way to select a range of pages and get their text?
Thanks so much!