How to extract text from a selection of pages in a larger PDF using PyMuPDF?

Question

I know there are many libraries for extracting text from PDFs, but I've been having some difficulty with PyMuPDF specifically. Based on the documentation here: https://pymupdf.readthedocs.io/en/latest/app4.html#sequencetypes I was hoping to use select() to pick an interval of pages and then call getText() on the result. The document I am using is linear_regression.pdf.

import fitz
s = [1, 2]
doc = fitz.open('linear_regression.pdf')
selection = doc.select(s)
text = selection.getText(s)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-c05917f260e7> in <module>()
      6 # print(selection)
      7 # text = doc.get_page_text(3, "text")
----> 8 text = selection.getText(s)
      9 text

AttributeError: 'NoneType' object has no attribute 'getText'

So I'm assuming I'm not using select() correctly. Thanks so much!

The problem is `doc.select(s)` is returning `None`. However, you do not define `doc` here so it's unclear why this is. Please edit your question to provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — Kraigolas, Jun 01 '21 at 03:20
Thanks, @Kraigolas, I edited the post based on your feedback — Katie Melosto, Jun 01 '21 at 04:27

score 3 · Accepted Answer · answered Jun 01 '21 at 04:45

According to the documentation, select modifies doc in place and does not return anything. In Python, a function that does not explicitly return a value returns None, so selection is None here, which is why you see that error.
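The same pattern shows up with built-in methods that mutate in place. A minimal sketch, using list.sort() purely as a stand-in for doc.select() (the variable names are illustrative and not taken from the question):

```python
# list.sort() reorders the list in place and returns None,
# the same pattern described above for doc.select().
pages = [3, 1, 2]
result = pages.sort()

print(result)  # None
print(pages)   # [1, 2, 3]

# Calling a method on the returned None raises the same kind of
# AttributeError as in the question's traceback.
try:
    result.getText()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

So the value to keep working with is the object that was modified (pages here, doc in the question), not the return value of the call.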

However, Document provides a method called get_page_text which allows you to get the text from a specific page (0 indexed). So for your example, you could write:

import fitz
s = [1, 2] # pages 2 and 3
doc = fitz.open('linear_regression.pdf')
text_by_page = [doc.get_page_text(i) for i in s]

Now, you have a list, where each item in the list is the text from a different desired page. A simple way to convert this to a string is:

text = ' '.join(text_by_page)

which joins the two pages with a space between the last word of the first page and the first word of the next (as if there were no page break at all).
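If you need this in more than one place, the two steps above could be wrapped in a small helper. This is only a sketch: extract_pages and its sep parameter are hypothetical names, and it assumes doc exposes page_count and get_page_text(pno) the way a PyMuPDF Document does.

```python
def extract_pages(doc, page_numbers, sep="\n"):
    """Return the text of the given 0-indexed pages, joined by `sep`.

    `doc` is assumed to behave like a PyMuPDF Document, i.e. to provide
    `page_count` and `get_page_text(pno)`.
    """
    # Fail early with a clear message instead of a library-level error.
    for pno in page_numbers:
        if not 0 <= pno < doc.page_count:
            raise IndexError(
                f"page {pno} out of range (0..{doc.page_count - 1})"
            )
    # Same list-then-join idea as above, with a configurable separator.
    return sep.join(doc.get_page_text(pno) for pno in page_numbers)
```

With the document from the question, a call might look like extract_pages(fitz.open('linear_regression.pdf'), [1, 2]). Passing sep=' ' reproduces the join shown above, while the newline default keeps a visible break between pages.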

This is super helpful @Kraigolas I have a better sense of how to use pyMupdf — Katie Melosto, Jun 01 '21 at 04:52
