I am trying to extract text from some PDFs using the PyMuPDF library (1.19.2) in Python, but I am having trouble understanding the orientation of pages and images in the PDFs. When I open the PDF in Adobe Reader, the page appears in the correct orientation. However, when I check the page rotation in Python using the following code, I get a rotation of 270.

import fitz  # PyMuPDF
doc = fitz.open(document_name)
doc[0].rotation  # returns 270 for this document

Now when I extract an embedded image from the page using the following code

import PIL.Image  # a bare "import PIL" does not load the Image submodule
from io import BytesIO
img = doc[0].get_images()
image = PIL.Image.open(BytesIO(doc.extract_image(img[0][0])['image']))

I get an image that is rotated, consistent with the page rotation I obtained above. The image is shown below.

[Image: the extracted embedded image]

However, if I extract the pixmap of the page using the following code

PIL.Image.open(BytesIO(doc[0].get_pixmap().tobytes()))

the page appears in the same orientation as in Adobe Reader, which matches neither the orientation of the embedded image nor the rotation value returned above. This image is shown below.

[Image: the page rendered as a pixmap]

My question is: what do the rotation values mean, and how can I make sure I am extracting correctly oriented images and pages from the PDF?

1 Answer

The first key to understanding rotations in PyMuPDF is found in the following code snippet from the documentation.

>>> page.set_rotation(90)  # rotate an ISO A4 page
>>> page.rect
Rect(0.0, 0.0, 842.0, 595.0)
>>> p = fitz.Point(0, 0)  # where did top-left point land?
>>> p * page.rotation_matrix
Point(842.0, 0.0)

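So, the top-left point of the unrotated view has been moved by the rotation matrix to (842, 0), which is the top-right corner of the rotated page rectangle. In other words, the rotation value is a clockwise angle that is applied when the page is displayed.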
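To make the mapping concrete, here is a minimal pure-Python sketch of where an unrotated point lands for each of the four allowed rotation values. The function name `rotate_point` is hypothetical and this is not PyMuPDF's implementation; it only reproduces the geometry of the documentation example, assuming a clockwise rotation and a coordinate system whose origin is the top-left corner with y pointing down.

```python
def rotate_point(x, y, width, height, rotation):
    """Map a point of an unrotated page (width x height) to the rotated view.

    Hypothetical helper: assumes clockwise rotation in multiples of 90
    degrees and a top-left origin with y growing downwards.
    """
    rotation %= 360
    if rotation == 0:
        return (x, y)
    if rotation == 90:
        return (height - y, x)
    if rotation == 180:
        return (width - x, height - y)
    if rotation == 270:
        return (y, width - x)
    raise ValueError("rotation must be a multiple of 90")


# ISO A4 in points: 595 wide, 842 high (unrotated)
print(rotate_point(0, 0, 595, 842, 90))   # top-left lands on the top-right corner
print(rotate_point(0, 0, 595, 842, 270))  # top-left lands on the bottom-left corner
```

For the 90 degree case this gives (842, 0), matching the `Point(842.0, 0.0)` shown in the documentation snippet.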

Now, regarding the difference between the outputs of the two functions in your case:

  • pixmaps are by default rendered from the page rectangle, i.e. with the page rotation applied (ref). That is why the pixmap matches what Adobe Reader shows.
  • extract_image uses the xref to return the image as it is stored in the PDF, without the page rotation. You can explore the details of this image, including any transformation it carries, by running this command: fitz.image_profile(doc.xref_stream_raw(xref)). In your case, the xref is given by img[0][0]. The entries you are interested in are orientation and transform (ref).

Additionally, reading the appendix on the image transformation matrix in the PyMuPDF documentation might help you further.

I hope this helps you understand how the rotation works and, thus, how to extract images with the desired rotation (hint: check page.rotation, or set the rotation explicitly, before performing operations).
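As a rough illustration of that hint, the sketch below rotates an extracted image by the page rotation so that it matches the rendered page. The helper names `pil_angle_for_page` and `extract_upright_image` are hypothetical. It assumes the image is placed on the page without an additional rotation of its own; if the image carries its own transformation matrix, the page rotation alone will not be enough.

```python
from io import BytesIO


def pil_angle_for_page(rotation):
    """Convert a clockwise page rotation to the counter-clockwise angle
    that PIL's Image.rotate expects."""
    return (-rotation) % 360


def extract_upright_image(doc, page_index=0, image_index=0):
    """Return an embedded image rotated to match the displayed page.

    `doc` is an opened PyMuPDF document. Assumption: the image has no
    rotation of its own on the page.
    """
    from PIL import Image  # imported here so the helper above needs no PIL

    page = doc[page_index]
    xref = page.get_images()[image_index][0]  # first item of each entry is the xref
    raw = doc.extract_image(xref)["image"]    # image bytes as stored in the PDF
    image = Image.open(BytesIO(raw))
    angle = pil_angle_for_page(page.rotation)
    # expand=True enlarges the canvas so a 90 or 270 degree turn is not cropped
    return image.rotate(angle, expand=True) if angle else image
```

With the document from the question, `extract_upright_image(fitz.open(document_name))` would turn the image by 90 degrees counter-clockwise, which is the same as the page's 270 degree clockwise rotation.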