Lost information getting pdf page as image

Question

I am not an expert by any means. I am trying to extract a PDF page as an image so I can process it later. I put together the following code from other recommendations on this site.

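import os

import fitz  # PyMuPDF
from PIL import Image


dir = r'C:\Users\...'
files = os.listdir(dir)
# Join with os.path.join so the path separator is not lost
path = os.path.join(dir, files[21])
print(path)
doc = fitz.open(path)
page = doc.loadPage(2)  # page numbers are 0-based, so this is the third page
zoom = 2  # scale factor relative to the default rendering
mat = fitz.Matrix(zoom, zoom)
pix = page.getPixmap(matrix=mat)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

density = img.getdata()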

Usually this gives me the pixel data of the image, but in this case it returns a list of white pixels. I have no idea why. The image (img) displays correctly when I show it, but its data does not seem to match.

I would appreciate any help.
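One possible explanation, not confirmed anywhere in this thread, is that the page starts with a white margin, so the first values shown from the pixel data are all (255, 255, 255) even though darker pixels exist further down. A minimal sketch for checking the whole buffer instead of just its beginning is given here. It uses only the standard library and works on a flat RGB byte buffer such as pix.samples, assuming three channels and no alpha. The helper name non_white_fraction and the demo buffer are made up for illustration.

```python
def non_white_fraction(samples: bytes, channels: int = 3) -> float:
    """Return the fraction of pixels in a flat buffer that are not pure white.

    `samples` is assumed to hold `channels` bytes per pixel (RGB by default).
    """
    if channels <= 0 or len(samples) % channels != 0:
        raise ValueError("buffer length must be a multiple of the channel count")
    total = len(samples) // channels
    if total == 0:
        return 0.0
    white = bytes([255]) * channels
    # Walk the buffer one pixel at a time and count everything that is not white
    non_white = sum(
        1
        for i in range(0, len(samples), channels)
        if samples[i:i + channels] != white
    )
    return non_white / total


# Hypothetical stand-in for pix.samples: three white pixels, then one black pixel
demo = bytes([255, 255, 255] * 3 + [0, 0, 0])
print(non_white_fraction(demo))  # 0.25
```

If the result for the real buffer is greater than zero, the pixel data is not lost; only the leading part of it is white.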

Any reason you use fitz, rather than pdf2image? Also, is direc defined somewhere? — asylumax, Jun 02 '20 at 23:01
I will look into it. I use fitz just because it was the first library I found useful for what I'm doing. direc is actually dir, sorry about that. In the end, I only had to replace doc.loadPage with doc.getPagePixmap and then apply Image.frombytes directly. I still don't know why it doesn't work the long way, and I would still like to use that version, as I need to resize my image. Thanks — José Chamorro, Jun 04 '20 at 02:52

asylumax · Answer 1 · 2020-06-02T23:58:05.907

If you want to convert a PDF to images and then process them, you might use something along these lines. This simple example converts the PDF, saves the first 5 pages, and for the last of those pages computes what fraction of the image is a particular color, first the slow way and then the fast way.


import pdf2image
import numpy as np

# details:
# https://pypi.org/project/pdf2image/
images = pdf2image.convert_from_path('test.pdf')

# Save the first five pages, just for testing
for i, image in enumerate(images[:5], start=1):
    print(i, " size: ", image.size)
    image.save('output' + str(i) + '.jpg', 'JPEG')

color_test=(128,128,128)
other=0
specific_color=0

# Look at the last image handled in the loop above
for i in range(image.width):
    for j in range(image.height):
        x=image.getpixel((i,j))
        if(x[0]==color_test[0] and x[1]==color_test[1] and x[2]==color_test[2]):
            specific_color=specific_color+1
        else:
            other=other+1

print("frac of specific color = ", specific_color/(specific_color+other))

# faster!
x=np.asarray(image)
a=np.where(np.all(x==color_test,axis=-1))
print("(faster) frac of color = ", len(a[0])/((image.width)*(image.height)))

José Chamorro · Accepted Answer · 2020-06-07T02:57:38.797

The code works if I take a shorter path and replace doc.loadPage and page.getPixmap with a single call to doc.getPagePixmap:

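import os

import fitz  # PyMuPDF
from PIL import Image


dir = r'C:\Users\...'
files = os.listdir(dir)
# Join with os.path.join so the path separator is not lost
path = os.path.join(dir, files[21])
print(path)
doc = fitz.open(path)
pix = doc.getPagePixmap(2)  # renders the page at the default resolution
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

density = img.getdata()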

I still don't know why the longer code fails, and the working method doesn't allow me to get a higher-resolution version of the extracted page.
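On the resolution side, the zoom value passed to fitz.Matrix(zoom, zoom) in the question is a scale factor. PDF page sizes are measured in points (72 per inch), and the default rendering produces roughly one pixel per point, so a zoom of 2 corresponds to about 144 dpi. A rough sketch of that arithmetic is given here. The helper names zoom_for_dpi and approx_pixel_size are made up for illustration, and the exact pixmap size may differ by a pixel because of rounding inside the library.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 72 points per inch


def zoom_for_dpi(dpi: float) -> float:
    """Return the scale factor that gives the requested dots per inch."""
    return dpi / PDF_POINTS_PER_INCH


def approx_pixel_size(width_pt: float, height_pt: float, zoom: float) -> tuple:
    """Estimate the rendered size in pixels for a page measured in points."""
    return round(width_pt * zoom), round(height_pt * zoom)


zoom = zoom_for_dpi(144)
print(zoom)  # 2.0
# A US Letter page is 612 x 792 points
print(approx_pixel_size(612, 792, zoom))  # (1224, 1584)
```

Comparing pix.width and pix.height against such an estimate is a quick way to confirm whether the matrix was actually applied to the rendering.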
