(I know that pdfplumber is mainly geared towards computer-generated PDFs. However, before I spend a couple of days hand-typing data from my scanned PDFs, I thought I'd ask whether pdfplumber could somehow help me.)

My problem:
I have scanned PDFs from historical books.
Example: Data from statistical yearbook
Now I'm trying to extract the table (the one in the lower-right in the example) from the scanned PDF.

My first attempts at extracting the table with pdfplumber didn't work.
e.g.

import pdfplumber

with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    print(tables)

returned None

Is there any hope that I will be able to extract this kind of data non-manually? Or should I just suck it up?

Thanks in advance for any help or advice!

Tototulbi
    Thanks a lot for your help! I scanned the books myself. I didn't really notice the bleed-through as a problem. Doing it again I could simply add a blank sheet in between. However, rescanning all the books would cost me at least a day and some transportation costs. Typing everything will only take me 2-4 days. So it seems unlikely that trying hard is worth the time. Considering this I might just play the typist (typist == programmer in my case ;-) ). – Tototulbi Nov 18 '21 at 16:02

1 Answer

No. A scanned PDF actually contains just an image of each page, not text. You can extract that image as shown below, but that alone will not give you the data. To get the data you would need a tool that can analyze the image (OCR), but that's a different story.

from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

# Walk every page and save each embedded image to its own file.
for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        # extract_to picks the file extension and returns the written path.
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

Also, this question can help you understand which tools to use, and how, if you really do need to extract that data.
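
If you do go the image-analysis route, the step after OCR is usually turning recognized words back into table rows. Below is a minimal sketch of that step only, assuming some OCR tool (Tesseract, for instance) has already given you each word with its pixel position. The dict keys `text`, `left` and `top`, the function name `group_words_into_rows` and the `row_tolerance` default are all hypothetical, chosen for illustration. Treat it as a starting point, not a tested pipeline for your scans.

```python
def group_words_into_rows(words, row_tolerance=10):
    """Group OCR word boxes into table rows by vertical position.

    `words` is a list of dicts with the assumed keys "text", "left"
    and "top" (pixel coordinates of the word's top-left corner).
    """
    rows = []
    # Process words top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        # A word joins the current row if it sits within `row_tolerance`
        # pixels of the first word in that row; otherwise it starts a new row.
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the cells from left to right.
    return [[w["text"] for w in sorted(row, key=lambda w: w["left"])]
            for row in rows]


# Made-up word boxes, slightly misaligned as on a scanned page.
sample_words = [
    {"text": "1900", "left": 10, "top": 100},
    {"text": "523", "left": 120, "top": 103},
    {"text": "1901", "left": 11, "top": 140},
    {"text": "541", "left": 121, "top": 138},
]
table = group_words_into_rows(sample_words)
print(table)
```

On a skewed or bleed-through scan the tolerance would need tuning, and misread digits would still have to be checked by hand, so for a handful of pages typing may well remain the faster option.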

Alexandru DuDu