While trying to follow the daily proceedings of a Parliament, I found that the records are split across many PDF files that the browser cannot open directly; each one has to be downloaded individually. My plan is to download all of the documents and extract the titles of all the decisions taken.

Previous threads suggest using PyPDF2, but it does not work at all in my case. The text in the PDF is in Greek, so perhaps the encoding has something to do with it. In addition, there are some pictures at the end of the document, which are of no interest to me.

Is there any chance PyPDF2 can pull this off or should I look elsewhere?

ti7
Akenaten
  • Can you provide a link to the PDF? – Alderven Jan 15 '19 at 09:33
  • https://ufile.io/jr16v – Akenaten Jan 15 '19 at 09:35
  • what do you actually want to do with the document? do you just want the text content? do you care about layout/formatting? – Sam Mason Jan 15 '19 at 09:50
  • I managed it with pdfminer. Someone here figured out how to use it as a library and it works just fine. I wanted to extract the entire text and then cut out the snippets that are of interest. – Akenaten Jan 15 '19 at 10:30
  • @Akenaten - I strongly suggest rethinking how you post here, especially with usage of terms such as "retarded". And leave the political rants at home, not here. – David Makogon Jan 15 '19 at 11:38
  • @DavidMakogon Yes, you are right. My apologies for drifting off topic. – Akenaten Jan 15 '19 at 12:46
  • I reworded this to be more neutral after your Question appeared in the review queue from a title change. Feel free to revert if my change isn't appropriate! – ti7 Sep 14 '20 at 15:26
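
The pdfminer route mentioned in the comments above might look roughly like the sketch below. It assumes the pdfminer.six package is installed and uses its `extract_text` helper. The `find_titles` function and the `ΘΕΜΑ` marker are hypothetical: they stand in for whatever text actually introduces a decision title in these documents, which the thread does not show.

```python
def pdf_to_text(path):
    """Return the full text of a PDF using pdfminer.six (assumed installed)."""
    # Imported lazily so find_titles() can be used without pdfminer present.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def find_titles(text, marker="ΘΕΜΑ"):
    """Return the stripped lines that start with `marker`.

    The default marker is a placeholder, not taken from the real documents.
    """
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line.startswith(marker)]


# Possible usage, once the real marker is known:
# titles = find_titles(pdf_to_text("document.pdf"))
```

Pulling the whole text out first and filtering afterwards keeps the title rule easy to adjust without re-parsing the PDFs.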

1 Answer

If you're just after the text: it seems that PyPDF2 doesn't support CMaps, so you'll get garbage back if you try something like:

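# Legacy PyPDF2 1.x API; later releases renamed these to PdfReader, pages[0] and extract_text()
from PyPDF2 import PdfFileReader

with open('document.pdf', 'rb') as fd:
    pdf = PdfFileReader(fd)
    p1 = pdf.getPage(0)  # first page of the document
    # Without CMap support, the Greek text comes back as garbage
    print(p1.extractText())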

There's an open pull request to fix this. It hasn't been merged, but you could pull that code out if you want it, as it looks pretty self-contained.
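
To illustrate why missing CMap support produces garbage: PDFs with embedded fonts often store glyph codes rather than characters, and a ToUnicode CMap maps those codes back to Unicode. The toy sketch below shows the idea only. The CMap fragment and glyph codes are invented, not taken from the document in question, and the parser handles only simple `bfchar` pairs with 2-byte codes (real CMaps also have `bfrange` sections and codespace ranges).

```python
import re

# Invented ToUnicode fragment: <glyph code> <UTF-16BE code point>
SAMPLE_CMAP = """
5 beginbfchar
<0001> <0391>
<0002> <03B8>
<0003> <03AE>
<0004> <03BD>
<0005> <03B1>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Build a {glyph_code: character} table from bfchar pairs."""
    table = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for src, dst in pairs:
        # Destination values are UTF-16BE encoded.
        table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


def decode_glyphs(raw, table):
    """Decode a byte string of 2-byte glyph codes using the table."""
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Unknown codes become the Unicode replacement character.
    return "".join(table.get(code, "\ufffd") for code in codes)


raw = bytes.fromhex("00010002000300040005")
print(repr(raw.decode("latin-1")))                    # control characters, not Greek
print(decode_glyphs(raw, parse_bfchar(SAMPLE_CMAP)))  # readable text via the CMap
```

An extractor that skips the lookup step is left interpreting the raw codes directly, which is why the output is unreadable even though the PDF displays correctly.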

Sam Mason
  • The patch is 4 years old. It is a shame it hasn't been accepted. How can I gently nudge the maintainers? – Massimo Jan 04 '20 at 10:47
  • @Massimo PyPDF2 looks somewhat unmaintained, so you could try another fork. [PyPDF4](https://github.com/claird/PyPDF4) looks in better shape but is still somewhat unmaintained. You could even try one of the more [recent forks](https://github.com/claird/PyPDF4/network). Submitting an updated patch to one with an active maintainer is your best bet – Sam Mason Jan 04 '20 at 14:25