
Upd: solved, see the comments below.

When I try to read the contents of some PDF files, I get an empty string. I have noticed that this happens for PDF files whose encoding is reported as `none`, while it works fine for PDF files whose encoding is reported as `base64`. The other suspect is file size: perhaps PyGithub fails to read large files. Obviously, without reading the file I cannot apply OCR.

This happens when I read entire directories on GitHub and copy them to another cloud storage. I am not concerned with any one PDF file in particular.

The alternative to PyGithub is calling the REST API through the requests package; I will try it later.

The PDF file I used is this one, and the same happens with other PDF files that use languages with special characters.

from github import Github

# token and repo_name are defined elsewhere
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_raw = repo.get_contents("20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf")
# Reported file size, length of the returned content, and its encoding
print(cont_raw.size, len(cont_raw.content), cont_raw.encoding)
# output: 1283429 0 none
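
The output above is consistent with how GitHub's contents endpoint is documented to behave for files larger than 1 MB: the `content` field comes back empty and `encoding` is `none`. A minimal sketch of one possible workaround, assuming that is the cause here, is to fall back to the Git blob API via PyGithub's `get_git_blob`. The helper name `read_file_bytes` is hypothetical, and `path` is assumed to point at a single file, not a directory.

```python
import base64


def read_file_bytes(repo, path):
    """Return the raw bytes of one file in a PyGithub-style repository object.

    Hypothetical helper: tries the contents endpoint first and falls back to
    the Git blob endpoint when no inline base64 content was returned.
    """
    item = repo.get_contents(path)
    if item.encoding == "base64" and item.content:
        # Small files: the base64 payload is returned inline.
        return base64.b64decode(item.content)
    # Assumption: empty content with encoding "none" means the file was too
    # large for inline delivery, so fetch the blob by its SHA instead.
    blob = repo.get_git_blob(item.sha)
    if blob.encoding != "base64":
        raise ValueError(f"unexpected blob encoding: {blob.encoding}")
    return base64.b64decode(blob.content)
```

With the `repo` object from the snippet above, `read_file_bytes(repo, "20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf")` would then return the PDF bytes, which can be written to the other storage or passed to OCR.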
Yulia V
  • @KJ thanks. If you could write it as an answer, happy to tick and upvote. – Yulia V Jul 01 '23 at 11:05
  • @KJ I can take it from here, plenty of info on how to get the raw files from github. So more than happy to tick and upvote. – Yulia V Jul 01 '23 at 11:21

2 Answers

This PDF file does not contain any text or fonts. What looks like text is just ordinary PDF filled shapes.

So, you have no choice but to rasterize and OCR.

In this particular example, it has nothing to do with the language or "encoding" in use.

johnwhitington
  • I need a programmatic solution that would work for any pdf – Yulia V Jun 30 '23 at 18:46
  • I have mentioned the encoding because for all pdfs that fail to be read the value of encoding is `none`. – Yulia V Jun 30 '23 at 18:47
  • There is no such thing as a PDF's "encoding". The only solution which works for any PDF is to rasterize and OCR, as I have said. – johnwhitington Jun 30 '23 at 21:04
  • When pygithub reads files, it returns an object that contains the field called encoding. I have updated the question, hope it's clearer now. – Yulia V Jul 01 '23 at 09:35

To understand your problem, you need to understand PDF in a bit more detail.

You see, PDF is not a WYSIWYG (what you see is what you get) format. If we were to look at an .html document, you'd recognize the text of the page, and you'd be able to derive other information such as:

  • These characters belong together in a paragraph
  • These paragraphs make up a column in a row, in a table
  • etc

PDF is more like a programming language. Inside a PDF you'll find a special kind of data structure (called a stream) that represents the contents of a page.

Each of these content streams is essentially a (usually compressed) piece of text containing drawing instructions, written in an operator language derived from PostScript (a programming language).

In pseudo-code, you might find things like:

  • go to position 40, 450
  • set the stroke color to black
  • set the font to Helvetica, size 12
  • render the character with ID 12
  • go to position 45, 450
  • etc

Now imagine that instead of using a font, your instructions were something like:

  • go to position 40, 450
  • set the stroke color to black
  • stroke the following path: .... (which happens to render 'H')
  • etc

Because there is no font and no character ID (abbreviated as CID), there is no way of knowing what the underlying text is. The only thing any reader or parser software would see is "this page contains some vector graphics".
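
To make the difference concrete, here is a rough sketch that checks whether an already decompressed content stream contains the text-showing operators `Tj` or `TJ`. The two sample streams and the function name `shows_text` are made up for illustration, and the check is only a heuristic: it ignores the less common `'` and `"` operators and can be fooled by a string literal that happens to contain "Tj".

```python
import re

# Tj and TJ are the most common text-showing operators in PDF content streams.
# Require that they are not part of a longer alphabetic token.
_TEXT_SHOW = re.compile(r"(?<![A-Za-z])(?:Tj|TJ)(?![A-Za-z])")


def shows_text(content_stream: str) -> bool:
    """Heuristic: True if the decoded content stream appears to show text."""
    return bool(_TEXT_SHOW.search(content_stream))


# Illustrative stream that selects a font and shows the string "H".
WITH_FONT = "BT /F1 12 Tf 40 450 Td (H) Tj ET"

# Illustrative stream that strokes three line segments shaped like an "H".
AS_PATHS = "0 0 0 RG 40 450 m 40 462 l 40 456 m 46 456 l 46 450 m 46 462 l S"

print(shows_text(WITH_FONT))
print(shows_text(AS_PATHS))
```

The first stream carries a font and a string, so text extraction has something to work with; the second only carries path geometry, which is the situation described above.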

In effect, you have images, not text.

The best way forward would be to convert your entire PDF to images (perhaps using a tool such as Ghostscript) and then apply OCR to the resulting images.
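
A minimal sketch of that pipeline, assuming the `gs` (Ghostscript) and `tesseract` command-line tools are installed and on the PATH. Tesseract is my own choice of OCR engine here, not something required by the approach, and the function names, the `page-%03d.png` naming, and the 300 dpi default are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript call that renders each page to a numbered PNG."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",              # 24-bit colour PNG output device
        f"-r{dpi}",                     # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # one file per page
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract call that prints the recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, out_dir, lang="eng", dpi=300):
    """Rasterize a PDF with Ghostscript, then OCR every page image."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir, dpi), check=True)
    pages = []
    for image in sorted(Path(out_dir).glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True, capture_output=True, text=True,
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

For a Spanish document, something like `ocr_pdf("input.pdf", "pages", lang="spa")` would be the call, provided the corresponding Tesseract language data is installed.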

Joris Schellekens
  • Thank you for the detailed response, but, unfortunately, it does not help in my context. I have updated the question, hope it's clearer now. – Yulia V Jul 01 '23 at 09:35