Update: solved, see the comments below.
When I try to read the contents of some PDF files, I get an empty string. I have noticed that this happens for PDF files whose encoding is reported as none, and it works fine for PDF files that are identified as base64. The other suspect is the size of the file; perhaps PyGithub fails to read big files. Obviously, without reading the file I cannot apply OCR.
This happens when I read entire directories on GitHub and copy them to another cloud storage, so the problem is not tied to any one PDF file in particular.
The alternative to PyGithub is to call the REST API through the requests package; I will try it later.
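A minimal sketch of that REST alternative, assuming the documented "raw" media type of the contents endpoint, which per GitHub's documentation returns the file bytes directly instead of a JSON body with a base64 string. The function names build_raw_request and download_raw, and the owner/repo/path/token parameters, are hypothetical and not part of the original post.

```python
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def build_raw_request(owner, repo, path, token):
    """Return (url, headers) for fetching a file's raw bytes via the contents endpoint."""
    # quote() keeps "/" intact but escapes spaces and other special characters.
    url = f"{API_ROOT}/repos/{owner}/{repo}/contents/{quote(path)}"
    headers = {
        # Ask for the raw file body rather than the JSON/base64 representation.
        "Accept": "application/vnd.github.raw+json",
        "Authorization": f"Bearer {token}",
    }
    return url, headers


def download_raw(owner, repo, path, token):
    """Download the file and return its content as bytes (performs a network call)."""
    # Imported here so build_raw_request() has no third-party dependency.
    import requests

    url, headers = build_raw_request(owner, repo, path, token)
    response = requests.get(url, headers=headers, timeout=60)
    response.raise_for_status()
    return response.content
```

The returned bytes could then be written to the other storage or passed to OCR without any base64 decoding step.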
The PDF file I used is this one, and the same happens with other PDF files that use languages with special characters.
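from github import Github

# token and repo_name are defined elsewhere (access token and repository name)
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
# Fetch the file's metadata and content through the contents API
cont_raw = repo.get_contents("20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf")
# Compare the reported file size with the length of the returned content
print(cont_raw.size, len(cont_raw.content), cont_raw.encoding)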
# output: 1283429 0 none
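The output above is consistent with the size suspicion: GitHub's documentation for the contents endpoint describes files larger than 1 MB as coming back with an empty content field and encoding "none", and this file is about 1.28 MB. One possible workaround, sketched below under that assumption, is to fall back to the Git blob for the same SHA using PyGithub's get_git_blob. The helper name read_file_bytes is hypothetical, and the sketch assumes the path points at a single file rather than a directory.

```python
import base64


def read_file_bytes(repo, path):
    """Return a repository file as bytes, falling back to the Git blob when needed."""
    item = repo.get_contents(path)
    # Small files: the contents API already carries the base64 payload.
    if item.encoding == "base64" and item.content:
        return base64.b64decode(item.content)
    # Larger files (encoding "none", empty content): fetch the blob by SHA instead.
    blob = repo.get_git_blob(item.sha)
    if blob.encoding == "base64":
        return base64.b64decode(blob.content)
    # Blobs may also be returned as plain UTF-8 text.
    return blob.content.encode("utf-8")
```

With the repo object from the snippet above, usage would look like pdf_bytes = read_file_bytes(repo, "20200910-BETA8-ROTULACION-INTERIOR-BOCETO-final.pdf"), after which the bytes can be saved or handed to OCR.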