Detect language/script from pdf with python

Question

I am trying to write a Python script that detects the language(s)/script(s) in a PDF that has not been OCRed yet, using pytesseract, so that I can pass the detected language(s) to the 'real' OCR run afterwards.

I have about 10,000 PDFs, not always in standard English and sometimes 1,000 pages long. To do the real OCR I need to auto-detect the language first.

So it is a sort of two-step OCR, if you will, and Tesseract can perform both steps:

Detecting the language/script on some pages around the center of the document
Performing the real OCR over all pages with the detected language/script

Any tips to fix or improve this script? All I want is for the language(s) detected on the given pages to be returned.

#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
    print(pytesseract.image_to_osd(im, lang='osd',config='psm=0 pandas_config=None', nice  =0, timeout=0))

I run the script as follows:

script_detect.py myunknown.pdf

I currently get the following error:

TypeError: Unsupported image object
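
The TypeError is raised because pytesseract accepts PIL images, NumPy arrays, or file paths, while the script passes a wand Image. One possible way around it, sketched here under assumptions, is to render the sampled pages with PyMuPDF and hand them to pytesseract as PIL images. The helper names (middle_pages, parse_script, detect_scripts) and the 300 dpi default are illustrative and not part of either library, and the sketch assumes a recent PyMuPDF (page_count, get_pixmap(dpi=...)) together with Pillow.

```python
import io
import re


def middle_pages(page_count, surround=2):
    """Zero-based indices of the center page and `surround` pages on each side."""
    center = page_count // 2
    return list(range(max(0, center - surround),
                      min(page_count, center + surround + 1)))


def parse_script(osd_text):
    """Extract (script, confidence) from Tesseract's OSD report, or None."""
    script = re.search(r"Script: (\S+)", osd_text)
    conf = re.search(r"Script confidence: ([\d.]+)", osd_text)
    if not script or not conf:
        return None
    return script.group(1), float(conf.group(1))


def detect_scripts(pdf_path, surround=2, dpi=300):
    """Run Tesseract OSD on the middle pages; return (page, script, confidence) tuples."""
    # Imported here so the two helpers above work without the OCR stack installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    results = []
    with fitz.open(pdf_path) as doc:
        for index in middle_pages(doc.page_count, surround):
            # Render the page to a bitmap and wrap it as a PIL image,
            # which is a type pytesseract accepts.
            pix = doc[index].get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            try:
                osd = pytesseract.image_to_osd(image)
            except pytesseract.TesseractError:
                continue  # OSD can fail, e.g. on pages with too little text
            parsed = parse_script(osd)
            if parsed:
                results.append((index, *parsed))
    return results
```

Calling detect_scripts('myunknown.pdf') would return one (page index, script, confidence) tuple for each sampled page on which OSD succeeded.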

score 0 · Answer 1 · answered Oct 17 '20 at 11:54

Assuming that you have already converted your PDF file to text using some tool (OCR or other), you can use langdetect. Sample your text and feed it to detect:

from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)

Output: fr

or

from langdetect import detect
lang = detect("我是法国人")
print(lang)

Output: zh-cn

There are other libraries, such as polyglot, that are useful if you have mixed languages.

answered Oct 17 '20 at 11:54

Serge de Gosson de Varennes

No, this is before any text has been OCRed. Imagine I have a huge PDF (500 pages). I would like to take the middle page or pages and run them through Tesseract's script detection. I got the idea from here: https://github.com/jbarlow83/OCRmyPDF/issues/39#issuecomment-415963801 – Bastiaan Wakkie Oct 17 '20 at 12:44
I see. But if you are already using tesseract, why not OCR the document? Even the GitHub issue you are referring to suggests that. I think you are complicating things or not using the right tool for the task. If the point is to detect a language in an image (or to view the PDF as such), then tesseract is not the appropriate tool. Applying a deep learning model would be more suitable. – Serge de Gosson de Varennes Oct 17 '20 at 14:52
To run the OCR with tesseract I need to pass it a language, and that is the unknown. Many of the documents I have (10,000) are not in standard English. That is why I need to detect the language first, before sending them to the real OCR. – Bastiaan Wakkie Oct 17 '20 at 15:09
Ah! Now I get your dilemma. I'll give it some thought. – Serge de Gosson de Varennes Oct 17 '20 at 15:12
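
For the second step discussed in these comments, the per-page detection results still have to be reduced to a single value for Tesseract's language option. A minimal sketch of one way to do that is a confidence-weighted vote over the sampled pages. The function names and the SCRIPT_TO_LANG table are assumptions made for illustration: the right-hand values must match traineddata files that are actually installed, and the script names on the left are examples of what OSD may report.

```python
from collections import defaultdict

# Illustrative mapping from OSD script names to Tesseract language strings.
# This table is an assumption: adjust it to the traineddata files you have.
SCRIPT_TO_LANG = {
    "Latin": "script/Latin",
    "Cyrillic": "script/Cyrillic",
    "Arabic": "script/Arabic",
    "Han": "chi_sim+chi_tra",
}


def pick_script(results):
    """Return the script with the highest summed confidence, or None if empty.

    `results` is an iterable of (script, confidence) pairs, one per sampled page.
    """
    totals = defaultdict(float)
    for script, confidence in results:
        totals[script] += confidence
    if not totals:
        return None
    return max(totals, key=totals.get)


def lang_for_script(script, default="eng"):
    """Map a detected script to a language string, falling back to `default`."""
    return SCRIPT_TO_LANG.get(script, default)
```

The returned string could then be passed as the lang argument of pytesseract.image_to_string, or to whichever tool performs the full OCR run.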
