How to detect language or script from an input image using Python or Tesseract OCR?

Question

Given an input image containing text in any language or writing system, how do I detect which script the text in the picture uses?

Any Python-based or Tesseract-OCR based solution would be appreciated.

Note that "script" here means a writing system such as Latin, Cyrillic, or Devanagari, as used by languages such as English, Russian, and Hindi respectively.

score 2 · Accepted Answer · answered Dec 02 '21 at 11:54

Pre-requisites:

Install Tesseract: sudo apt install tesseract-ocr tesseract-ocr-all
Install pytesseract: pip install pytesseract

Script-Detection:

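import re

import pytesseract

def detect_image_lang(img_path):
    """Return (script_name, confidence), or (None, 0.0) if detection fails."""
    try:
        # OSD = orientation and script detection; returns plain "Key: value" text
        osd = pytesseract.image_to_osd(img_path)
        script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
        conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
        return script, float(conf)
    except Exception:
        # Tesseract could not run OSD (e.g. too little text) or the output did not match
        return None, 0.0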

script_name, confidence = detect_image_lang("image.png")
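
To make it clearer what the two regexes above are matching, here is a minimal sketch that parses OSD-style text into a dict. The sample string only illustrates the "Key: value" layout of the OSD output (the numbers are made up), and parse_osd is a hypothetical helper, not part of pytesseract.

```python
def parse_osd(osd_text):
    """Parse 'Key: value' lines of OSD-style text into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'Key: value' pairs
            fields[key.strip()] = value.strip()
    return fields


# Illustrative sample only; the values are made up.
SAMPLE_OSD = (
    "Page number: 0\n"
    "Orientation in degrees: 0\n"
    "Rotate: 0\n"
    "Orientation confidence: 15.33\n"
    "Script: Latin\n"
    "Script confidence: 2.00\n"
)

fields = parse_osd(SAMPLE_OSD)
script = fields.get("Script")
confidence = float(fields.get("Script confidence", 0.0))
print(script, confidence)
```

A low script confidence usually deserves a sanity check before you rely on the result.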

Language-Detection:

After performing OCR (using Tesseract), pass the recognized text through the langdetect library (or any other language-identification library).
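
A rough sketch of that two-step flow, assuming langdetect is installed (pip install langdetect). The script-to-language mapping and the helper names pick_ocr_lang and detect_text_language are illustrative assumptions, not part of either library; the three entries are just the examples from the question.

```python
# Illustrative mapping from detected script to a Tesseract language code.
# This is an assumption for the sketch; extend it for the scripts you expect.
SCRIPT_TO_TESS_LANG = {"Latin": "eng", "Cyrillic": "rus", "Devanagari": "hin"}


def pick_ocr_lang(script, default="eng"):
    """Choose the Tesseract language model to run OCR with."""
    return SCRIPT_TO_TESS_LANG.get(script, default)


def detect_text_language(img_path, script=None):
    """OCR the image, then guess the language of the recognized text."""
    # Imported lazily so the mapping above is usable without these packages.
    import pytesseract
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    text = pytesseract.image_to_string(img_path, lang=pick_ocr_lang(script))
    try:
        candidates = detect_langs(text)
    except LangDetectException:
        # Raised when the text is empty or has no usable features
        return None, 0.0
    if not candidates:
        return None, 0.0
    best = candidates[0]  # candidates are sorted by probability, highest first
    return best.lang, best.prob
```

Here script would be the first value returned by detect_image_lang. For a script shared by many languages (Latin, for instance), you can pass several models at once, e.g. lang="eng+fra+deu", at the cost of slower OCR.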

[Check here for list of all scripts & languages supported by Tesseract OCR](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). — Gokul NC, Dec 02 '21 at 12:37
