
Given an input image containing text in any language or writing system, how do I detect which script the text uses?

Any Python-based or Tesseract OCR-based solution would be appreciated.


Note that "script" here means a writing system such as Latin, Cyrillic, or Devanagari, as used by languages such as English, Russian, and Hindi, respectively.

Munib
Gokul NC

1 Answer


Prerequisites:

  • Install Tesseract (Debian/Ubuntu): sudo apt install tesseract-ocr tesseract-ocr-all
  • Install pytesseract: pip install pytesseract

Script-Detection:

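import re

import pytesseract

def detect_image_lang(img_path):
    """Return (script_name, confidence) for the image, or (None, 0.0) on failure."""
    try:
        # OSD (orientation and script detection) returns a plain-text report.
        osd = pytesseract.image_to_osd(img_path)
        script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
        conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
        return script, float(conf)
    except Exception:
        # Tesseract failed (e.g. too little text) or the report had an unexpected format.
        return None, 0.0

script_name, confidence = detect_image_lang("image.png")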
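
As a variation on the function above, the OSD report can be parsed into a dictionary in one pass, which also exposes the orientation fields. This is only a minimal sketch: `parse_osd` and `detect_script` are hypothetical helper names, and the parser assumes the "Key: value" line layout that Tesseract's OSD mode prints.

```python
import re


def parse_osd(report):
    """Parse a Tesseract OSD text report into a dict of field name -> value string."""
    fields = {}
    for line in report.splitlines():
        # Each useful line is assumed to look like "Script confidence: 1.11".
        match = re.match(r"\s*([^:]+):\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def detect_script(img_path):
    """Return (script, confidence), or (None, 0.0) if OSD fails or reports no script."""
    import pytesseract  # imported lazily so parse_osd works without Tesseract

    try:
        fields = parse_osd(pytesseract.image_to_osd(img_path))
        return fields["Script"], float(fields["Script confidence"])
    except (pytesseract.TesseractError, KeyError, ValueError):
        # A missing Tesseract binary still raises, so setup problems stay visible.
        return None, 0.0
```

Catching only the expected exceptions, instead of everything, means a missing Tesseract installation is not silently reported as "no script found". Usage would be `script_name, confidence = detect_script("image.png")`.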

Language-Detection:

After performing OCR (using Tesseract), pass the recognized text through the langdetect library (or any other language-identification library).
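
The two steps above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: `clean_ocr_text` and `detect_image_language` are hypothetical helper names, `ocr_lang="eng"` is a placeholder for whichever Tesseract model matches the detected script, and langdetect is assumed to be installed (pip install langdetect).

```python
def clean_ocr_text(text):
    """Collapse whitespace, including any form feed Tesseract appends to its output."""
    return " ".join(text.split())


def detect_image_language(img_path, ocr_lang="eng"):
    """OCR the image, then return (language_code, probability) or (None, 0.0)."""
    # Imported lazily so the text helper above works without these packages.
    import pytesseract
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is randomized; fix the seed for repeatable results

    text = clean_ocr_text(pytesseract.image_to_string(img_path, lang=ocr_lang))
    if not any(ch.isalpha() for ch in text):
        # Nothing that looks like a word was recognized.
        return None, 0.0

    try:
        best = detect_langs(text)[0]  # candidates are sorted by descending probability
    except LangDetectException:
        # Raised when langdetect finds no usable features in the text.
        return None, 0.0
    return best.lang, best.prob
```

Usage would be `lang_code, prob = detect_image_language("image.png", ocr_lang="rus")`. Language detection tends to be unreliable on very short strings, so a low probability is best treated as "unknown".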

Gokul NC
  • [Check here for list of all scripts & languages supported by Tesseract OCR](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). – Gokul NC Dec 02 '21 at 12:37