I am trying to detect the language of a PDF document and categorize it. The main problem is that the document is a scanned PDF, so it is just page images. There is no font or Unicode information to go on.
So Apache Tika doesn't help much here.
I tried using Tesseract to convert the document from PDF to text and then passing the extracted text to a Google service, which works fine. But there are three problems:
Tesseract can only convert high-quality images.
It handles languages similar to English, such as Spanish and French, but fails for Japanese, Chinese, etc.
The document text is confidential, so all processing should be done internally.
Now I am looking for a standalone language detection component that works on scanned PDF documents.
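
To make the "standalone" requirement concrete: once some text does come out of OCR, at least the script can be checked locally with nothing but the Python standard library. The following is only a rough sketch, not a full language detector. The names `SCRIPT_PREFIXES`, `script_histogram`, and `guess_language_group` are hypothetical, and the kana-versus-Han rule is a heuristic I am assuming, not something provided by Tesseract or Tika. It separates scripts only, so it cannot tell Spanish from French, and it does nothing about the OCR quality problem itself.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping: first word of a character's Unicode name -> coarse script label.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "ARABIC": "Arabic",
    "HEBREW": "Hebrew",
    "DEVANAGARI": "Devanagari",
    "HANGUL": "Hangul",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "CJK": "Han",
}


def script_histogram(text):
    """Count alphabetic characters per coarse script label."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0]
        counts[SCRIPT_PREFIXES.get(prefix, "Other")] += 1
    return counts


def guess_language_group(text):
    """Return a coarse guess: 'Japanese', 'Chinese', 'Korean', a script label, or 'unknown'."""
    counts = script_histogram(text)
    if not counts:
        return "unknown"
    total = sum(counts.values())
    han_kana = counts["Han"] + counts["Kana"]
    if han_kana * 2 > total:
        # Assumed heuristic: Japanese mixes kana with Han characters; Chinese has no kana.
        return "Japanese" if counts["Kana"] > 0 else "Chinese"
    dominant = counts.most_common(1)[0][0]
    return "Korean" if dominant == "Hangul" else dominant


if __name__ == "__main__":
    for sample in ["これは日本語の文書です", "这是一份中文文件", "Este documento está en español"]:
        print(sample, "->", guess_language_group(sample))
```

Everything here runs offline, so no document text leaves the machine. A proper language identifier would still be needed to distinguish languages that share a script.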