
I am trying to detect the language of a PDF document and categorize it accordingly. The major problem I face is that the document is a scanned PDF, so there are no font or Unicode clues to work with.

So Apache Tika doesn't help much here.

I tried using Tesseract to convert the PDF to text and then passing the extracted text to a Google language-detection service, which works fine. But there are three problems:

  • Tesseract can only convert high-quality images reliably.

  • It handles Latin-script languages such as Spanish and French, but fails on Japanese, Chinese, etc.

  • The document text is confidential, so all processing must be done in-house; nothing can be sent to an external service.

Now I am searching for a standalone language-detection component that works on scanned PDF documents.
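
For context, here is a minimal sketch of the kind of fully local pipeline I have in mind, in Python. It assumes the pdf2image, pytesseract, and langdetect packages are installed, plus the Tesseract trained-data files for the relevant scripts (e.g. jpn, chi_sim); the file name and language list are placeholders:

```python
# Minimal sketch of a fully local pipeline (no external services).
# Assumes: pdf2image, pytesseract, langdetect installed, and Tesseract
# trained data for the scripts of interest (e.g. jpn, chi_sim).
from collections import Counter

from pdf2image import convert_from_path  # renders each PDF page to an image
import pytesseract                       # Python wrapper around Tesseract
from langdetect import detect            # offline, pure-Python language detector

def dominant_language(pdf_path, ocr_langs="eng+jpn+chi_sim"):
    """OCR every page locally and return the most frequently detected language."""
    pages = convert_from_path(pdf_path, dpi=300)  # higher DPI helps OCR accuracy
    votes = Counter()
    for page in pages:
        # ocr_langs tells Tesseract which trained models to apply;
        # adjust it to the scripts you expect in your documents.
        text = pytesseract.image_to_string(page, lang=ocr_langs)
        if text.strip():
            votes[detect(text)] += 1  # one vote per page
    return votes.most_common(1)[0][0] if votes else None

print(dominant_language("scanned.pdf"))  # e.g. 'ja' for a mostly Japanese scan
```

Voting per page rather than detecting once over the whole extracted text also gives a rough way to pick the dominant language when a document mixes languages.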

  • Does that document have the same content in different languages or different content across different languages? – karthick Mar 26 '13 at 11:55
  • Do you have any knowledge of the language or content before processing? – mkl Mar 26 '13 at 11:58
  • It is a hybrid document. Sometimes the first part of the document is in one language and the second part is an exact translation of the first. That case is simple to handle. But in other cases the document starts in one language and gets amended by people writing in different languages. They just combine all these different-language snippets into a single PDF. I am having a hard time with these kinds of hybrid documents. – karthikselva Mar 26 '13 at 12:00
  • @karthikselvakumar: If it's hybrid, then what should your output be? Because your question is about language detection. What do you need to do exactly? – karthick Mar 26 '13 at 12:04
  • The dominant language across the document. If I can detect that Japanese or Chinese is present in the document, that is sufficient. – karthikselva Mar 26 '13 at 12:08
