How to configure Tesseract language for TikaEntityProcessor in Solr

Question

I have a Solr core, and I use TikaEntityProcessor in my DataImportHandler.
I have Tesseract installed, and Tika can extract text from images, but the default OCR language is English.

Here is the Tika part of my data-import-handler.xml file:

<entity processor="TikaEntityProcessor" dataSource="fileDataSource"
        name="file_content"
        url="${item.FilePath}"
        format="text" transformer="TemplateTransformer"
        onError="skip">
    <field column="text" name="content" />
    <field column="title" meta="true" name="title" />
    <field column="subject" meta="true" name="subject" />
    <field column="description" meta="true" name="description" />
    <field column="Author" meta="true" name="author" />
    <field column="category" meta="true" name="category" />
    <field column="content_type" meta="true" name="content_type" />
    <field column="last_modified" meta="true" name="last_modified" />
</entity>

I also have tur.traineddata and rus.traineddata in Tesseract's tessdata folder, and I want to use Turkish as the default OCR language. How can I do that?
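
For context, here is a minimal, untested sketch of the kind of configuration being asked about. It assumes the Tika version bundled with Solr is recent enough to accept parser parameters in a Tika config file; older Tika 1.x releases may instead read the language from a TesseractOCRConfig.properties file on the classpath. The file name tika-config.xml is only illustrative, and `tur` is the Tesseract code matching tur.traineddata.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical tika-config.xml; not verified against this Solr/Tika setup. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but exclude the OCR parser so it can be
         redefined below with its own parameters. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <!-- Assumption: this Tika version supports the "language" parameter here. -->
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">tur</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If this applies, the entity above would presumably point at the file through TikaEntityProcessor's tikaConfig attribute, for example tikaConfig="tika-config.xml", with a relative path expected to resolve against the core's conf directory. Tesseract itself accepts combined codes such as tur+rus, but whether that passes through Tika unchanged here is also an assumption.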

0 Answers