
I have data in two languages, English and Korean. I have already indexed the English data and now need to index the Korean data. I did some research and found that there is built-in support for several languages, but I can't find Korean listed explicitly the way other languages (e.g. German, French) are. I'm stuck on how to do this for Korean.

I tried using the CJK tokenizer on a field (say field1) whose type in the schema is text_general: I made a copy and set its type to text_general_cjk, but I got this error: invalid unknown_field_type fieldname text_general_cjk

Below is my schema. I need to update only asr_hypothesis, nlg_output, and nlu_utterance. A file can contain data in either of the two languages, so the schema should be able to detect the specific language and index accordingly.

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="default-config" version="1.6">

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- docValues are enabled by default for long type so we don't need to index the version field  -->
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="sid" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="model_id" type="strings" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="language_code" type="strings" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="country_code" type="strings" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="client_datetime" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="bixby_version" type="strings" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="resource_flag" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="command_mode_04" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="command_mode_08" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="utterance_type" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="output_method" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="audio_length" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="asr_hypothesis" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="asr_silence" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="agent" type="strings" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="command_name" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="screen_states" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="rule_id" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="is_root" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="app_list" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="execute_app" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="event_1010_rule_id" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="is_complete_generation_time" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="is_complete" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="landing_type" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlg_output" type="text_general" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="thumbs_result" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="close_type" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="event_22" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="chatbot_resp_id" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlu_utterance" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="nlu_matched_domain" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="nlu_display_text" type="text_general" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlg_display_text" type="text_general" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="dc_agent" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlu_bixby_state_ids" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="user_type" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="rule_chooser_result" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="fe_client_time" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="command_type" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="completeness" type="text_general" indexed="true" stored="true" multiValued="false" default=" "/>
<field name="fr_om" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="event_28" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="event_29" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="event_31" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="event_32" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="event_33" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlu_open_qa_session_id" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlu_is_open_qa_session" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlu_viv_capsule" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="nlu_viv_goal" type="strings" indexed="false" stored="true" multiValued="false" default=" "/>
<field name="yyyymmdd" type="strings" indexed="true" stored="true" multiValued="false" default=" "/>   
James Z
  • please share your schema – Mysterion Dec 11 '17 at 14:13
  • Please read [Under what circumstances may I add “urgent” or other similar phrases to my question, in order to obtain faster answers?](//meta.stackoverflow.com/q/326569) - the summary is that this is not an ideal way to address volunteers, and is probably counterproductive to obtaining answers. Please refrain from adding this to your questions. – halfer Dec 11 '17 at 15:05

1 Answer


Simply appending cjk to the field type name will not make it magically start working.

You need to define a fieldType named text_general_cjk in the schema. Below is a very simple example, which you should expand to suit your needs:

    <fieldType name="text_general_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

This simply uses the ICUTokenizer, which is suitable for CJK languages as well. You can add more filters depending on your needs; the available filters are listed in the Solr documentation (take a look at the CJK-specific ones as well).
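As a point of comparison, here is a rough sketch of a bigram-based type that relies only on analysis components bundled with Solr, modeled on the text_cjk type found in Solr's sample configsets. The type name text_ko_bigram is hypothetical, and the filter chain is an assumption about what might suit Korean text, not a tuned configuration:

```xml
<!-- Hypothetical type name; the chain mirrors the sample text_cjk type. -->
<fieldType name="text_ko_bigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits text into words; no extra jars required. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalizes full-width and half-width character forms. -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Forms overlapping bigrams from CJK tokens, Hangul included. -->
    <filter class="solr.CJKBigramFilterFactory" hangul="true"/>
  </analyzer>
</fieldType>
```

Bigrams tend to trade index size for recall, so it is worth comparing both types on a sample of real Korean queries before settling on one.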

After this, you could add the field:

    <field name="text_cjk" type="text_general_cjk" indexed="true" stored="false"/>
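Applied to the schema in the question, this might look like the sketch below. The _ko field names are hypothetical, and the sketch assumes the indexing client fills them (instead of the English fields) when language_code marks a document as Korean; the schema itself does not detect the language:

```xml
<!-- Hypothetical Korean counterparts of the three fields named in the question. -->
<field name="asr_hypothesis_ko" type="text_general_cjk" indexed="true" stored="true" multiValued="false"/>
<field name="nlg_output_ko" type="text_general_cjk" indexed="true" stored="true" multiValued="false"/>
<field name="nlu_utterance_ko" type="text_general_cjk" indexed="true" stored="true" multiValued="false"/>
```

Keeping a separate field per language lets each one have its own analyzer, at the cost of queries having to name the right field for the language being searched.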

Only after this will you be able to index documents with this field. Do not forget that you need to restart Solr and reindex after making these changes to the schema.

Since the ICU analysis components are not part of the default Solr libraries, you need to load them in solrconfig.xml by adding the lucene-analyzers-icu jar.
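A minimal sketch of what that could look like in solrconfig.xml, assuming a stock Solr 7.x installation where the ICU jars ship under contrib/analysis-extras (the paths are an assumption and may differ in your setup):

```xml
<!-- Assumed layout of a stock Solr 7.x install; adjust paths as needed. -->
<!-- icu4j, the ICU library itself -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
<!-- lucene-analyzers-icu, which provides ICUTokenizerFactory -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
```

In cloud mode the edited solrconfig.xml and schema live in ZooKeeper, so the changed configset has to be uploaded there and the collection reloaded, and the jars need to be present on every node.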

Mysterion
  • Thanks for the information, but do I need to change something in my solrconfig.xml as well? – Shalav Saket Dec 11 '17 at 16:21
  • test_shard1_replica_n2: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core test_shard1_replica_n2: Can't load schema managed-schema: Plugin init failure for [schema.xml] fieldType "text_general_cjk": Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.ICUTokenizerFactory' test_shard2_replica_n2: – Shalav Saket Dec 11 '17 at 16:23
  • I am new to Solr and I am running it in cloud mode, so can you please elaborate a little on the ICU tokenizer? How can I use it, and where do I have to put the jar files? – Shalav Saket Dec 11 '17 at 16:47