I recently encountered a problem and would love to hear any thoughts on this subject.

Precondition:

  • Lucene Implementation Version: 2.9.1
  • Solr: 1.4
  • Java 6
  • Large and heavy index in store :)

Main idea: Change JDK version from 1.6 to 1.8.

So, does this change require re-indexing the existing index or not?

The first thing I found was the JRE_VERSION_MIGRATION document. But it only describes one known problem, caused by the Unicode version change in the Java 1.4 to Java 5 transition. I haven't found any other known issues with Unicode versions in different JDK versions that would require a full re-index of an existing Lucene index.

Also, does anybody know of issues related to the different Unicode versions in JDK 1.6 and JDK 1.7 (1.8)?

Thanks!

P.S. Additionally, here is the list of all tokenizers and filters in use:

  • WhitespaceTokenizerFactory
  • WordDelimiterFilterFactory
  • LowerCaseFilterFactory
  • SnowballPorterFilterFactory
  • RemoveDuplicatesTokenFilterFactory
  • ElisionFilterFactory
  • CJKTokenizerFactory
  • ThaiWordFilterFactory
  • ChineseSentenceTokenizerFactory
  • ChineseWordTokenFilterFactory

1 Answer

I doubt you will need to re-index. Unicode 6.1 added these characters that might be "seen" by the CJK analyzer:

CJK Compatibility Ideographs {F900..FAFF} : 2 characters (U+FA2E and U+FA2F)
CJK Unified Ideographs {4E00..9FFF} : 1 character (U+9FCC = Adobe-Japan1-6 CID+20156, a variant of U+6DBC 涼) 

Other changes will not even theoretically affect these analyzers.

Unicode 6.2 was even simpler: it added just one new character:

U+20BA  TURKISH LIRA SIGN    
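If you want to check this on your own machines, here is a minimal sketch that prints how the running JVM classifies the four code points listed above. The class name `CodePointProbe` and the output format are made up for illustration; only standard `java.lang.Character` methods are used, and the code avoids anything newer than Java 5 so it should compile on both JDKs. Run it under each JDK and compare the output.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Solr.
final class CodePointProbe {
    // The code points named in this answer.
    static final int[] CODE_POINTS = {0xFA2E, 0xFA2F, 0x9FCC, 0x20BA};

    // Describes how the running JVM classifies one code point.
    static String describe(int cp) {
        return String.format(Locale.ROOT, "U+%04X defined=%b letter=%b type=%d",
                cp,
                Character.isDefined(cp),   // false if unassigned in this JVM's Unicode data
                Character.isLetter(cp),
                Character.getType(cp));    // general category constant, e.g. Character.OTHER_LETTER
    }

    public static void main(String[] args) {
        System.out.println("java.version=" + System.getProperty("java.version"));
        for (int cp : CODE_POINTS) {
            System.out.println(describe(cp));
        }
    }
}
```

Any line that differs between the two runs is a code point the two JVMs disagree on; whether that matters depends on whether such characters occur in your documents or queries.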

I see no need to re-index. The tokenizers above rely on the Character.isLetter() method, which was not affected by the changes above. I seriously doubt any of the characters listed in those changes were in the index to begin with.
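For a broader check than a handful of code points, a rough sketch like the following dumps every range that `Character.isLetter()` accepts, so the output from the old and new JDK can be diffed. The class name `LetterRanges` and the range format are my own invention; I am also assuming `isLetter` is the predicate that matters, so swap in `Character.isWhitespace` or another predicate if a tokenizer of yours uses something else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Solr.
final class LetterRanges {
    // Returns maximal runs of code points in [from, to] that Character.isLetter accepts.
    static List<String> letterRanges(int from, int to) {
        List<String> ranges = new ArrayList<String>();
        int start = -1; // start of the current run, or -1 when outside a run
        for (int cp = from; cp <= to; cp++) {
            boolean letter = Character.isLetter(cp);
            if (letter && start < 0) {
                start = cp;                        // a run begins
            } else if (!letter && start >= 0) {
                ranges.add(format(start, cp - 1)); // a run ends just before cp
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(format(start, to));         // run reaches the end of the interval
        }
        return ranges;
    }

    static String format(int first, int last) {
        return String.format(Locale.ROOT, "%04X..%04X", first, last);
    }

    public static void main(String[] args) {
        System.out.println("java.version=" + System.getProperty("java.version"));
        for (String range : letterRanges(0, Character.MAX_CODE_POINT)) {
            System.out.println(range);
        }
    }
}
```

Redirect the output to a file under each JDK and diff the two files. Ranges that appear only in the newer run are code points the old JVM did not treat as letters, which is only a concern if your indexed text actually contains them.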

Alex Pakka