I am able to get results using the following code, but they are not sorted correctly: lowercase entries are displayed first and then the uppercase ones.

Results I am getting:

upper
test
UPPER
Test

Expected results:

 upper
 UPPER
 Test
 test 

The pattern can be either way, e.g. uppercase (T) first and lowercase (t) after that.

The following code is for reference:

Prada - Entity Class:

@Entity
@Table(name = "Prada")
@XmlRootElement
@Indexed
@AnalyzerDef(name="customanalyzer", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), 
    filters = { 
        @TokenFilterDef(factory=ISOLatin1AccentFilterFactory.class),
        @TokenFilterDef(factory=LowerCaseFilterFactory.class)})
public class Prada implements Serializable {
 private static final long serialVersionUID = 1L;
@Id
@Basic(optional = false)
@Column(name = "ID")
private Long id;

@Fields({ @Field(index = Index.YES, store = Store.NO), @Field(name = "PradaName_for_sort", index = Index.YES, analyzer = @Analyzer(definition = "customanalyzer")) })
@Column(name = "NAME", length = 100)
private String name;

public Prada () {
}

public Prada (Long id) {
    this.id = id;
}

public Long getId() {
    return id;
}

public void setId(Long id) {
    this.id = id;
}



public String getName() {
    return name;
}

public void setName(String name) {
    this.name = name;
}


@Override
public String toString() {
    return "com.Prac.Prada[ id=" + id + " ]";
}

}

I found this @AnalyzerDef solution somewhere, but it didn't work for me. Could anyone provide a solution?

Main Code:

  FullTextEntityManager ftem = Search.getFullTextEntityManager(factory.createEntityManager());
  QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder().forEntity( Prada.class ).get();
  org.apache.lucene.search.Query query = qb.all().getQuery(); 
  FullTextQuery fullTextQuery = ftem.createFullTextQuery(query, Prada.class);
  fullTextQuery.setSort(new Sort(new SortField("PradaName_for_sort", SortField.STRING, true)));
  fullTextQuery.setFirstResult(0).setMaxResults(150);
  int size = fullTextQuery.getResultSize();
  List<Prada> result = fullTextQuery.getResultList();
  for (Prada user : result) {
    logger.info("Prada Name:" + user.getName());
  }

Following are the versions of Hibernate and Lucene (which I cannot change):

    <hibernate.version>4.2.8.Final</hibernate.version>
    <hibernate.search.version>4.3.0.Final</hibernate.search.version>

    <dependency>
        <groupId>org.hibernate</groupId>
        <artifactId>hibernate-entitymanager</artifactId>
        <version>4.2.8.Final</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>3.6.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers</artifactId>
        <version>3.6.2</version>
    </dependency>

UPDATED CODE:

@AnalyzerDef(name = "customanalyzer",
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
filters = {
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
        @Parameter(name = "pattern", value = "('-&\\.,\\(\\))"),
        @Parameter(name = "replacement", value = " "),
        @Parameter(name = "replace", value = "all")
    }),
    @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
        @Parameter(name = "pattern", value = "([^0-9\\p{L} ])"),
        @Parameter(name = "replacement", value = ""),
        @Parameter(name = "replace", value = "all")
    }),
    @TokenFilterDef(factory = TrimFilterFactory.class)
}
)
public class Prada implements Serializable {

@Fields({ @Field(index = Index.YES, store = Store.YES), @Field(name = "PradaName_for_sort", index = Index.YES, analyzer = @Analyzer(definition = "customanalyzer")) })
@Column(name = "NAME", length = 100)
private String name;

1 Answer

Never use a tokenizer that actually splits the value into tokens for a field used for sorting. You need to use a KeywordTokenizer to be sure the token is kept as is.

Here is the analyzer we used for sorting at my former company:

    @AnalyzerDef(name = "TEXT_SORT",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                    @Parameter(name = "pattern", value = "('-&\\.,\\(\\))"),
                    @Parameter(name = "replacement", value = " "),
                    @Parameter(name = "replace", value = "all")
                }),
                @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                    @Parameter(name = "pattern", value = "([^0-9\\p{L} ])"),
                    @Parameter(name = "replacement", value = ""),
                    @Parameter(name = "replace", value = "all")
                }),
                @TokenFilterDef(factory = TrimFilterFactory.class)
        }
    )

It's written for the latest version of Hibernate Search, so you need to adapt it. Obviously, you need to replace ASCIIFoldingFilterFactory with ISOLatin1AccentFilterFactory, but I'm not sure whether PatternReplaceFilterFactory already exists in 3.6.2.
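
For the Lucene 3.6.2 / Hibernate Search 4.3.0 stack in the question, a minimal adaptation might look like the sketch below (a sketch only, not a tested configuration: it swaps in ISOLatin1AccentFilterFactory as suggested and leaves out the two PatternReplaceFilterFactory steps, since their availability on 3.6.2 is uncertain; the analyzer name is the one already used in the question):

    @AnalyzerDef(name = "customanalyzer",
        // KeywordTokenizerFactory keeps the whole field value as a single token,
        // which is what a sort field needs
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
        filters = {
            // accent folding for the Lucene 3.x line
            @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
            // makes the ordering case-insensitive
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = TrimFilterFactory.class)
        })

The field mapping from the question (@Field(name = "PradaName_for_sort", analyzer = @Analyzer(definition = "customanalyzer"))) and the setSort call stay the same; only the analyzer definition changes.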

  • Thanks. How can I replace Norwegian characters (Æ, Ø, and Å) in the above example, e.g. Ø with O and Å with A? But replacing may not be a good approach, as after the replacement we still need to get the exact values back in the List? – fatherazrael Jul 20 '16 at 06:45
  • It's exactly what ASCIIFoldingFilterFactory (and ISOLatin1AccentFilterFactory in Lucene 3.6.2) does. See http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/3.6.0/org/apache/lucene/analysis/ISOLatin1AccentFilter.java#91 . – Guillaume Smet Jul 20 '16 at 06:50
  • But it is not working as expected: Ø is coming between A & B, and it should be between M & P. – fatherazrael Jul 20 '16 at 06:56
  • Can you share your updated code? (just the mapping part) Did you replace your StandardTokenizerFactory by KeywordTokenizerFactory as explained above? – Guillaume Smet Jul 20 '16 at 06:57
  • Shared in the description. I am using the same snippet given by you. It is displaying results correctly except for Norwegian characters. – fatherazrael Jul 20 '16 at 07:03
  • Did you reindex your data? (see the reindexing sketch after these comments) – Guillaume Smet Jul 20 '16 at 07:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/117777/discussion-between-fatherazrael-and-guillaume-smet). – fatherazrael Jul 20 '16 at 07:10
  • @Guillaume Smet The case-insensitivity issue is resolved, but I am still facing an issue with the Norwegian characters. I created another question for that; could you provide your suggestions? http://stackoverflow.com/questions/39264308/how-to-do-case-insensitive-sorting-of-norwegian-characters-%C3%86-%C3%98-and-%C3%85-using-h – fatherazrael Sep 01 '16 at 06:49
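
As raised in the comments above, after the @AnalyzerDef is changed the existing index still contains tokens produced by the old analyzer, so the data has to be reindexed before the new sort order shows up. A minimal sketch using the Hibernate Search 4.x MassIndexer (the factory variable is the EntityManagerFactory from the question's main code; error handling is omitted):

    FullTextEntityManager ftem = Search.getFullTextEntityManager(factory.createEntityManager());
    // rebuild the index for Prada so PradaName_for_sort is re-analyzed
    // (startAndWait() throws InterruptedException, so declare or catch it)
    ftem.createIndexer(Prada.class).startAndWait();

Alternatively, the index can be rebuilt by re-saving the entities, but the MassIndexer is the usual approach for a full rebuild.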