Hibernate search on prefixes

Question

Right now, I have successfully configured a basic Hibernate Search index to be able to search for full words on various fields of my JPA entity:

@Entity
@Indexed
class Talk {
    @Field String title
    @Field String summary
}

And my query looks something like this:

List<Talk> search(String text) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager)
    QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Talk).get()
    Query query = queryBuilder
            .keyword()
            .onFields("title", "summary")
            .matching(text)
            .createQuery()
    FullTextQuery jpaQuery = fullTextEntityManager.createFullTextQuery(query, Talk)
    return jpaQuery.getResultList()
}

Now I would like to fine-tune this setup so that when I search for "test" it still finds talks where title or summary contains "test" even as the prefix of another word. So talks titled "unit testing", or whose summary contains "testicle" should still appear in the search results, not just talks whose title or summary contains "test" as a full word.

I've looked at the documentation, but I can't figure out whether I should change the way my entity is indexed or whether it has something to do with the query. Note that I wanted to do something like the following, but then it's hard to search on several fields:

 Query query = queryBuilder
            .keyword().wildcard()
            .onField("title")
            .matching(text + "*")
            .createQuery()

EDIT: Based on Hardy's answer, I configured my entity like so:

@Indexed
@Entity
@AnalyzerDefs([
@AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = [
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class,
                    params = [
                        @Parameter(name = "minGramSize",value = "3"),
                        @Parameter(name = "maxGramSize",value = "3")
                    ])
        ])
])
class Talk {
    @Field(analyzer=@Analyzer(definition="ngram")) String title
    @Field(analyzer=@Analyzer(definition="ngram")) String summary
}

Thanks to that configuration, when I search for 'arti', I get Talks whose title or summary contains words that have 'arti' as a subword (artist, artisanal, etc.). Unfortunately, after those I also get Talks whose title or summary contains words that share only a subword of my search term (arts, fart, etc.). There's probably some fine-tuning to eliminate those, but at least I get results sooner now, and they are in a sensible order.

score 3 · Accepted Answer · answered Mar 22 '16 at 19:36

There are multiple things you can do here. A lot can be done via the proper analyzing during index time.

For example, you want to apply a stemmer appropriate for your language. For English this is generally the Snowball stemmer. The idea is that during indexing all words are reduced to their stem: "testing" and "tested" both become "test", for example. This gets you part of the way.

The other thing you can look into is n-gram indexing. According to your description, you want to find matches in unrelated words as well. The idea here is to index "subwords" of each word, so that they can later be found.
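To make the "subwords" idea concrete, here is a minimal plain-Java sketch of what an n-gram filter produces for a single word. It is only an illustration and not Lucene's implementation; the class name `NGramSketch` and the method `ngrams` are made up for this example, and the input is assumed to be lowercased already.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    /** Returns every substring of word whose length is between minGram and maxGram. */
    static List<String> ngrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            // Slide a window of the current size over the word.
            for (int start = 0; start + size <= word.length(); start++) {
                grams.add(word.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With minGram = maxGram = 3, "testing" is indexed as its trigrams.
        System.out.println(ngrams("testing", 3, 3)); // prints [tes, est, sti, tin, ing]
    }
}
```

Each of these grams becomes a term in the index, which is why a query term that equals one of them can find "testing".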

Regarding analyzers, you want to look at the named analyzers section of the Hibernate Search docs. The key here is the @AnalyzerDef annotation.

On the query side you can also apply some "tricks". You can indeed use wildcard queries; however, with the Hibernate Search query DSL a plain keyword query will not do, and you need to use a wildcard query instead. Again, check the Hibernate Search docs.

I configured the Ngram analyzer and it works better. Unfortunately it seems to also load sub-words of the search term itself. So if I search for 'arti', it shows me results that contain 'arti' as a sub-word, and then results containing 'art' and 'rti'. So I get too many results now, but at least they are in a sensible order. — Sebastien, Mar 26 '16 at 22:27

score 1 · Answer 2 · edited Dec 12 '16 at 21:58

You should use an NGram or EdgeNGram filter for indexing, as you correctly noted in your edit. But you should use a different analyzer for your queries, as suggested in the Elasticsearch documentation (see search_analyzer): https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html

This way your search query wouldn't be tokenized to ngrams and your results would be more like %text% or text% in SQL.
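The difference can be sketched in plain Java, without Lucene or Hibernate Search. This is a simplified model, not how the real analyzers are implemented: the class and method names are made up for illustration, the trigram size matches the question's configuration, and the 1 to 10 gram range used on the index side for the whole-query case is an assumption chosen so that a four-letter query can equal an indexed gram.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class SearchAnalyzerSketch {
    /** All substrings of word whose length is between minGram and maxGram. */
    static Set<String> ngrams(String word, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < word.length(); start++) {
            for (int size = minGram; size <= maxGram && start + size <= word.length(); size++) {
                grams.add(word.substring(start, start + size));
            }
        }
        return grams;
    }

    /** Same analyzer on both sides: the query is split into trigrams too, so any shared trigram is a hit. */
    static boolean matchesWithNGramQuery(String indexedWord, String query) {
        Set<String> shared = ngrams(query, 3, 3);
        shared.retainAll(ngrams(indexedWord, 3, 3));
        return !shared.isEmpty();
    }

    /** Separate search analyzer: the query stays whole and must equal one of the indexed grams. */
    static boolean matchesWithWholeQuery(String indexedWord, String query) {
        return ngrams(indexedWord, 1, 10).contains(query);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"artist", "arts", "fart"}) {
            System.out.println(word + ": n-gram query=" + matchesWithNGramQuery(word, "arti")
                    + ", whole query=" + matchesWithWholeQuery(word, "arti"));
        }
    }
}
```

In this model "arts" and "fart" match 'arti' only when the query itself is cut into trigrams (they share "art"), while "artist" matches either way. Swapping the index-side n-grams for edge n-grams (prefixes only) would give the text% behaviour instead of %text%.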

Unfortunately, for unknown reasons, Hibernate Search currently doesn't support specifying a search_analyzer on fields. You can only specify an analyzer for indexing, which is then also used to analyze the search query.

I plan to implement this functionality myself.

EDIT:

You can specify a search-time analyzer (search_analyzer) like this:

List<Talk> search(String text) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager)
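    EntityContext entityContext = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Talk)

    // Override the analyzer used for this field when building queries.
    // "myField" and "myNamedAnalyzerDef" are placeholders for a real field and @AnalyzerDef name.
    entityContext.overridesForField("myField", "myNamedAnalyzerDef")

    QueryBuilder queryBuilder = entityContext.get()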
    Query query = queryBuilder
            .keyword()
            .onFields("title", "summary")
            .matching(text)
            .createQuery()
    FullTextQuery jpaQuery = fullTextEntityManager.createFullTextQuery(query, Talk)
    return jpaQuery.getResultList()
}

I have used this technique to effectively simulate the search_analyzer property.

K.Nicholas · Answer 3 · 2016-03-24T02:06:17.130

In Lucene version 4.9 I used the EnglishAnalyzer for this. I think it is an English-only implementation of the SnowballAnalyzer, but I'm not 100% certain. I used it for both creating and searching the indexes. Nothing special is needed to use it.

Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_4_9);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);

and

analyzer = new EnglishAnalyzer(Version.LUCENE_4_9);
parser = new StandardQueryParser(analyzer);

You can see it in action at Guided Code Search. This runs exclusively off Lucene.

Lucene can be integrated into Hibernate searches, but I haven't yet tried to do that myself. It seems like it would be powerful, but I don't know. See Apache Lucene™ Integration.

I've also read that Lucene can be patched into SQL engines, but I haven't tried that either. Example: Indexing Databases with Lucene.
