Creating and using LuceneAnalysisDefinitionProvider with Hibernate Search

Question

When you search Stack Overflow or the Internet for LuceneAnalysisDefinitionProvider, you'll find hundreds of pages, each of them having the same code copied from another page without any decent explanation or further examples of usage.

So I tried to do it by myself and failed. Here is my code:

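import org.apache.lucene.analysis.charfilter.MappingCharFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionProvider;
import org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionRegistryBuilder;

public class CustomLuceneAnalysisDefinitionProvider
        implements LuceneAnalysisDefinitionProvider {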

  @Override
  public void register(final LuceneAnalysisDefinitionRegistryBuilder builder) {
    builder
      .analyzer("customAnalyzer")
        .tokenizer(StandardTokenizerFactory.class)
        .charFilter(MappingCharFilterFactory.class)
          .param("mapping",
            "org/hibernate/search/test/analyzer/mapping-chars.properties")
        .tokenFilter(ASCIIFoldingFilterFactory.class)
        .tokenFilter(LowerCaseFilterFactory.class)
        .tokenFilter(StopFilterFactory.class)
          // WRONG! It's not "mapping"!
//        .param("mapping",
//          "org/hibernate/search/test/analyzer/stoplist.properties")
          .param("words",
            "classpath:/stoplist.properties")
          .param("ignoreCase", "true");
  }

}

Now we have CustomLuceneAnalysisDefinitionProvider, but what's next?

Where do I put mapping-chars.properties, and how do I address it when adding it as a parameter to MappingCharFilterFactory?
What are the contents of mapping-chars.properties, and how do I create my own or modify an existing one?
Where do I put stoplist.properties, and how do I address it when adding it as a parameter to StopFilterFactory?
How do I add the previously defined customAnalyzer to the single @Field shown below?

@Field(
    index = Index.YES,
    analyze = Analyze.YES,
    store = Store.NO,
    bridge = @FieldBridge(impl = LocalizedFieldBridge.class)
)
private LocalizedField description;

On some pages I found the option to put this setting into application.properties:

hibernate.search.lucene.analysis_definition_provider = com.thevegcat.app.search.CustomAnalysisDefinitionProvider

But I don't want to replace the original analyzer; I just want to use a custom analyzer for a few specific properties.

EDIT#1

Looking into org.apache.lucene.analysis.core.StopFilterFactory line 86, one can notice it takes words as a key, not mapping.

EDIT#2

If you put your stop words file in src/main/resources, then you have to address it like this:

.param("words", "classpath:/stoplist.properties")

score 1 · Answer 1 · answered Jan 06 '23 at 14:41

you'll find hundreds of pages, each of them having the same code copied from another page without any decent explanation or further examples of usage.

Hibernate Search 5 had its problems, one of which was lack of documentation in some areas. Now that it's in maintenance mode, those problems are unlikely to get addressed.

There is some documentation for that feature in the Hibernate Search 5 documentation: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#section-programmatic-analyzer-definition

You'll get better documentation of that feature by migrating to Hibernate Search 6+.

That being said, most of your questions relate to Lucene features, so you probably won't find answers in Hibernate Search's documentation. You could find them in Lucene's documentation. How to find such documentation is explained in the Hibernate Search 6 documentation:

To know more about the behavior of these character filters, tokenizers and token filters, either browse the Lucene Javadoc or read the corresponding section on the Solr Wiki (you don’t need Solr to use these analyzers, it’s just that there is no documentation page for Lucene proper).

Where do I put mapping-chars.properties, and how do I address it when adding it as a parameter to MappingCharFilterFactory?

In your classpath.

What are the contents of mapping-chars.properties, and how do I create my own or modify an existing one?

That's the kind of thing that Lucene doesn't document, at least not clearly. Solr's documentation is better: https://solr.apache.org/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.MappingCharFilterFactory
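
To make the file format concrete, here is a minimal plain-Java sketch of a mapping file whose rules have the form "source" => "target", one rule per line. The class name MappingRulesSketch and the sample rules are hypothetical, and the parsing only mimics the format; it is not Lucene's own code and ignores details such as escape sequences. Skipping blank lines and lines starting with # is an assumption about how comments are treated.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: mimics the "source" => "target" rule format only.
final class MappingRulesSketch {

    // One rule per line: a quoted source, "=>", then a quoted target.
    private static final Pattern RULE =
            Pattern.compile("^\\s*\"(.*)\"\\s*=>\\s*\"(.*)\"\\s*$");

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines and '#' comment lines are ignored.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            Matcher m = RULE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Invalid mapping rule: " + line);
            }
            rules.put(m.group(1), m.group(2));
        }
        return rules;
    }

    public static void main(String[] args) {
        // Sample lines of a hypothetical mapping-chars.properties file.
        List<String> fileLines = List.of(
                "# replace a few character sequences before tokenizing",
                "\"ph\" => \"f\"",
                "\"&\" => \"and\"");
        System.out.println(parse(fileLines));
    }
}
```

Each line of the real file would look like the strings in fileLines once the Java escaping is removed, for example "ph" => "f".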

Where do I put stoplist.properties, and how do I address it when adding it as a parameter to StopFilterFactory?

Put it in the classpath, and pass the path to that file from the root of your classpath.
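
If a file does not seem to be picked up, one way to check that it really sits where you expect is to resolve it through the class loader, using the same path from the classpath root. The sketch below is a hypothetical diagnostic helper (ClasspathCheck is not part of Hibernate Search or Lucene), and it assumes resources are looked up through the thread context class loader, which may differ from what your container or Hibernate Search actually uses.

```java
import java.net.URL;

// Hypothetical diagnostic helper, not part of Hibernate Search or Lucene.
final class ClasspathCheck {

    // Resolves a path relative to the classpath root (no leading slash).
    static URL locate(String pathFromClasspathRoot) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(pathFromClasspathRoot);
    }

    static boolean isOnClasspath(String pathFromClasspathRoot) {
        return locate(pathFromClasspathRoot) != null;
    }

    public static void main(String[] args) {
        // File names taken from the question; adjust them to your own layout.
        String[] paths = {
            "stoplist.properties",
            "org/hibernate/search/test/analyzer/mapping-chars.properties"
        };
        for (String path : paths) {
            URL url = locate(path);
            System.out.println(path + " -> " + (url != null ? url : "NOT FOUND"));
        }
    }
}
```

In a standard Maven layout, a file saved as src/main/resources/stoplist.properties ends up at the classpath root, so it would be looked up as stoplist.properties here.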

How do I add the previously defined customAnalyzer to the single @Field shown below?

Well that is documented, at least: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_referencing_named_analyzers

@Field(analyzer = @Analyzer(definition = "customAnalyzer"))

On some pages I found the option to put this setting into application.properties:
hibernate.search.lucene.analysis_definition_provider = com.thevegcat.app.search.CustomAnalysisDefinitionProvider
But I don't want to replace the original analyzer; I just want to use a custom analyzer for a few specific properties.

You won't replace an "analyzer"; you will register an analysis definition provider, which adds analyzer definitions to Hibernate Search. Those definitions can then be referenced from @Field. Setting an analysis definition provider does not, in itself, change your mapping in any way.
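
The distinction between registering a definition and referencing it can be pictured with a toy model in plain Java. Nothing in this sketch is Hibernate Search API: AnalyzerRegistrySketch, its methods, and the stop list are all hypothetical, and the "analysis" is just lowercasing and whitespace splitting. It only illustrates that a registered name has no effect until a field asks for it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model only: none of these types belong to Hibernate Search.
final class AnalyzerRegistrySketch {

    private final Map<String, Function<String, List<String>>> definitions = new HashMap<>();

    // Default analysis, used by fields that name no definition.
    private static List<String> defaultAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Registering a definition only makes the name available.
    void register(String name, Function<String, List<String>> analyzer) {
        definitions.put(name, analyzer);
    }

    // A field opts in by naming a definition; null means "use the default".
    List<String> analyze(String definitionName, String text) {
        if (definitionName == null) {
            return defaultAnalyze(text);
        }
        Function<String, List<String>> analyzer = definitions.get(definitionName);
        if (analyzer == null) {
            throw new IllegalArgumentException("Unknown analyzer: " + definitionName);
        }
        return analyzer.apply(text);
    }

    static AnalyzerRegistrySketch withCustomAnalyzer() {
        Set<String> stopWords = Set.of("the", "of"); // hypothetical stop list
        AnalyzerRegistrySketch registry = new AnalyzerRegistrySketch();
        registry.register("customAnalyzer", text -> {
            List<String> kept = new ArrayList<>(defaultAnalyze(text));
            kept.removeIf(stopWords::contains);
            return kept;
        });
        return registry;
    }

    public static void main(String[] args) {
        AnalyzerRegistrySketch registry = withCustomAnalyzer();
        // A field without an explicit analyzer is unaffected by the registration.
        System.out.println(registry.analyze(null, "The Taste of Oats"));
        // Only a field that names "customAnalyzer" gets the stop words removed.
        System.out.println(registry.analyze("customAnalyzer", "The Taste of Oats"));
    }
}
```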

That was useful! But now there is another problem. The stop words file has a few lines, containing "a", "e", "i", "o", "u" and "milk" respectively. The product name is "Shhh Ths Is Not Milk". After an index rebuild, I can find my product with "Shhh" only. If I search for "milk", nothing is found, which confirms that my stop words file is applied. But when I search for "not" or "this", nothing is found either. Does that mean my stop words are merged with the existing stop words? — horvoje, Jan 06 '23 at 15:01
In case it helps: Lucene may not document a feature, but you may find a test class which shows how the feature is used. For example, for the mapping file used by `MappingCharFilterFactory`, there is [this example](https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/test/org/apache/lucene/analysis/custom/mapping1.txt), where each line has the format `"x" => "y"`. — andrewJames, Jan 06 '23 at 15:07
You can see how this file is used [here](https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/test/org/apache/lucene/analysis/custom/TestCustomAnalyzer.java#L521) - but that is pure-Lucene not Hibernate. I don't know if Hibernate 5 would handle this for you in the same way, behind the scenes, using `new HashMap<>` — andrewJames, Jan 06 '23 at 15:07
@horvoje I don't know. Depending on how you build your query, a different analyzer might be applied and remove `not` and `this` from your query. Especially if you don't use Hibernate Search's DSL but use Lucene's `QueryParser` instead. Try calling `toString()` on your query to check that. Otherwise, create a new question with a complete reproducer so that people can try and debug this themselves. — yrodiere, Jan 06 '23 at 15:30
