Short version:

Does anyone know whether something changed with EdgeNGramFilterFactory in Solr 5? It worked fine on Solr 4, but I just upgraded to Solr 5 and the cores whose fields use this filter refuse to load ...

Long story:

This configuration used to work in solr4.10 (schema.xml):

<field name="NAME" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="PP" type="text_prefix" indexed="true" stored="false" required="false" multiValued="false"/>

<copyField source="NAME" dest="PP"/>

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

And the documentation says I did it right (though it does not clearly state whether it covers Solr 4 or Solr 5).

However, when I am trying to add a collection using this configuration, it fails with the following message:

<lst name="failure">
<str>
   org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://localhost:8983/solr: Error CREATEing SolrCore 'test_collection': Unable to create core [test_collection] Caused by: Unknown parameters: {side=front}</str>
</lst>

I removed the side=front "unknown" parameter, started from scratch and it worked - meaning no more errors.

So it worked in Solr 4 without any additional change, but it no longer works in Solr 5. Did something change? Did I miss any documentation on this filter? Is there an extra library I need to load to make this work?

Finally, if this is the intended behavior (bug/feature/whatever), is there a workaround to get this "side-substring" indexing without having to generate the values myself when I add docs to Solr?

Update: with the "hacked" schema (i.e. without side=front), I indexed the documents and changed the PP field to be stored. When I searched, it looked like the entire value had been indexed. For example, for NAME:ELEPHANT, I found PP:ELEPHANT ...

cheffe
dcg
  • I don't have an answer, but a couple of things to note. First, I've largely given up on those old wiki pages, as they aren't well maintained. Instead, I use the confluence-based wiki docs, which are always very current (sometimes too current), or I download the doc from the Solr site for a specific Solr version. The [new wiki doc on the edge n-gram filter](https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter) doesn't even mention the `side` attribute, so they may have phased it out. – frances Mar 02 '15 at 20:24
  • Also, I'm not sure why searching `NAME:ELEPHANT` and finding `PP:ELEPHANT` is concerning. When you store a field, it doesn't store the result of the analysis, but the value that you originally put in the field. It may index the value several different ways, but you can only see them by executing searches or by using the analysis tool in the Solr web interface. The _real_ test is whether you can search for `PP:ELE` or `PP:ELEPHAN` and find the elephant document. – frances Mar 02 '15 at 20:25
  • It's not a concern. I was expecting the document to contain something like 'E', 'EL', 'ELE' for PP. Also, yes, later on I did search for 'PP:ELE' and found 'ELEPHANT', but I had no real explanation as to why. – dcg Mar 03 '15 at 09:21

1 Answer

The side attribute was removed as part of LUCENE-3907 in version 4.4. The filter now always behaves as if you had specified side="front". Since you are using it the "front" way, you can simply remove the attribute.
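As a sketch of what that looks like, here is the field type from the question with only the side attribute dropped. Everything else is copied from the question's schema.xml; the comments are added for illustration.

```xml
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Keep the whole field value as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- No side attribute: grams are taken from the front of the token -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <!-- The query is matched as-is against the indexed grams -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>
```

With minGramSize="2", the value ELEPHANT should be indexed as EL, ELE, ELEP, and so on up to ELEPHANT, which is why a search for PP:ELE finds the document. The Analysis screen in the Solr admin UI is a good way to confirm this on your own install.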

As you can read in the discussion on the linked Lucene issue:

If you need reverse n-grams, you could always add a filter to do that afterwards. There is no need to have this as separate logic in this filter. We should split logic and keep filters as simple as possible.

And this is what has been done. The side attribute has been removed from the filter.

This was done in Lucene, not directly in Solr. Since Lucene is a Java API, the change is mentioned in the Javadoc of the filter:

As of Lucene 4.4, this filter does not support EdgeNGramTokenFilter.Side.BACK (you can use ReverseStringFilter up-front and afterward to get the same behavior), handles supplementary characters correctly and does not update offsets anymore.

This may be why you do not find a word about it in the Solr documentation. The change is, however, also mentioned in Lucene's change log.
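If you do need the old side="back" behavior, the Javadoc's suggestion (reverse, n-gram, reverse again) could be sketched roughly as follows. The field type name text_suffix is a hypothetical name chosen for this example, and the gram sizes are simply carried over from the question; I have not tested this on a Solr 5 install, so treat it as a starting point.

```xml
<!-- "text_suffix" is a hypothetical name, not taken from the question -->
<fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- Reverse the token so that its end becomes its front -->
        <filter class="solr.ReverseStringFilterFactory"/>
        <!-- Front grams of the reversed token -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
        <!-- Reverse each gram back so it reads as a normal suffix -->
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>
```

For ELEPHANT this should produce NT, ANT, HANT, and so on up to ELEPHANT. To get both front and back grams, one option is a second field of this type fed by another copyField from NAME, alongside the existing PP field.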

cheffe
  • Yep, you're right. Most of the Solr docs I've read had a small note at the top saying they also cover "solr 5.something". So, going with the inertia, I took the Solr 5 coverage for granted. Lesson learned. Thanks! – dcg Mar 03 '15 at 09:18
  • @dcg, I was having the same issue and wanted to do both front and back n-grams, but I wasn't sure how to do that with ReverseStringFilter. What should my schema.xml look like for that? Thanks – 5_nd_5 Jun 13 '16 at 17:08
  • @5_nd_5 - I don't have the solr docs handy right now - but if I remember correctly, you can apply first a string-reverse and then apply the n-gram. – dcg Jun 16 '16 at 09:29