
I've got a problem with Lucene term vector offsets: when I analyze a field with my custom analyzer it produces invalid offsets for the term vector, but it works fine with the standard analyzer. Here is my analyzer code:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;

public class AttachmentNameAnalyzer extends Analyzer {
    private boolean stemmTokens;
    private String name;

    public AttachmentNameAnalyzer(boolean stemmTokens, String name) {
        super();
        this.stemmTokens = stemmTokens;
        this.name        = name;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
        return stream;
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();

        if (stream == null) {
            // first use on this thread: build the chain and cache it
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            setPreviousTokenStream(stream);
        } else if (stream instanceof Tokenizer) {
            // cached stream: point the tokenizer at the new reader
            ((Tokenizer) stream).reset(reader);
        }

        return stream;
    }
}
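
For reference, this is roughly how I look at the offsets coming out of the analyzer (a simplified sketch; the class name OffsetRepro, the field name and the sample strings are just for illustration, and I call reusableTokenStream once per value the way IndexWriter would):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class OffsetRepro {
    public static void main(String[] args) throws Exception {
        AttachmentNameAnalyzer analyzer = new AttachmentNameAnalyzer(true, "English");
        for (String value : new String[] { "foo_bar.txt", "baz_qux.txt" }) {
            TokenStream stream = analyzer.reusableTokenStream("attachmentName", new StringReader(value));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
            while (stream.incrementToken()) {
                // I expect offsets to restart at 0 for every value,
                // but with my analyzer the second value's offsets are wrong.
                System.out.println(term.term() + " [" + offsets.startOffset() + "," + offsets.endOffset() + ")");
            }
            stream.end();
        }
    }
}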

What's wrong with this? Help required.

Badr

2 Answers


Which version of Lucene are you using? I'm looking at the superclass code for the 3.x branch, and the behavior changes with each version.

You may want to check the code of public final boolean incrementToken(), where the offset is calculated.
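
To illustrate where that offset comes from, here is a stripped-down tokenizer in the spirit of CharTokenizer (just a sketch, not the real source: the real class buffers input, and SketchTokenizer is a made-up name). The important part is the running offset counter, which only goes back to zero in reset(Reader):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SketchTokenizer extends Tokenizer {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private int offset = 0; // grows with every char read, across the whole stream

    public SketchTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            offset++;
            if (Character.isLetterOrDigit(c)) {
                sb.append((char) c);
            } else if (sb.length() > 0) {
                break; // delimiter after a token: emit the token
            }
        }
        if (sb.length() == 0)
            return false;
        int end = (c == -1) ? offset : offset - 1; // don't count the delimiter
        termAtt.setTermBuffer(sb.toString());
        offsetAtt.setOffset(correctOffset(end - sb.length()), correctOffset(end));
        return true;
    }

    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        offset = 0; // if this is never called, offsets keep accumulating
    }
}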

I also see this:

/**
 * <p>
 * As of Lucene 3.1 the char based API ({@link #isTokenChar(char)} and
 * {@link #normalize(char)}) has been deprecated in favor of a Unicode 4.0
 * compatible int based API to support codepoints instead of UTF-16 code
 * units. Subclasses of {@link CharTokenizer} must not override the char based
 * methods if a {@link Version} >= 3.1 is passed to the constructor.
 * <p>
 * <p>
 * NOTE: This method will be marked <i>abstract</i> in Lucene 4.0.
 * </p>
 */

By the way, you can rewrite the switch statement like this:

@Override
protected boolean isTokenChar(int c) {
    switch(c)
    {
        case ',': case '.':
        case '-': case '_':
        case ' ':
            return false;
        default:
            return true;
    }
}
c00kiemon5ter

The problem is with the analyzer (I posted the analyzer code earlier): the token stream needs to be reset for every new piece of text that is to be tokenized.

@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenStream stream = (TokenStream) getPreviousTokenStream();

    if (stream == null) {
        stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
        setPreviousTokenStream(stream); // ---------------> problem was here
    } else if (stream instanceof Tokenizer) {
        ((Tokenizer) stream).reset(reader);
    }

    return stream;
}

Every time I set the previous token stream, the next text field that has to be tokenized separately starts at the end offset of the last token stream, which makes the term vector offsets wrong for the new stream. It now works fine like this:

@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenStream stream = (TokenStream) getPreviousTokenStream();

    if (stream == null) {
        stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
    } else if (stream instanceof Tokenizer) {
        ((Tokenizer) stream).reset(reader);
    }

    return stream;
}
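
Note that dropping setPreviousTokenStream also gives up the reuse. If you want to keep reusing streams, the usual trick in Lucene 3.x analyzers is to cache the tokenizer and the wrapped stream together, so the tokenizer itself can always be reset. A sketch of that pattern, as a drop-in replacement inside AttachmentNameAnalyzer:

private static final class SavedStreams {
    Tokenizer source;   // the AttachmentNameTokenizer
    TokenStream result; // source, possibly wrapped in a SnowballFilter
}

@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();

    if (streams == null) {
        streams = new SavedStreams();
        streams.source = new AttachmentNameTokenizer(reader);
        streams.result = stemmTokens ? new SnowballFilter(streams.source, name) : streams.source;
        setPreviousTokenStream(streams);
    } else {
        // resetting the Tokenizer itself restarts offsets at 0,
        // even when the cached stream is wrapped in a filter
        streams.source.reset(reader);
    }

    return streams.result;
}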
Badr