
I've got a problem with Lucene term vector offsets: when I analyze a field with my custom analyzer it produces invalid offsets for the term vector, but it works fine with the standard analyzer. Here is my analyzer code:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;

public class AttachmentNameAnalyzer extends Analyzer {
    private boolean stemmTokens;
    private String name;

    public AttachmentNameAnalyzer(boolean stemmTokens, String name) {
        super();
        this.stemmTokens = stemmTokens;
        this.name        = name;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
        return stream;
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        TokenStream stream = (TokenStream) getPreviousTokenStream();

        if (stream == null) {
            // first use on this thread: build the chain and cache it
            stream = new AttachmentNameTokenizer(reader);
            if (stemmTokens)
                stream = new SnowballFilter(stream, name);
            setPreviousTokenStream(stream);
        } else if (stream instanceof Tokenizer) {
            // cached stream: point the tokenizer at the new reader
            ((Tokenizer) stream).reset(reader);
        }

        return stream;
    }
}
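
For reference, this is roughly how I look at the offsets coming out of the analyzer (a simplified sketch; the class name OffsetRepro, the field name and the sample strings are just for illustration, and I call reusableTokenStream once per value the way IndexWriter would):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class OffsetRepro {
    public static void main(String[] args) throws Exception {
        AttachmentNameAnalyzer analyzer = new AttachmentNameAnalyzer(true, "English");
        for (String value : new String[] { "foo_bar.txt", "baz_qux.txt" }) {
            TokenStream stream = analyzer.reusableTokenStream("attachmentName", new StringReader(value));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
            while (stream.incrementToken()) {
                // I expect offsets to restart at 0 for every value,
                // but with my analyzer the second value's offsets are wrong.
                System.out.println(term.term() + " [" + offsets.startOffset() + "," + offsets.endOffset() + ")");
            }
            stream.end();
        }
    }
}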

What's wrong with this? Help required.

Badr

2 Answers


Which version of Lucene are you using? I'm looking at the superclass code for the 3.x branch, and the behavior changes with each version.

You may want to check the code of public final boolean incrementToken(), where the offset is calculated.
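
To illustrate where that offset comes from, here is a stripped-down tokenizer in the spirit of CharTokenizer (just a sketch, not the real source: the real class buffers input, and SketchTokenizer is a made-up name). The important part is the running offset counter, which only goes back to zero in reset(Reader):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SketchTokenizer extends Tokenizer {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private int offset = 0; // grows with every char read, across the whole stream

    public SketchTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            offset++;
            if (Character.isLetterOrDigit(c)) {
                sb.append((char) c);
            } else if (sb.length() > 0) {
                break; // delimiter after a token: emit the token
            }
        }
        if (sb.length() == 0)
            return false;
        int end = (c == -1) ? offset : offset - 1; // don't count the delimiter
        termAtt.setTermBuffer(sb.toString());
        offsetAtt.setOffset(correctOffset(end - sb.length()), correctOffset(end));
        return true;
    }

    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        offset = 0; // if this is never called, offsets keep accumulating
    }
}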

I also see this:

/**
 * <p>
 * As of Lucene 3.1 the char based API ({@link #isTokenChar(char)} and
 * {@link #normalize(char)}) has been deprecated in favor of a Unicode 4.0
 * compatible int based API to support codepoints instead of UTF-16 code
 * units. Subclasses of {@link CharTokenizer} must not override the char based
 * methods if a {@link Version} >= 3.1 is passed to the constructor.
 * <p>
 * <p>
 * NOTE: This method will be marked <i>abstract</i> in Lucene 4.0.
 * </p>
 */

By the way, you can rewrite the switch statement like this:

@Override
protected boolean isTokenChar(int c) {
    switch(c)
    {
        case ',': case '.':
        case '-': case '_':
        case ' ':
            return false;
        default:
            return true;
    }
}
c00kiemon5ter

The problem is with the analyzer (I posted the analyzer code earlier): the token stream needs to be reset for every new piece of text that is to be tokenized.

@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenStream stream = (TokenStream) getPreviousTokenStream();

    if (stream == null) {
        stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
        setPreviousTokenStream(stream); // ---------------> problem was here
    } else if (stream instanceof Tokenizer) {
        ((Tokenizer) stream).reset(reader);
    }

    return stream;
}

Every time I set the previous token stream, the next text field that has to be tokenized separately starts at the end offset of the last token stream, which makes the term vector offsets wrong for the new stream. It now works fine like this:

@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenStream stream = (TokenStream) getPreviousTokenStream();

    if (stream == null) {
        stream = new AttachmentNameTokenizer(reader);
        if (stemmTokens)
            stream = new SnowballFilter(stream, name);
    } else if (stream instanceof Tokenizer) {
        ((Tokenizer) stream).reset(reader);
    }

    return stream;
}
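
Note that dropping setPreviousTokenStream also gives up the reuse. If you want to keep reusing streams, the usual trick in Lucene 3.x analyzers is to cache the tokenizer and the wrapped stream together, so the tokenizer itself can always be reset. A sketch of that pattern, as a drop-in replacement inside AttachmentNameAnalyzer:

private static final class SavedStreams {
    Tokenizer source;   // the AttachmentNameTokenizer
    TokenStream result; // source, possibly wrapped in a SnowballFilter
}

@Override
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();

    if (streams == null) {
        streams = new SavedStreams();
        streams.source = new AttachmentNameTokenizer(reader);
        streams.result = stemmTokens ? new SnowballFilter(streams.source, name) : streams.source;
        setPreviousTokenStream(streams);
    } else {
        // resetting the Tokenizer itself restarts offsets at 0,
        // even when the cached stream is wrapped in a filter
        streams.source.reset(reader);
    }

    return streams.result;
}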
Badr