
I am using the Lucene 4.6.1 libraries. I am trying to add the phrase "hip hop" to my stopword exclusion list.

I can exclude it when it is written as one word ("hiphop"), but when it is written with a space in between ("hip hop") I cannot exclude it.

Below is my exclusion list logic:

public static final CharArraySet STOP_SET_STEM = new CharArraySet(LUCENE_VERSION, Arrays.asList(
    "hiphop", "hip hop"
), false);
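
To see what the analyzer actually produces, I print the tokens (a quick check along the lines below, using Lucene 4.x's tokenStream/CharTermAttribute API; the field name "body" is just a placeholder). "hip hop" always comes out as the two separate tokens "hip" and "hop", so the single set entry "hip hop" never gets a chance to match:

Analyzer analyzer = new CustomWordsAnalyzer();
TokenStream ts = analyzer.tokenStream("body", "hip hop");
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
  System.out.println(term.toString()); // prints "hip" and then "hop", never "hip hop"
}
ts.end();
ts.close();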

For more context, below is my custom analyzer logic:

public final class CustomWordsAnalyzer extends StopwordAnalyzerBase {
  private static final Version LUCENE_VERSION = Version.LUCENE_46;

  // Regex used to exclude non-alpha-numeric tokens
  private static final Pattern ALPHA_NUMERIC = Pattern.compile("^[a-z][a-z0-9_]+$");
  private static final Matcher MATCHER = ALPHA_NUMERIC.matcher("");

  public CustomWordsAnalyzer() {
    super(LUCENE_VERSION, ProTextWordLists.STOP_SET);
  }

  public CustomWordsAnalyzer(CharArraySet stopSet) {
    super(LUCENE_VERSION, stopSet);

  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(LUCENE_VERSION, reader);
    TokenStream result = new StandardFilter(LUCENE_VERSION, tokenizer);
    result = new LowerCaseFilter(LUCENE_VERSION, result);
    result = new ASCIIFoldingFilter(result);
    result = new AlphaNumericMaxLengthFilter(result);
    result = new StopFilter(LUCENE_VERSION, result, ProTextWordLists.STOP_SET);

    result = new PorterStemFilter(result);
    result = new StopFilter(LUCENE_VERSION, result, ProTextWordLists.STOP_SET_STEM);
    return new TokenStreamComponents(tokenizer, result);
  }

  /**
   * Matches alpha-numeric tokens between 3 and 28 chars long.
   */
  static class AlphaNumericMaxLengthFilter extends TokenFilter {
    private final CharTermAttribute termAtt;
    private final char[] output = new char[28];

    AlphaNumericMaxLengthFilter(TokenStream in) {
      super(in);
      termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public final boolean incrementToken() throws IOException {
      // return the next alpha-numeric token between 3 and 28 chars long
      while (input.incrementToken()) {
        int length = termAtt.length();
        if (length >= 3 && length <= 28) {
          char[] buf = termAtt.buffer();
          int at = 0;
          for (int c = 0; c < length; c++) {
            char ch = buf[c];
            if (ch != '\'') {
              output[at++] = ch;
            }
          }
          String term = new String(output, 0, at);
          MATCHER.reset(term);
          if (MATCHER.matches() && !term.startsWith("a0")) {
            termAtt.setEmpty();
            termAtt.append(term);
            return true;
          }
        }
      }
      return false;
    }
  }
}

1 Answer


It can't be done with the default Lucene implementation. The only way to do it is to create your own Analyzer or TokenStream (or both) that processes the data/query in the way you need, e.g. filters out phrases.
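
For example, one way to make a multi-word entry like "hip hop" matchable is to add a ShingleFilter (from lucene-analyzers-common, org.apache.lucene.analysis.shingle) to your chain, so that adjacent tokens are also emitted joined by a space; the StopFilter can then drop that combined token, since stopword entries are only ever compared against single tokens. A rough, untested sketch against your createComponents chain, reusing your names like ProTextWordLists.STOP_SET_STEM:

Tokenizer tokenizer = new StandardTokenizer(LUCENE_VERSION, reader);
TokenStream result = new StandardFilter(LUCENE_VERSION, tokenizer);
result = new LowerCaseFilter(LUCENE_VERSION, result);

// ShingleFilter joins neighbouring tokens with a space, so "hip" + "hop"
// also comes out as the extra token "hip hop".
ShingleFilter shingles = new ShingleFilter(result, 2, 2);
shingles.setOutputUnigrams(true); // keep the single-word tokens as well
result = shingles;

// Now the set entry "hip hop" can actually match a token.
result = new StopFilter(LUCENE_VERSION, result, ProTextWordLists.STOP_SET_STEM);

Your other filters are left out here for brevity; note that AlphaNumericMaxLengthFilter rejects tokens containing a space, so the shingle/stopword steps would have to run before it. Also, with setOutputUnigrams(true) the individual tokens "hip" and "hop" are still emitted; if you want those removed too you would have to add them to the stop set or handle them in a custom TokenFilter, which is the kind of custom TokenStream work I mean above.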

  • Yes, I have created my own analyzer but I am still unable to do so; I think I may be doing something wrong. – VP10 Jan 25 '15 at 22:51
  • Yes, that could be; please show your analyzer code - put it on pastebin or in a gist. – Mysterion Jan 25 '15 at 22:52
  • Thanks Mysterion! I just included my analyzer logic in the problem statement; any help would be appreciated. Are you able to see it? – VP10 Jan 25 '15 at 23:11