
There are a lot of examples that show how to use the StandardTokenizer like this:

TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_36, new StringReader(input));

But in newer Lucene versions this constructor is unavailable. The new constructor looks like this:

StandardTokenizer(AttributeFactory factory)

What is the role of this AttributeFactory, and how can I tokenize a String in newer versions of Lucene?

samy
  • Have you tried passing the singleton `AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY`? Or use the parameter-less constructor? – sisve Jun 24 '15 at 12:31
  • 1
    Hi Simon, thanks for the tips. It does work with this code: `Tokenizer tokenStream = new StandardTokenizer(); tokenStream.setReader(new StringReader(input));` You then get a TokenStream when you apply your filter. The TokenStream workflow has also changed: you now need to call reset() before you start consuming from the stream. – samy Jun 26 '15 at 07:22
  • Is this question answered then? – soulcheck Jun 27 '15 at 01:25

1 Answer


The AttributeFactory creates AttributeImpls which are sources for Attributes. Attributes govern the behavior of the TokenStream, which is the underlying mechanism used for reading/tracking the data stream for the StandardTokenizer.

Little has changed from 4.x to 5.x with respect to the AttributeFactory - in both versions, you can create a StandardTokenizer with an AttributeFactory if you'd like, or if you don't specify one, then AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY will ultimately end up being used.
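For illustration, here is a minimal sketch of the two ways to construct the tokenizer in 5.x, with and without an explicit factory (the class name `FactoryExample` is mine, and note that `AttributeFactory` lives in `org.apache.lucene.util` as of 5.x):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class FactoryExample {
    public static void main(String[] args) throws Exception {
        // Pass the factory explicitly...
        StandardTokenizer explicit =
                new StandardTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY);

        // ...or let the no-arg constructor supply a default factory internally
        StandardTokenizer implicit = new StandardTokenizer();

        // Either way, the input now comes in via setReader(), not the constructor
        implicit.setReader(new StringReader("some input"));

        implicit.close();
        explicit.close();
    }
}
```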

The big difference is that in 4.x you could also pass a Reader for the input stream as part of the constructor. This meant that you had to create a new StandardTokenizer for each input stream you wanted to process, which in turn had to re-initialize the attributes from the AttributeFactory.

I'm no Lucene dev, but my guess is that this restructuring encourages reuse of the attributes across the reading of multiple streams. If you take a look at the internals of TokenStream and the default AttributeFactory implementation, there is a LOT of reflection involved in creating and setting attributes. If I had to guess, the StandardTokenizer constructor that takes a Reader was removed to encourage reuse of the tokenizer and its attributes, because initialization of those attributes is relatively expensive.
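To make that reuse concrete, here is a sketch (assuming Lucene 5.x; the class and method names `TokenizerReuse`/`tokenize` are my own) of a single tokenizer instance processing several inputs. The key is the end()/close()/setReader()/reset() cycle between streams:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerReuse {
    // Tokenizes one input with a reused StandardTokenizer instance
    static List<String> tokenize(StandardTokenizer tokenizer, String input)
            throws IOException {
        List<String> terms = new ArrayList<>();
        tokenizer.setReader(new StringReader(input));
        tokenizer.reset(); // required before consuming
        CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            terms.add(attr.toString());
        }
        tokenizer.end();   // finalize offsets for this stream
        tokenizer.close(); // required before the next setReader() call
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // One tokenizer, many streams - no attribute re-initialization
        StandardTokenizer tokenizer = new StandardTokenizer();
        System.out.println(tokenize(tokenizer, "Hello Lucene world"));
        System.out.println(tokenize(tokenizer, "Reuse the same tokenizer"));
    }
}
```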

EDIT

Adding a long-overdue example - sorry for not leading with this:

// Define your attribute factory (or use the default) - same between 4.x and 5.x
AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

// Create the tokenizer and prepare it for reading
//  Lucene 4.x
StandardTokenizer tokenizer = 
        new StandardTokenizer(factory, new StringReader("Tokenize me!"));
tokenizer.reset();
//  Lucene 5.x
StandardTokenizer tokenizer = new StandardTokenizer(factory);
tokenizer.setReader(new StringReader("Tokenize me!"));
tokenizer.reset();

// Then process tokens - same between 4.x and 5.x
// NOTE: Here I'm adding a single expected attribute to handle string tokens,
//  but you would probably want to do something more meaningful/elegant
CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
    // Grab the term
    String term = attr.toString();

    // Do something crazy...
}
tokenizer.end();   // Finalize the stream state
tokenizer.close(); // Release resources (required before reusing via setReader)
rusnyder
  • I'd like to offer my bounty to @RustyBuckets, how do I do that? – Filip Jul 02 '15 at 15:21
  • @Filip I guess it's already too late, sorry, I only saw the answer today. RustyBuckets, thanks for the explanation; maybe you can also add a simple code example of how to use the new StandardTokenizer, I think it might help some people. I currently use it like this: `Tokenizer source = new StandardTokenizer(); source.setReader(new StringReader(mytext));` – samy Jul 02 '15 at 16:40
  • Yes plz, a snippet of code as an example would be really helpful – higuaro Nov 18 '15 at 00:45