
There are a lot of examples that show how to use the StandardTokenizer like this:

TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_36, new StringReader(input));

But in newer Lucene versions this constructor is unavailable. The new constructor looks like this:

StandardTokenizer(AttributeFactory factory)

What is the role of this AttributeFactory, and how can I tokenize a String in newer versions of Lucene?

samy
  • Have you tried passing the singleton `AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY`? Or use the parameter-less constructor? – sisve Jun 24 '15 at 12:31
  • 1
    Hi Simon, thanks for the tips. It does work with this code: `Tokenizer tokenStream = new StandardTokenizer(); tokenStream.setReader(new StringReader(input));` You then get a TokenStream when you apply your filter. The TokenStream workflow has also changed: you now need to call reset() before you start consuming from the stream. – samy Jun 26 '15 at 07:22
  • Is this question answered then? – soulcheck Jun 27 '15 at 01:25

1 Answer


The AttributeFactory creates AttributeImpls which are sources for Attributes. Attributes govern the behavior of the TokenStream, which is the underlying mechanism used for reading/tracking the data stream for the StandardTokenizer.

Little has changed from 4.x to 5.x with respect to the AttributeFactory - in both versions, you can create a StandardTokenizer with an AttributeFactory if you'd like, or if you don't specify one, then AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY will ultimately end up being used.
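For illustration, here is a minimal sketch of the two ways to construct the tokenizer in 5.x, with and without an explicit factory (the class name `FactoryExample` is mine, and note that `AttributeFactory` lives in `org.apache.lucene.util` as of 5.x):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class FactoryExample {
    public static void main(String[] args) throws Exception {
        // Pass the factory explicitly...
        StandardTokenizer explicit =
                new StandardTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY);

        // ...or let the no-arg constructor supply a default factory internally
        StandardTokenizer implicit = new StandardTokenizer();

        // Either way, the input now comes in via setReader(), not the constructor
        implicit.setReader(new StringReader("some input"));

        implicit.close();
        explicit.close();
    }
}
```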

The big difference is that in 4.x you could also pass a Reader for the input stream as part of the constructor. This meant that you had to create a new StandardTokenizer for each input stream you wanted to process, which in turn had to re-initialize the attributes from the AttributeFactory.

I'm no Lucene dev, but my guess is that this restructuring encourages reuse of the attributes across the reading of multiple streams. If you take a look at the internals of TokenStream and the default AttributeFactory implementation, there is a LOT of reflection involved in creating and setting attributes. If I had to guess, the StandardTokenizer constructor that takes a Reader was removed to encourage reuse of the tokenizer and its attributes, because initialization of those attributes is relatively expensive.
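To make that reuse concrete, here is a sketch (assuming Lucene 5.x; the class and method names `TokenizerReuse`/`tokenize` are my own) of a single tokenizer instance processing several inputs. The key is the end()/close()/setReader()/reset() cycle between streams:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerReuse {
    // Tokenizes one input with a reused StandardTokenizer instance
    static List<String> tokenize(StandardTokenizer tokenizer, String input)
            throws IOException {
        List<String> terms = new ArrayList<>();
        tokenizer.setReader(new StringReader(input));
        tokenizer.reset(); // required before consuming
        CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
        while (tokenizer.incrementToken()) {
            terms.add(attr.toString());
        }
        tokenizer.end();   // finalize offsets for this stream
        tokenizer.close(); // required before the next setReader() call
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // One tokenizer, many streams - no attribute re-initialization
        StandardTokenizer tokenizer = new StandardTokenizer();
        System.out.println(tokenize(tokenizer, "Hello Lucene world"));
        System.out.println(tokenize(tokenizer, "Reuse the same tokenizer"));
    }
}
```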

EDIT

Adding a long-overdue example - sorry for not leading with this:

// Define your attribute factory (or use the default) - same between 4.x and 5.x
AttributeFactory factory = AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

// Create the tokenizer and prepare it for reading
//  Lucene 4.x
StandardTokenizer tokenizer = 
        new StandardTokenizer(factory, new StringReader("Tokenize me!"));
tokenizer.reset();
//  Lucene 5.x
StandardTokenizer tokenizer = new StandardTokenizer(factory);
tokenizer.setReader(new StringReader("Tokenize me!"));
tokenizer.reset();

// Then process tokens - same between 4.x and 5.x
// NOTE: Here I'm adding a single expected attribute to handle string tokens,
//  but you would probably want to do something more meaningful/elegant
CharTermAttribute attr = tokenizer.addAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
    // Grab the term
    String term = attr.toString();

    // Do something crazy...
}
tokenizer.end();   // Finalize the stream state
tokenizer.close(); // Release resources (required before reusing via setReader)
rusnyder
  • I'd like to offer my bounty to @RustyBuckets, how do I do that? – Filip Jul 02 '15 at 15:21
  • @Filip I guess it's already too late, sorry, I only saw the answer today. RustyBuckets, thanks for the explanation; maybe you can also add a simple code example of how to use the new StandardTokenizer, I think it might help some people. I currently use it like this: `Tokenizer source = new StandardTokenizer(); source.setReader(new StringReader(mytext));` – samy Jul 02 '15 at 16:40
  • Yes plz, a snippet of code as an example would be really helpful – higuaro Nov 18 '15 at 00:45