I have a simple string `7/f`, and the standard tokenizer ignores the `/` and generates the terms `7` and `f`. I want `7/f` kept as a single token. I would like to build on `StandardTokenizer`, but modifying that code is more complex than I expected.
- The easy answer is to use `KeywordAnalyzer`, but it will create a single token for a field value like "7/f + 8/f". If you want "7/f" and "8/f" as separate tokens, then you may need to roll your own tokenizer. – groverboy Dec 04 '13 at 15:53
- [`WhitespaceTokenizer`](http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html) should do nicely in that case. `StandardTokenizer` is great at handling *full text*, but if it doesn't tokenize your data's format well, there is no reason to stick with it. – femtoRgon Dec 04 '13 at 17:13
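To illustrate the difference the comments describe, here is a library-free sketch that mimics the two behaviors with plain string splitting (the class and method names are for illustration only, not Lucene API): a `StandardTokenizer`-style split on non-alphanumeric characters breaks `7/f` apart, while a whitespace-only split keeps it whole.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenizerContrastDemo {
    // Roughly mimics StandardTokenizer for this kind of input:
    // split on any run of non-alphanumeric characters,
    // so "7/f" becomes "7" and "f".
    static List<String> standardLike(String text) {
        return Arrays.stream(text.split("[^\\p{Alnum}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Mimics WhitespaceTokenizer: split on whitespace only,
    // so "7/f" survives as a single token.
    static List<String> whitespaceLike(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "7/f INT STREET ROSE ROAD";
        System.out.println(standardLike(input));   // [7, f, INT, STREET, ROSE, ROAD]
        System.out.println(whitespaceLike(input)); // [7/f, INT, STREET, ROSE, ROAD]
    }
}
```

Note that the whitespace split also preserves `7^F`, so the two addresses in the question would no longer produce identical token streams.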
- I tried using both, but I am not getting the desired result. My intention here is to identify all the elements (word boundaries) in the string I am indexing. For a string like `7/f INT STREET ROSE ROAD`, `StandardAnalyzer` generates the tokens `7 f INT STREET ROSE ROAD`; the special character is missing. A similar string that differs only in the separator, `7^F INT STREET ROSE ROAD`, yields the same token stream. When I search, both are returned with the same score, as they both contain `7` and `F`. – Saurabh Sharma Dec 04 '13 at 17:50
- I tried it:

  ```java
  TokenStream ts = new KeywordAnalyzer(Version.LUCENE_45, new StringReader(_field_));
  // or alternatively:
  TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_45, new StringReader(_read_value_));
  FieldType fldType1 = new FieldType();
  fldType1.setIndexed(true);
  fldType1.setStoreTermVectors(true);
  fldType1.setStored(true);
  Field ttf = new Field(INDEXED_FIELD_1, ts, fldType1);
  document.add(ttf);
  ```

  But I am not getting the right tokens. Moreover, I wanted to keep the basic behavior of `StandardTokenizer` and build on top of it. – Saurabh Sharma Dec 04 '13 at 17:59
- [This question](http://stackoverflow.com/questions/15016473/how-to-customize-lucene-net-to-search-for-words-with-symbols-without-case-sensit/15022547#15022547) might help you. – groverboy Dec 05 '13 at 01:59
- Thanks @groverboy. I believe I got it wrong; I need to define my own analyzer and tokenizer to make sure the right tokens are generated. I will try using `WhitespaceTokenizer` for the field I need the term vectors for. – Saurabh Sharma Dec 05 '13 at 06:21
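Rolling your own tokenizer, as suggested above, comes down to the character-by-character loop that Lucene's `CharTokenizer` subclasses run internally. A minimal plain-Java sketch of that loop follows; it is library-free, and the class and method names are hypothetical, not Lucene API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SimpleWhitespaceTokenizer {
    // Reads the input one character at a time and emits a token
    // whenever a run of non-whitespace characters ends -- the same
    // kind of loop a Lucene CharTokenizer subclass performs.
    static List<String> tokenize(Reader reader) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }

    // Convenience overload for in-memory strings.
    static List<String> tokenize(String text) {
        try {
            return tokenize(new StringReader(text));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with StringReader
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("7/f INT STREET ROSE ROAD"));
        // [7/f, INT, STREET, ROSE, ROAD]
    }
}
```

In real Lucene code you would get the same result by extending `CharTokenizer` and deciding, per character, whether it belongs to a token; deciding that only whitespace is a separator is exactly what `WhitespaceTokenizer` does.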
- Thanks @groverboy! I ended up writing my own code and solved the problem! – Saurabh Sharma May 22 '15 at 07:34