A Lucene ShingleFilter can be used to tokenize a string into shingles, or n-grams, of different sizes, e.g.:
"please divide this sentence into shingles"
becomes the shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
Does anyone know if this can be used in conjunction with other analyzers to return the frequencies of the bigrams or trigrams found, e.g.:
"please divide this please divide sentence into shingles"
Would return 2 for "please divide"?
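To make the behaviour I'm after concrete, here is a minimal plain-Java sketch (no Lucene, just whitespace splitting) of the bigram counting I'd like to get out of the index; the class and method names are my own placeholders, not Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

public class ShingleFrequency {
    // Count bigram "shingles" in a whitespace-tokenized string.
    static Map<String, Integer> bigramCounts(String text) {
        String[] tokens = text.split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.length - 1; i++) {
            String shingle = tokens[i] + " " + tokens[i + 1];
            counts.merge(shingle, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            bigramCounts("please divide this please divide sentence into shingles");
        System.out.println(counts.get("please divide")); // prints 2
    }
}
```

In other words, I want the equivalent of these counts, but computed from the shingle tokens that ShingleFilter emits into the in-memory index.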
I should add that my strings are built up from a database and indexed by Lucene in memory; they are not persisted. I don't intend to use other products such as Solr.