Solr: Combining PatternTokenizerFactory and PathHierarchyTokenizerFactory?

Question

In short:

In schema.xml I want to declare an analyzer that breaks a field apart with the PatternTokenizer, and then have the resulting values processed by the PathHierarchyTokenizer.

(Path Tokenizer breaks up paths like "a/b/c" into [a, a/b, a/b/c])

Longer Version of the Question:

My overall data is not CSV, but I've got one field that contains comma separated values; logically it's like a multi-value field, but it's passed in as just one delimited string.

Those individual values happen to be taxonomy paths with slash separators.

So a document might look like:

<doc>
  <field name="id">12345</field>
  <field name="title">This is the Title</field>
  <field name="taxo_paths">A/B/C,D/E,F/G/H/I</field>
</doc>

First it should split the field taxo_paths into these tokens via PatternTokenizer pattern=",":

A/B/C
D/E
F/G/H/I

Then PathHierarchy should work its magic and turn them into:

A
A/B
A/B/C
D
D/E
F
F/G
F/G/H
F/G/H/I

Path Hierarchy tokenizer is very cool!
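To pin down the intended behavior, here is a minimal Python sketch of the two stages described above. It is not Solr code and does not use any Solr API; the function name `split_then_expand` and its parameters are made up for illustration.

```python
def split_then_expand(value, value_sep=",", path_sep="/"):
    """Split on value_sep, then emit every leading sub-path of each piece.

    Hypothetical helper: it only mimics the token *text* that the two
    tokenizers would produce, not offsets or position increments.
    """
    tokens = []
    for path in value.split(value_sep):
        parts = path.split(path_sep)
        # "A/B/C" yields "A", "A/B", "A/B/C"
        for i in range(1, len(parts) + 1):
            tokens.append(path_sep.join(parts[:i]))
    return tokens


print(split_then_expand("A/B/C,D/E,F/G/H/I"))
```

For the sample document this prints the nine tokens listed above, in the same order.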

Let's assume I don't have control over how the data comes in. And assume we don't want to use any custom Java filters or tokenizers. Also, I realize there's a subtlety in PathHierarchyTokenizer in that it's actually creating synonyms by setting the position increment to 1 for only one of the tokens and to 0 for the rest; let's assume I don't care about that either at the moment.

So the main problem you are facing is that `solr.PatternTokenizerFactory` and `solr.PathHierarchyTokenizerFactory` are both tokenizers and you can specify only one tokenizer in an analyzer chain, right? — arun, Aug 02 '14 at 06:07
@arun yup, that's it. I really think both would be useful if packaged as filters, I could imagine scenarios for mixing and matching both with other logic. — Mark Bennett, Aug 02 '14 at 20:11

arun · Answer 1 · 2014-08-02T14:38:57.237

Here is one possible way to do it.

We have to give up one of the tokenizers, since an analyzer chain can have only one tokenizer. This solution gives up solr.PathHierarchyTokenizerFactory (sorry, I gave up your favorite tokenizer ;)).

Once we have the tokens split with solr.PatternTokenizerFactory on commas, we will use an edge N-gram filter followed by a Pattern Replace filter to remove the tokens ending with a forward slash and finally prune the zero-length tokens.

Here is the fieldType definition:

<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" 
                 pattern="," 
                 group="-1"/>
      <filter class="solr.EdgeNGramFilterFactory" 
              minGramSize="1" 
              maxGramSize="100" 
              side="front"/>
      <filter class="solr.PatternReplaceFilterFactory" 
              pattern="^.*/$" 
              replacement=""/>
      <filter class="solr.LengthFilterFactory" 
              min="1" 
              max="10"/>
  </analyzer>
</fieldType>

Here is what my Solr 4.2 analysis output looks like:

[image: screenshot of the Solr analysis output, not preserved]

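EDIT: This solution works only if the component terms in the taxonomy are single characters, because the edge N-gram filter emits every prefix of a token, including prefixes that end in the middle of a term.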
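The limitation is easy to reproduce outside Solr. The following Python sketch imitates the chain from the fieldType (split on commas, front edge n-grams, blank out grams that end in a slash, drop tokens outside the length range). It is a rough simulation under those assumptions, not the actual Lucene implementation, and `simulate_chain` is a hypothetical name.

```python
import re


def simulate_chain(value, min_gram=1, max_gram=100, min_len=1, max_len=10):
    """Approximate the analyzer chain above on plain strings."""
    out = []
    for token in value.split(","):              # PatternTokenizer, pattern=","
        upper = min(max_gram, len(token))
        for n in range(min_gram, upper + 1):    # front edge n-grams
            gram = token[:n]
            gram = re.sub(r"^.*/$", "", gram)   # blank grams ending in "/"
            if min_len <= len(gram) <= max_len: # length filter
                out.append(gram)
    return out


print(simulate_chain("A/B/C,D/E"))  # single-character terms
print(simulate_chain("ab/cd"))      # multi-character terms
```

With single-character terms the result is exactly the path hierarchy. With `ab/cd` the output also contains `a` and `ab/c`, which are not valid taxonomy paths.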

Oh geez, I should've specifically stated that they were variable length alphanumeric and spaces, and was just using "A", "B" and "C" as arbitrary labels, but thanks anyway! It annoys me that both were packaged only as tokenizers, it seems like either one might be useful as a token filter. Long term I'll probably take a stab at refactoring them and submitting as JIRA patches, but until/if they get packaged as part of the standard distro, it'll still be a hassle for easy usage/deployment; folding in custom patches or jars is more of a hassle than some coders realize for busy clients ;-) — Mark Bennett, Aug 02 '14 at 20:08
Try asking the solr user group. The folks there are very helpful (but not that active on stackoverflow). — arun, Aug 02 '14 at 23:46

score 0 · Answer 2 · answered Aug 03 '14 at 20:32


One way to do it would have been to split on commas before the analyzer chain, specifically in an UpdateRequestProcessor (URP). Unfortunately, I am not aware of a URP that does splitting, only one that does joining.
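Conceptually, such a splitting step would turn the single delimited string into several values before analysis. The Python sketch below only illustrates that transformation on a plain dictionary; it is not a Solr update processor, and `split_field` is a hypothetical helper. It assumes the target field would be declared multi-valued, in which case each value could presumably be analyzed on its own by PathHierarchyTokenizerFactory.

```python
def split_field(doc, field="taxo_paths", sep=","):
    """Return a copy of doc with the delimited string replaced by a list.

    Hypothetical pre-analysis step; empty pieces are dropped and
    surrounding whitespace is trimmed.
    """
    updated = dict(doc)
    raw = updated.get(field)
    if isinstance(raw, str):
        updated[field] = [v.strip() for v in raw.split(sep) if v.strip()]
    return updated


doc = {"id": "12345", "title": "This is the Title",
       "taxo_paths": "A/B/C,D/E,F/G/H/I"}
print(split_field(doc)["taxo_paths"])
```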


Alexandre Rafalovitch


I had been wondering the same thing. The CloneFieldUpdateProcessor is in the neighborhood but not quite it. – Mark Bennett Aug 04 '14 at 14:58
