In short:
In schema.xml I want to declare an analyzer that first breaks a field apart with PatternTokenizer and then passes each resulting value through PathHierarchyTokenizer.
(PathHierarchyTokenizer breaks up paths like "a/b/c" into [a, a/b, a/b/c].)
Longer Version of the Question:
My overall data is not CSV, but one field contains comma-separated values; logically it's a multi-valued field, but it's passed in as a single delimited string.
Those individual values happen to be taxonomy paths with slash separators.
So a document might look like:
<doc>
  <field name="id">12345</field>
  <field name="title">This is the Title</field>
  <field name="taxo_paths">A/B/C,D/E,F/G/H/I</field>
</doc>
First it should split the field taxo_paths into these tokens via PatternTokenizer with pattern=",":
- A/B/C
- D/E
- F/G/H/I
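As a rough sketch of that first stage on its own, a field type along these lines should give the comma split. The name `taxo_csv` is made up for illustration, and I'm assuming the stock `solr.PatternTokenizerFactory`, where `group="-1"` (the default) means the pattern is treated as a separator rather than as the token itself:

```xml
<!-- Hypothetical field type name; only the tokenizer line matters here. -->
<fieldType name="taxo_csv" class="solr.TextField">
  <analyzer>
    <!-- pattern is a regex; group="-1" makes it act as a delimiter between tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="," group="-1"/>
  </analyzer>
</fieldType>
```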
Then PathHierarchy should work its magic and turn them into:
- A
- A/B
- A/B/C
- D
- D/E
- F
- F/G
- F/G/H
- F/G/H/I
Path Hierarchy tokenizer is very cool!
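Taken by itself, the second stage would look roughly like the sketch below. Again the name `taxo_path` is hypothetical, and I'm assuming the stock `solr.PathHierarchyTokenizerFactory` with `delimiter="/"`:

```xml
<!-- Hypothetical field type name; handles a single path such as A/B/C. -->
<fieldType name="taxo_path" class="solr.TextField">
  <analyzer>
    <!-- Emits A, A/B, A/B/C for the input A/B/C -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
</fieldType>
```

As far as I can tell, an analyzer accepts only one tokenizer, so I can't just list this one after the pattern tokenizer in the same chain. That is the crux of the question.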
Let's assume I don't have control over how the data comes in, and that we don't want to use any custom Java filters or tokenizers. Also, I realize there's a subtlety in PathHierarchyTokenizer: it effectively creates synonyms by setting the position increment to 1 for only one of the tokens and to 0 for the rest. Let's assume I don't care about that either at the moment.