Generating logical terms during tokenization with Elasticsearch

Question

I would like to split the following record (the line of keywords is stored in a single column of a database table) into logical terms for building a faceted search:

Ballett, Fernsehen, Film, Sachbücher/Musik, Film, Theater/Theater, Ballett/Allgemeines, Nachschlagewerke, Theater, Bühnenbildner (Einz.), Deutsches Theatermuseum München, München; Museen, Stepanek, Siegfried, Deutsches Theatermuseum; Kategorien - Lexika & Nachschlagen - Brockhaus, Kinder- & Jugendbücher, Jugendbücher

The result should be:

Ballett
Fernsehen
Film
Sachbücher/Musik
Film
Theater/Theater
Ballett/Allgemeines
Nachschlagewerke
Theater
Bühnenbildner (Einz.)
Deutsches Theatermuseum München
München
Museen
Stepanek
Siegfried
Deutsches Theatermuseum
Kategorien
Lexika & Nachschlagen
Brockhaus
Kinder- & Jugendbücher
Jugendbücher

I've tried different things, but I have not found a way to split the long record correctly during tokenization. Is it possible with the Pattern Tokenizer?

Thanks for any hints.

What are your rules for creating that token list from the DB string? — ramseykhalaf, Sep 22 '13 at 20:07
I've tried: `my_analyzer: type: pattern group: 0 pattern: '([-,])'` — Stefan, Sep 22 '13 at 22:44
I mean in plain English: can you describe in words what your rules are for breaking the string up? Sometimes you split on `-` but other times you do not, e.g. `Kinder- & Jugendbücher`, so it is confusing. — ramseykhalaf, Sep 23 '13 at 11:01
At the moment I have no rule set defined, because that is what I'm looking for. I have the long string as described in my question and I would like to split it as shown above. Should I give you an example with English words? — Stefan, Sep 23 '13 at 18:24
I don't think we're on the same page. I just wanted you to describe the rules to me, not in code, but in layman's terms. Tell me how you got the list from the string in the question; it is not clear to me. — ramseykhalaf, Sep 24 '13 at 10:03
The list comes as it is from a user's database. It's one big memo field where the user can enter such content. I hope that is what you wanted to know. — Stefan, Sep 24 '13 at 14:57
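
One possible reading of the expected list in the question is: split on `,` or `;`, and on a `-` only when it has whitespace on both sides (so `Kinder- & Jugendbücher` stays intact). The sketch below checks that reading in plain Python. The rule set and the names `SPLIT_PATTERN` and `split_keywords` are assumptions made for illustration and do not come from the asker. The same expression could presumably serve as the `pattern` of an Elasticsearch pattern tokenizer, which splits on matches by default, but that has not been tested here.

```python
import re

# Assumed rule set, inferred from the expected output in the question:
#   - split on "," or ";" (surrounding whitespace is discarded)
#   - split on "-" only when it has whitespace on both sides
SPLIT_PATTERN = re.compile(r"\s*[,;]\s*|\s+-\s+")


def split_keywords(record):
    """Split one keyword record into terms, dropping empty pieces."""
    return [term for term in SPLIT_PATTERN.split(record.strip()) if term]


record = (
    "Ballett, Fernsehen, Film, Sachbücher/Musik, Film, Theater/Theater, "
    "Ballett/Allgemeines, Nachschlagewerke, Theater, Bühnenbildner (Einz.), "
    "Deutsches Theatermuseum München, München; Museen, Stepanek, Siegfried, "
    "Deutsches Theatermuseum; Kategorien - Lexika & Nachschlagen - Brockhaus, "
    "Kinder- & Jugendbücher, Jugendbücher"
)

# Print one term per line, in the same layout as the expected result.
for term in split_keywords(record):
    print(term)
```

Under these assumed rules the record yields the 21 terms listed in the question, duplicates such as `Film` included. Removing duplicates would be a separate step and is not needed for facet counts.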

0 Answers