I have to store a simple document like:

{content : "The cat is mine"}

Before indexing, however, I need to tag specific words, and "cat" is one of them. The indexed document should be:

{content : "The <pet>cat</pet> is mine"}

I read about highlighting specific words (with custom HTML tags too), but highlighting only applies to query results.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#tags

I need this operation before indexing. Another problem is that I have many tags (around 100,000!), each with its group (pet, car, flowers, etc.).

At the moment, before saving the document, I run a query for each word of the text to check whether it is a tag; if it is, I add <group_name>word</group_name>.

This is a very slow solution. Are there alternatives?

Dail
  • Just out of curiosity, why not do this transformation when the document is already returned? And if you need to know whether a word belongs to a certain category, you might try to use synonyms. https://www.elastic.co/guide/en/elasticsearch/guide/current/using-synonyms.html – Ashalynd Nov 29 '15 at 23:32
  • @Ashalynd A synonyms dictionary sounds like a very good idea. Elasticsearch applies the synonyms at index time, right? (I would need to transform the word into its respective HTML tag format.) – Dail Nov 29 '15 at 23:36
  • Are you sure that Elasticsearch replaces a word with its specific synonym (main form) during indexing? I thought synonyms were used for queries, no? – Dail Nov 29 '15 at 23:40
  • When you mention "before indexing" you actually mean "at indexing time", right? i.e. when the field is analyzed into tokens and just before being stored – Val Nov 30 '15 at 04:03
  • @Val Yes, exactly. So in this case Elasticsearch will replace the word with the form I used in the synonyms dictionary, right? – Dail Nov 30 '15 at 04:06
  • With synonyms, ES will (only) index the token `cat` instead of `cat`, but the `_source` would not be modified, i.e. when you retrieve your document, you'll still see `The cat is mine`. Is this what you want and may I ask what makes you think that you need this to happen at indexing time? – Val Nov 30 '15 at 04:16
  • You can also have a look at the [pattern replacement token filter](https://www.elastic.co/guide/en/elasticsearch/reference/2.1/analysis-pattern_replace-tokenfilter.html). – dtrv Nov 30 '15 at 06:51
  • You can let Elasticsearch analyze your document at index time or at query time: the first makes queries faster, the second gives you more flexibility. If you really want to have the input modified _before_ indexing, then you have to do it without Elasticsearch. Have a look at the Unix stream editor (sed) or the Unix translate tool (tr). – dtrv Nov 30 '15 at 06:57
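
The last suggestion (modify the input outside Elasticsearch, before indexing) could be sketched roughly as follows. This is only an illustration, not a confirmed solution from the thread: it assumes the whole tag list fits in memory as a plain dictionary, that tags are single words matched case-insensitively, and the names `TAGS` and `tag_text` are made up for the example. The point is that one in-memory lookup per word replaces one query per word.

```python
import re

# Hypothetical tag dictionary: word -> group name. In practice it would be
# loaded once from wherever the ~100,000 tags are stored.
TAGS = {"cat": "pet", "dog": "pet", "rose": "flowers", "jeep": "car"}

# Matches runs of word characters; multi-word tags are not handled here.
WORD_RE = re.compile(r"\w+")


def tag_text(text, tags=TAGS):
    """Wrap every known word in <group>word</group> in a single pass."""

    def wrap(match):
        word = match.group(0)
        group = tags.get(word.lower())  # constant-time dictionary lookup
        if group is None:
            return word  # not a tag: leave the word untouched
        return "<{0}>{1}</{0}>".format(group, word)

    return WORD_RE.sub(wrap, text)


# Build the document body that would then be sent to the index.
doc = {"content": tag_text("The cat is mine")}
print(doc)  # {'content': 'The <pet>cat</pet> is mine'}
```

The resulting `doc` would then be indexed as usual, so the stored `_source` already contains the tagged text.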

0 Answers