
I am using the standard tokenizer in my Elasticsearch plugin. I need to iterate over each token produced by the standard tokenizer and replace it with encrypted text before it is written to the Lucene index. Is there any way to update the tokens of the standard tokenizer? Can anyone help?

Brisi
  • Can you provide a concrete example of what you're trying to achieve? – Val Aug 05 '20 at 14:36
  • @Val Updated question. Please take a look at it! – Brisi Aug 06 '20 at 06:11
  • That doesn't look concrete enough to me :-) Show a real text input and what you'd like to index. Also curious why it needs to be encrypted... how do you expect to search over encrypted data? – Val Aug 06 '20 at 06:12
  • You can have an ingest pipeline that does the encryption. But the question is why you would need to decrypt every time you read, as @Val asked. – Gibbs Aug 06 '20 at 06:24
  • The question is why you need to store/index those PII tokens at all, since you won't be able to search on them anyway... it's a waste of space, so what's the goal? All you need to do is scramble the PII bits in your source document; you don't need to index those at all, in my opinion. – Val Aug 06 '20 at 07:32

1 Answer


It's an interesting use case, but IMHO the tokenizer is not the correct place to do this. Basically, the Elasticsearch analysis process consists of the three phases below (sketched in code after the list):

  1. char filter
  2. tokenizer
  3. token filter
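
To make the three phases concrete, here is a minimal sketch of how they plug into a Lucene `Analyzer` (the class an Elasticsearch analyzer ultimately wraps). The class name `PhasesAnalyzer` and the particular filter choices are just illustrative assumptions; the APIs shown are Lucene 8.x:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Illustrative analyzer showing where each analysis phase plugs in.
public class PhasesAnalyzer extends Analyzer {

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Phase 1: char filter -- rewrites raw characters before tokenization.
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Phase 2: tokenizer -- splits the filtered character stream into tokens.
        Tokenizer tokenizer = new StandardTokenizer();
        // Phase 3: token filter(s) -- transform each token the tokenizer emits.
        TokenStream tokens = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, tokens);
    }
}
```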

If you want to change some characters before they reach the tokenizer, do it in a char filter; if you want to change the tokens themselves, do it in a token filter. As you can see, these two phases let you do far more transformation than the tokenizer phase itself.
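
If you do decide to transform tokens at index time, a custom token filter is the natural hook. Below is a minimal sketch of a Lucene `TokenFilter` that rewrites each token in place; `EncryptTokenFilter` and the `encrypt()` helper are hypothetical names, and the string reversal is only a stand-in for whatever real (deterministic) encryption you would use:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical token filter that replaces each token with a transformed form.
public final class EncryptTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public EncryptTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false; // no more tokens from the upstream tokenizer
        }
        // Read the current token, transform it, and write it back in place.
        String transformed = encrypt(termAtt.toString());
        termAtt.setEmpty().append(transformed);
        return true;
    }

    // Placeholder for real encryption: it must be deterministic, or the
    // query-time tokens will never match the tokens stored in the index.
    private String encrypt(String token) {
        return new StringBuilder(token).reverse().toString();
    }
}
```

Note that the transformation has to be applied at both index time and search time (i.e. wired into both analyzer chains), otherwise queries will never match the stored tokens. In an Elasticsearch plugin you would then expose this filter through the plugin's analysis extension point so it can be referenced by name in your index settings.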

Amit