
I have an Elasticsearch v2.4.1 index in which I store values from a JSON feed. Sometimes the text in some fields arrives with its characters separated by spaces, like:

"titulo" : "E l a ñ o q u e e l m e r c a d o d e j ó d e a s u s t a r"

This happens around 15% of the time and prevents queries such as the following from matching the document above:

localhost:9200/indice/_search?q=titulo:mercado
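To make the failure concrete, here is a minimal Python sketch of what I believe happens at index time. It assumes the field goes through an analyzer that lowercases and splits on whitespace, roughly like the default standard analyzer; `simple_tokenize` is a hypothetical stand-in, not an Elasticsearch API.

```python
def simple_tokenize(text):
    """Hypothetical stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

titulo = "E l a ñ o q u e e l m e r c a d o d e j ó d e a s u s t a r"
tokens = simple_tokenize(titulo)

# Every token is a single letter, so the term "mercado" is never indexed.
print(all(len(t) == 1 for t in tokens))
print("mercado" in tokens)
```

Since the query term is compared against these single-letter tokens, `titulo:mercado` has nothing to match.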

I think the problem could be solved with some sort of char filter. I thought of the N-gram filter, but that does the opposite. I know this might be complex, since ES would, at some level, have to infer the language (or maybe I could specify it), deal with ambiguities, and so on.

More examples of the same problem:

"title" : "El g a l a r d ó n se e n t r e g a r á el p r ó x i m o día 2 4"

"title" : "G a m a a c t u a l i z a d a d e b o m b a s d e calor A q u a t e r m i c"

"title" : "K a s p e r s k y : m á s q u e a n t i v i r u s"
  • May I ask why those spaces are landing in your title in the first place? – Val Nov 19 '16 at 05:31
  • Those come from the JSON feed. It is the result of a bulk PDF text extraction or an OCR process. – Victor Duran Nov 20 '16 at 00:03
  • How do you distinguish where a word starts? If you want to remove all whitespace and use nGrams look here: http://stackoverflow.com/questions/29873344/elasticsearch-pattern-replace-replacing-whitespaces-while-analyzing - However, the cause of the problem is your JSON feed, so I would try fixing the cause. – Dennis Ich Nov 21 '16 at 11:15
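The approach suggested in the last comment (strip all whitespace, then index n-grams) can be sketched in plain Python. This only simulates the idea and does not call Elasticsearch; the helper names `collapse_whitespace` and `char_ngrams` are made up for illustration, and the n-gram length is simply set to the length of the search term.

```python
import re

def collapse_whitespace(text):
    """Lowercase and remove all whitespace, as a whitespace-stripping char filter would."""
    return re.sub(r"\s+", "", text.lower())

def char_ngrams(text, n):
    """Return the set of all character n-grams of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

titulo = "E l a ñ o q u e e l m e r c a d o d e j ó d e a s u s t a r"
collapsed = collapse_whitespace(titulo)

# With word boundaries gone, the search term is found as a substring n-gram.
grams = char_ngrams(collapsed, len("mercado"))
print("mercado" in grams)
```

The trade-off is the one raised in the comment: once the spaces are removed, word boundaries are lost, so n-grams that straddle two words (for example `adodejó`) become matchable as well.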
