1

I have an index with documents in elasticsearch, each document has 480 fields. What I'm trying to do is to search for a term (e.g. "apple"), and get all the unique field names that their value matches the search term. So if my docs are:

{
  "field1": "123",
  "field2": "apple stock",
  "field3": "red apple",
},
{
  "field1": "apple",
  "field2": "apple stock",
  "field3": "green apple",
}

What I would like to get as a result of the query is an aggregation like this:

{
  "field1": ["apple"],
  "field2": ["apple stock"],
  "field3": ["red apple", "green apple"]
}

Since each doc has ~480 fields, I prefer to do a multi_match query and not a filter that has all the fields:

"query": {
        "multi_match": {
            "query": "apple",
            "type": "phrase"
        }
    }

Is this query possible in elasticsearch?

2 Answers2

0

Since you "don't know what you don't know", you'll probably have to resort to a scripted metric aggregation:

POST myindex/_search
{
  "size": 0,
  "query": {
    "multi_match": {
      "query": "apple",
      "type": "phrase"
    }
  },
  "aggs": {
    "fields_breakdown": {
      "scripted_metric": {
        "params": {
          "phrase": "apple"
        }, 
        "init_script": "state.key_map = [:];",
        "map_script": """
          for (def pair : params._source.entrySet()) {
            def val = pair.getValue();
            
            if (!(val instanceof String) || !val.toLowerCase().contains(params.phrase.toLowerCase())) {
              continue;
            }
            
            def key = pair.getKey();
            
            if (!state.key_map.containsKey(key)) {
              state.key_map[key] = [val];
            } else if (!state.key_map[key].contains(val)) {
              state.key_map[key].add(val);
            }
          }
        """,
        "combine_script": "return state",
        "reduce_script": "return states"
      }
    }
  }
}

This code won't scale well (depending on your index size) so use it with caution and perhaps in batches (either by using more restrictive queries or maybe by limiting the # of fields aggregated at once).

BTW I proposed a solution for filtering substring in term-like aggregations a while ago -- it may be relevant for you here.

Joe - GMapsBook.com
  • 15,787
  • 4
  • 23
  • 68
0

I'm actually thinking to use highlight to get the values that match the query and then in code remove the duplicates.

{
    "query": {
        "multi_match": {
            "query": "apple",
            "type": "phrase"
        }
    },
    "highlight": {
        "pre_tags": [
            ""
        ],
        "post_tags": [
            ""
        ],
        "fields": {
            "*": {}
        }
    }
}
  • That's good. You can then use [elastic-dump](https://github.com/elasticsearch-dump/elasticsearch-dump) to download/stream everything and then post process it in the language of your choice. – Joe - GMapsBook.com Feb 15 '21 at 09:49