
I would like to dynamically enrich an existing index based on the (weighted) term frequencies given in a second index.

Imagine I have one index with one field I want to analyze (field_of_interest):

POST test/_doc/1
{
  "field_of_interest": "The quick brown fox jumps over the lazy dog."
}
POST test/_doc/2
{
  "field_of_interest": "The quick and the dead."
}
POST test/_doc/3
{
  "field_of_interest": "The lazy quack was quick to quip."
}
POST test/_doc/4
{
  "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! "
}

and a second one (scores) with pairs of keywords and weights:

POST scores/_doc/1
{
  "term": "quick",
  "weight": 1
}
POST scores/_doc/2
{
  "term": "brown",
  "weight": 2
}
POST scores/_doc/3
{
  "term": "lazy",
  "weight": 3
}
POST scores/_doc/4
{
  "term": "green",
  "weight": 4
}

I would like to define and run some kind of analysis, ingestion, transform, enrichment, re-indexing or similar operation that dynamically adds a new field points to the first index. For each document, points should be the sum, over all terms in the second index, of the term's weight multiplied by the number of times that term occurs in field_of_interest. After performing this operation, I would want the new index to look something like this (some fields omitted):

{
  "_id":"1",
  "_source":{
    "field_of_interest": "The quick brown fox jumps over the lazy dog.",
    "points": 6
  }
},
{
  "_id":"2",
  "_source":{
    "field_of_interest": "The quick and the dead.",
    "points": 1
  }
},
{
  "_id":"3",
  "_source":{
    "field_of_interest": "The lazy quack was quick to quip.",
    "points": 4
  }
},
{
  "_id":"4",
  "_source":{
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "points": 9
  }
}
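
To make the expected values concrete, here is a minimal Python sketch that recomputes points for the four example documents locally, without Elasticsearch. The helper names (tokenize, points) are hypothetical, and the tokenizer (lowercase, split on non-alphanumeric characters) is only an assumption that roughly approximates what a standard analyzer would produce for this text.

```python
import re
from collections import Counter

# Example data copied from the two indices in the question.
DOCS = {
    "1": "The quick brown fox jumps over the lazy dog.",
    "2": "The quick and the dead.",
    "3": "The lazy quack was quick to quip.",
    "4": "Quick, quick, quick, you lazy, lazy guys! ",
}
WEIGHTS = {"quick": 1, "brown": 2, "lazy": 3, "green": 4}


def tokenize(text):
    # Assumption: lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def points(text, weights):
    # Sum of weight * number of occurrences for every scored term.
    counts = Counter(tokenize(text))
    return sum(weight * counts[term] for term, weight in weights.items())


computed = {doc_id: points(text, WEIGHTS) for doc_id, text in DOCS.items()}
print(computed)  # {'1': 6, '2': 1, '3': 4, '4': 9}
```

For document 4, "quick" occurs three times (3 x 1) and "lazy" twice (2 x 3), which gives the 9 shown above.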

If possible, it may even be interesting to get individual fields for each of the terms, listing the weighted sum of the occurrences, e.g.

{
  "_id":"4",
  "_source":{
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "quick": 3,
    "brown": 0,
    "lazy": 6,
    "green": 0,
    "points": 9
  }
}

The question I now have is how to go about this in Elasticsearch. I am fairly new to Elastic, and there are many concepts that seem promising, but so far I have not been able to pinpoint even a partial solution.

I am on Elasticsearch 7.x (but would be open to move to 8.x) and want to do this via the API, i.e. without using Kibana.

I first thought of an _ingest pipeline with an _enrich policy, since I am kind of trying to add information from one index to another. But my understanding is that the matching does not allow for a query, so I don't see how this could work.

I also looked at _transform, _update_by_query, custom scoring and _termvectors, but to be honest, I am a bit lost.

I would appreciate any pointers on whether what I want to do can be done with Elasticsearch (I assumed it would be the perfect tool) and, if so, which of the many different Elasticsearch concepts would be most suitable for my use case.

Paulo
buddemat

1 Answer


Follow this sequence of steps:

  1. Scroll through every document in the second index (scores) with the scroll API.
  2. For each term, search for it in the first index with a simple match query on field_of_interest.
  3. On every matching document, increment points with a scripted update.
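
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name build_update_request is hypothetical, the hard-coded score_docs list stands in for the scroll over the scores index, and steps 2 and 3 are combined into a single _update_by_query request body (match query plus Painless script) instead of separate search and update calls. The sketch only builds and prints the request bodies; sending them is left to whichever HTTP client you use.

```python
import json

# Painless script: create points if it is missing, then add the term's weight.
SCRIPT_SOURCE = (
    "if (ctx._source.points == null) { ctx._source.points = 0; } "
    "ctx._source.points += params.weight;"
)


def build_update_request(term, weight, field="field_of_interest"):
    # Body for POST test/_update_by_query: match the term, then bump points.
    return {
        "query": {"match": {field: term}},
        "script": {
            "lang": "painless",
            "source": SCRIPT_SOURCE,
            "params": {"weight": weight},
        },
    }


# Stand-in for step 1 (scrolling the scores index); values from the question.
score_docs = [
    {"term": "quick", "weight": 1},
    {"term": "brown", "weight": 2},
    {"term": "lazy", "weight": 3},
    {"term": "green", "weight": 4},
]

requests_to_send = [
    build_update_request(entry["term"], entry["weight"]) for entry in score_docs
]
for body in requests_to_send:
    print("POST test/_update_by_query")
    print(json.dumps(body, indent=2))
```

Note that a script like this runs once per matching document, so it adds each weight once regardless of how often the term occurs. Weighting by the number of occurrences, as in the question's document 4, would need additional counting logic, for example on the client.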

Having individual words as fields in the first index is not a good idea. We do not know in advance which words are going to be found inside the sentences, so the index mapping will explode with a lot of dynamic fields, which is not desirable. A better way is to add a nested field to the first index, with the following mapping:

{
  "words" : {
      "type" : "nested",
      "properties" : {
            "name" : {"type" : "keyword"},
            "weight" : {"type" : "float"}
      }
  }
}

Then you simply append an entry to this array for every word that is found. "points" can be a separate field.

What you want to do has to be done client-side. There is no built-in way to handle such an operation.
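
As a rough client-side sketch of what such a document update could contain, the Python below builds the words array and a separate points value for one sentence. The function name build_words is hypothetical, the tokenizer is the same simplifying assumption as a lowercase split on non-alphanumeric characters, and storing the weighted sum of occurrences in "weight" is one possible reading of the mapping above, not something the mapping itself prescribes.

```python
import re
from collections import Counter


def build_words(text, weights):
    # Count lowercase tokens; an approximation of a standard analyzer.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    # One nested entry per scored term that actually occurs in the text.
    words = [
        {"name": term, "weight": float(weight * counts[term])}
        for term, weight in weights.items()
        if counts[term] > 0
    ]
    return {"words": words, "points": sum(w["weight"] for w in words)}


weights = {"quick": 1, "brown": 2, "lazy": 3, "green": 4}
partial_doc = build_words("Quick, quick, quick, you lazy, lazy guys! ", weights)
print(partial_doc)
```

Terms that do not occur ("brown" and "green" here) are simply left out of the array, which avoids both zero-valued entries and new dynamic fields.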

HTH.

brugia