I would like to dynamically enrich an existing index based on the (weighted) term frequencies given in a second index.
Imagine I have one index with one field I want to analyze (field_of_interest
):
POST test/_doc/1
{
"field_of_interest": "The quick brown fox jumps over the lazy dog."
}
POST test/_doc/2
{
"field_of_interest": "The quick and the dead."
}
POST test/_doc/3
{
"field_of_interest": "The lazy quack was quick to quip."
}
POST test/_doc/4
{
"field_of_interest": "Quick, quick, quick, you lazy, lazy guys! "
}
and a second one (scores
) with pairs of keywords and weights:
POST scores/_doc/1
{
"term": "quick",
"weight": 1
}
POST scores/_doc/2
{
"term": "brown",
"weight": 2
}
POST scores/_doc/3
{
"term": "lazy",
"weight": 3
}
POST scores/_doc/4
{
"term": "green",
"weight": 4
}
I would like to define and perform some kind of analysis, ingestion, transform, enrichment, re-indexing or whatever to dynamically add a new field points
to the first index that is the aggregation (sum) of the weighted number of occurrences of each of the search terms from the second index in the field_of_interest
in the first index. So after performing this operation, I would want a new index to look something like this (some fields omitted):
{
"_id":"1",
"_source":{
"field_of_interest": "The quick brown fox jumps over the lazy dog.",
"points": 6
}
},
{
"_id":"2",
"_source":{
"field_of_interest": "The quick and the dead.",
"points": 1
}
},
{
"_id":"3",
"_source":{
"field_of_interest": "The lazy quack was quick to quip.",
"points": 4
}
},
{
"_id":"4",
"_source":{
"field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
"points": 9
}
}
If possible, it may even be interesting to get individual fields for each of the terms, listing the weighted sum of the occurrences, e.g.
{
"_id":"4",
"_source":{
"field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
"quick": 3,
"brown": 0,
"lazy": 6,
"green": 0,
"points": 9
}
}
The question I now have is how to go about this in Elasticsearch. I am fairly new to Elastic, and there are many concepts that seem promising, but so far I have not been able to pinpoint even a partial solution.
I am on Elasticsearch 7.x (but would be open to move to 8.x) and want to do this via the API, i.e. without using Kibana.
I first thought of an _ingest
pipeline with an _enrich
policy, since I am kind of trying to add information from one index to another. But my understanding is that the matching does not allow for a query, so I don't see how this could work.
I also looked at _transform
, _update_by_query
, custom scoring, _term_vector
but to be honest, I am a bit lost.
I would appreciate any pointers whether what I want to do can be done with Elasticsearch (I assumed it would kind of be the perfect tool) and if so, which of the many different Elasticsearch concept would be most suitable for my use case.