So we have an old Elasticsearch index that succumbed to field explosion. We have redesigned the index structure to fix this using nested documents. However, we are trying to figure out how to migrate the old index data into the new structure. We are currently looking at the Logstash aggregate filter plugin to accomplish this, but every example we can find shows how to create the nested documents from database calls, not from a field-exploded index. For context, here is an example of what a document in the old index looks like:
"assetID": 22074,
"metadata": {
"50": {
"analyzed": "Phase One",
"full": "Phase One",
"date": "0001-01-01T00:00:00"
},
"51": {
"analyzed": "H 25",
"full": "H 25",
"date": "0001-01-01T00:00:00"
},
"58": {
"analyzed": "50",
"full": "50",
"date": "0001-01-01T00:00:00"
}
}
And here is what we would like the transformed data to look like in the end:
"assetID": 22074,
"metadata": [{
"metadataId": 50,
"ngrams": "Phase One", //This was "analyzed"
"alphanumeric": "Phase One", //This was "full"
"date": "0001-01-01T00:00:00"
}, {
"metadataId": 51,
"ngrams": "H 25", //This was "analyzed"
"alphanumeric": "H 25", //This was "full"
"date": "0001-01-01T00:00:00"
}, {
"metadataId": 58,
"ngrams": "50", //This was "analyzed"
"alphanumeric": "50", //This was "full"
"date": "0001-01-01T00:00:00"
}
}]
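For completeness, this is roughly the mapping we have in mind for the new index. The nested type on metadata is the important part; the document type name ("asset") and the individual field types are just placeholders here, and the ngram/alphanumeric analyzer settings are omitted:
{
  "mappings": {
    "asset": {
      "properties": {
        "assetID": { "type": "long" },
        "metadata": {
          "type": "nested",
          "properties": {
            "metadataId":   { "type": "integer" },
            "ngrams":       { "type": "text" },
            "alphanumeric": { "type": "keyword" },
            "date":         { "type": "date" }
          }
        }
      }
    }
  }
}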
As a dumbed-down example, here is what we have been able to piece together so far from the aggregate plugin documentation:
input {
  elasticsearch {
    hosts => "my.old.host.name:9266"
    index => "my-old-index"
    query => '{"query": {"bool": {"must": [{"term": {"_id": "22074"}}]}}}'
    size => 500
    scroll => "5m"
    docinfo => true
  }
}
filter {
  aggregate {
    task_id => "%{assetID}"
    code => "
      map['assetID'] = event.get('assetID')
      map['metadata'] ||= []
      map['metadata'] << {
        'metadataId'   => ?, # somehow parse the id out of the exploded field name 'metadata.#.full'
        'ngrams'       => event.get('[metadata][#][analyzed]'),
        'alphanumeric' => event.get('[metadata][#][full]'),
        'date'         => event.get('[metadata][#][date]')
      }
    "
    push_previous_map_as_event => true
    timeout => 150000
    timeout_tags => ['aggregated']
  }
  if "aggregated" not in [tags] {
    drop {}
  }
}
output {
  elasticsearch {
    hosts => "my.new.host:9266"
    index => "my-new-index"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
    action => "update"
  }
  file {
    path => "C:\apps\logstash\logstash-5.6.6\testLog.log"
  }
}
Obviously the above example is basically just pseudocode, but that is all we can gather from the documentation for Logstash, Elasticsearch, and the aggregate filter plugin, as well as generally Googling things within an inch of their life.
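The closest we have come to expressing the transformation ourselves is a plain ruby filter along the lines of the sketch below. This is untested and assumes that event.get('metadata') hands the exploded object back as a Ruby Hash keyed by the metadata id; we do not know whether this is the right direction, or whether the aggregate filter is still needed on top of it given that each old document already contains all of its own metadata entries:
filter {
  ruby {
    code => "
      # Assumption: the exploded 'metadata' object comes back as a Hash keyed
      # by the metadata id ('50', '51', '58', ...). Turn each entry into one
      # element of an array so it can be indexed as a nested document.
      old = event.get('metadata')
      if old.is_a?(Hash)
        rows = old.map do |id, values|
          {
            'metadataId'   => id.to_i,
            'ngrams'       => values['analyzed'],
            'alphanumeric' => values['full'],
            'date'         => values['date']
          }
        end
        event.set('metadata', rows)
      end
    "
  }
}
If something like that is viable, our remaining question is whether the rest of the pipeline above (the elasticsearch input and output) can stay essentially as-is.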