
So we have an old Elasticsearch index that succumbed to field explosion. We have redesigned the structure of the index to fix this using nested documents. However, we are trying to figure out how to migrate the old index data into the new structure. We are currently looking at using Logstash plugins, notably the aggregate plugin, to accomplish this. However, all the examples we can find show how to create the nested documents from database calls, as opposed to from a field-exploded index. For context, here is an example of what an old index document might look like:

"assetID": 22074,
"metadata": {
  "50": {
    "analyzed": "Phase One",
    "full": "Phase One",
    "date": "0001-01-01T00:00:00"
  },
  "51": {
    "analyzed": "H 25",
    "full": "H 25",
    "date": "0001-01-01T00:00:00"
  },
  "58": {
    "analyzed": "50",
    "full": "50",
    "date": "0001-01-01T00:00:00"
  }
}

And here is what we would like the transformed data to look like in the end:

"assetID": 22074,
"metadata": [{
    "metadataId": 50,
    "ngrams": "Phase One", //This was "analyzed"
    "alphanumeric": "Phase One", //This was "full"
    "date": "0001-01-01T00:00:00"
  }, {
    "metadataId": 51,
    "ngrams": "H 25", //This was "analyzed"
    "alphanumeric": "H 25", //This was "full"
    "date": "0001-01-01T00:00:00"
  }, {
    "metadataId": 58,
    "ngrams": "50", //This was "analyzed"
    "alphanumeric": "50", //This was "full"
    "date": "0001-01-01T00:00:00"
  }]

As a dumbed-down example, here is what we can piece together from the aggregate plugin's documentation:

input {
  elasticsearch {
    hosts => "my.old.host.name:9266"
    index => "my-old-index"
    query => '{"query": {"bool": {"must": [{"term": {"_id": "22074"}}]}}}'  
    size => 500
    scroll => "5m"
    docinfo => true
  }
}

filter {
   aggregate {
    task_id => "%{id}"

    code => "     
      map['assetID'] = event.get('assetID')
      map['metadata'] ||= []
      map['metadata'] << {
        metadataId => ? //somehow parse the Id out of the exploded field name "metadata.#.full",
        ngrams => event.get('metadata.#.analyzed'),
        alphanumeric => event.get('metadata.#.full'),
        date => event.get('metadata.#.date'),
      }
    "
    push_previous_map_as_event => true
    timeout => 150000
    timeout_tags => ['aggregated']    
  } 

   if "aggregated" not in [tags] {
    drop {}
  }

}

output {
  elasticsearch {
    hosts => "my.new.host:9266"
    index => "my-new-index"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
    action => "update"
  }

  file {
    path => "C:\apps\logstash\logstash-5.6.6\testLog.log"
  }  
}

Obviously the above example is basically just pseudocode, but that is all we can gather from looking at the documentation for both Logstash and Elasticsearch, as well as the aggregate filter plugin, and from generally Googling things within an inch of their life.

cidthecoatrack

2 Answers


You can play around with the event object, massage it, and then add it into the new index. Something like below (the Logstash code is untested, so you may find some errors; check the working Ruby code after this section):

  aggregate {
    task_id => "%{id}"

    code => "
      # note: single quotes inside the code string so they do not clash with the config's double quotes
      arr = Array.new()

      map['assetID'] = event.get('assetID')

      # walk each exploded metadata entry and rebuild it as an object
      metadataObj = event.get('metadata')
      metadataObj.to_hash.each do |key, value|
        transformedMetadata = {}
        transformedMetadata['metadataId'] = key

        value.to_hash.each do |k, v|
          if k == 'analyzed' then
            transformedMetadata['ngrams'] = v
          elsif k == 'full' then
            transformedMetadata['alphanumeric'] = v
          else
            transformedMetadata['date'] = v
          end
        end

        arr.push(transformedMetadata)
      end

      map['metadata'] ||= []
      map['metadata'].concat(arr)
    "
  }

Try playing around with the above based on the event input and you will get there. Here's a working example, using the input you have in the question, for you to play around with: https://repl.it/repls/HarshIntelligentEagle

json_data = {
  "assetID": 22074,
  "metadata": {
    "50": {
      "analyzed": "Phase One",
      "full": "Phase One",
      "date": "0001-01-01T00:00:00"
    },
    "51": {
      "analyzed": "H 25",
      "full": "H 25",
      "date": "0001-01-01T00:00:00"
    },
    "58": {
      "analyzed": "50",
      "full": "50",
      "date": "0001-01-01T00:00:00"
    }
  }
}

arr = Array.new()
transformedObj = {}
transformedObj["assetID"] = json_data[:assetID]


json_data[:metadata].to_hash.each do |key,value|  
  transformedMetadata = {}
  transformedMetadata["metadataId"] = key  
  
  value.to_hash.each do |k , v|
  
    if k == :analyzed then
       transformedMetadata["ngrams"] = v
    elsif k == :full then
       transformedMetadata["alphanumeric"] = v
    else
       transformedMetadata["date"] = v
    end
  end
  arr.push(transformedMetadata)
end
transformedObj["metadata"] = arr

puts transformedObj
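
One wrinkle in the snippet above: because the sample hash uses "50": style keys, Ruby parses them as symbols, so metadataId comes out as :"50" rather than the integer 50 shown in the desired output. A small tweak (assuming the exploded field names are always numeric, as in the sample data) converts the key to an integer:

transformedMetadata["metadataId"] = key.to_s.to_i  # :"50" -> 50
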
Polynomial Proton

In the end, we used Ruby code (via the ruby filter) to solve it in a script:

# Must use the input plugin for elasticsearch at version 4.0.2, or it cannot contact a 1.X index
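# (If needed, a specific plugin version can usually be installed with something like:
#   bin/logstash-plugin install --version 4.0.2 logstash-input-elasticsearch)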
input {
  elasticsearch {
    hosts => "my.old.host.name:9266"
    index => "my-old-index"
    query => '{
      "query": {
        "bool": {
          "must": [
            { "match_all": { } }
          ]
        }
      }
    }' 
    size => 500
    scroll => "5m"
    docinfo => true
  }
}

filter {
  mutate {
    remove_field => ['@version', '@timestamp']
  }
}

#metadata
filter {
  mutate {
    rename => { "[metadata]" => "[metadata_OLD]" }
  }

  ruby {
    code => "
      metadataDocs = []
      metadataFields = event.get('metadata_OLD')

      # rebuild each exploded metadata entry ('50', '51', ...) as an object in a flat array
      metadataFields.each { |key, value|
        metadataDoc = {
          'metadataID' => key.to_i,
          'date' => value['date']
        }

        if !value['full'].nil?
          metadataDoc['alphanumeric'] = value['full']
        end

        if !value['analyzed'].nil?
          metadataDoc['ngrams'] = value['analyzed']
        end

        metadataDocs << metadataDoc
      }

      event.set('metadata', metadataDocs)
    "
  }

  mutate {
    remove_field => ['metadata_OLD']
  }
}

output {
  elasticsearch {
    hosts => "my.new.host:9266"
    index => "my-new-index"
    document_type => "searchasset"
    document_id => "%{assetID}"
    action => "update"
    doc_as_upsert => true
  }
  file {
    path => "F:\logstash-6.1.2\logs\esMigration.log"
  }  
}
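
For anyone adapting this, here is a minimal standalone sketch of the ruby filter's transform, run outside Logstash against the sample document from the question (plain Ruby hashes stand in for the Logstash event API, and the mutate rename is simulated by reading from a metadata_OLD key directly):

old_doc = {
  'assetID' => 22074,
  'metadata_OLD' => {
    '50' => { 'analyzed' => 'Phase One', 'full' => 'Phase One', 'date' => '0001-01-01T00:00:00' },
    '51' => { 'analyzed' => 'H 25',      'full' => 'H 25',      'date' => '0001-01-01T00:00:00' },
    '58' => { 'analyzed' => '50',        'full' => '50',        'date' => '0001-01-01T00:00:00' }
  }
}

metadataDocs = []
old_doc['metadata_OLD'].each do |key, value|
  # same mapping as the filter: the exploded field name becomes metadataID,
  # 'full' becomes alphanumeric, 'analyzed' becomes ngrams
  metadataDoc = {
    'metadataID' => key.to_i,
    'date'       => value['date']
  }
  metadataDoc['alphanumeric'] = value['full'] unless value['full'].nil?
  metadataDoc['ngrams']       = value['analyzed'] unless value['analyzed'].nil?
  metadataDocs << metadataDoc
end

puts({ 'assetID' => old_doc['assetID'], 'metadata' => metadataDocs })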
cidthecoatrack