I found the Levenshtein edit distance algorithm (via the damerau-levenshtein gem) and I think it suits my purpose well enough.
The function below (get_edit_distances) compares every element to every other element in the array, adding the result of each comparison to a set of hashes, which is later sorted by the :distance key.
When this code is in use, the array holds log messages from Java services, so large edit distances show me which messages are the most unusual compared to the rest.
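For anyone unfamiliar with the metric, here is a minimal pure-Ruby sketch of the plain Levenshtein distance. It is only an illustration of what "edit distance" measures, not the gem's implementation; the method name `levenshtein` is my own, and as far as I understand the Damerau variant additionally treats a swap of two adjacent characters as a single edit.

```ruby
# Illustrative only: plain Levenshtein distance using two rows of the
# dynamic-programming table. `levenshtein` is a hypothetical helper name.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  # prev[j] = distance between the first i-1 chars of a and first j chars of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning i chars of a into an empty string costs i deletions
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,       # delete ca
        curr[j - 1] + 1,   # insert cb
        prev[j - 1] + cost # substitute (free when the chars match)
      ].min
    end
    prev = curr
  end
  prev.last
end

levenshtein("kitten", "sitting") # => 3
```

Two log lines that share a long common prefix get a small distance, while a message with completely different wording gets a large one.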
Input data is in this form:
["Failed to process service event Error: 404 Not Found", "Failed to process service event Error: Resource not found in Storage service", "Throughput exceeded for table test-us-east-1-service-table."]
    require 'set'
    require 'damerau-levenshtein'

    def get_edit_distances(arr)
      if arr.empty?
        return []
      end
      if arr.length == 1
        return [arr[0]]
      end
      dl = DamerauLevenshtein
      results = Set.new
      i = 0 # array position
      while i < arr.length
        j = i + 1 # element to compare arr[i] against
        while j < arr.length
          results.add({message: arr[i], distance: dl.distance(arr[i], arr[j], 1, 256)})
          # This is to make sure we have every element in the final results
          if j + 1 == arr.length
            results.add({message: arr[j], distance: dl.distance(arr[0], arr[j], 1, 256)})
            break
          end
          j += 1 # increment
        end
        i += 1
      end
      final_results = results.to_a
      # sort in descending order by distance
      final_results.sort! {|a, b| b[:distance] <=> a[:distance]}
      # remove duplicates of messages now that everything is sorted
      final_results.uniq! {|m| m[:message]}
      # return array of messages
      final_results.map {|r| r[:message]}
    end
The output of this code is an array of the messages, ordered by uniqueness:
["Throughput exceeded for table test-us-east-1-service-table.", "Failed to process service event Error: Resource not found in Storage service", "Failed to process service event Error: 404 Not Found"]
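To make the last three steps of the function easier to follow: sorting in descending order and then de-duplicating by message effectively ranks each message by the largest distance recorded for it, because `uniq` keeps the first occurrence it sees. A small sketch with made-up distances (the values and the helper name `rank_by_max_distance` are hypothetical, not taken from a real run):

```ruby
# Mirrors the sort / uniq / map tail of get_edit_distances on invented data.
def rank_by_max_distance(results)
  sorted = results.sort { |a, b| b[:distance] <=> a[:distance] }
  # uniq keeps the first entry per message, i.e. the one with the largest distance
  sorted.uniq { |r| r[:message] }.map { |r| r[:message] }
end

# Hypothetical comparison results: each message appears once per comparison.
sample = [
  { message: "A", distance: 12 },
  { message: "B", distance: 40 },
  { message: "A", distance: 35 },
  { message: "C", distance: 7 },
  { message: "B", distance: 18 }
]

rank_by_max_distance(sample) # => ["B", "A", "C"]
```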
For an array of 928 elements (normally there will be ~10,000,000), I got an output of 11801 elements. There were multiple edit distances for a single message; the set only prevented duplicates of the same message with the same distance.
Benchmark results for the whole loop:
                         user     system      total        real
    Edit Dist Loop: 62.260000   0.110000  62.370000 ( 62.456783)
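For a sense of scale, here is a rough sketch of how many element pairs an all-against-all comparison has to visit. `pair_count` is a hypothetical helper of mine, and it only counts unordered pairs; it ignores the extra `arr[0]` comparison that the loop above adds on each pass.

```ruby
# Number of unordered pairs among n elements: n choose 2.
def pair_count(n)
  n * (n - 1) / 2
end

pair_count(928)        # => 430128
pair_count(10_000_000) # => 49999995000000
```

The count grows quadratically with the array length, which is why I am worried about the full-size input.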
Question: Is there a better way to create a sorted array/set of unique elements, ordered by uniqueness?