
I found the Levenshtein edit distance algorithm (via the damerau-levenshtein gem) and I think it suits my purpose well enough.

This code compares every element to every other element in the array and adds the result of each comparison to a set of hashes, which is then sorted by the :distance key.

In production the array holds log messages from Java services, so a large edit distance tells me which log messages differ most from the rest.
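To make the idea of an edit distance concrete, here is a minimal pure-Ruby sketch of plain Levenshtein distance. It is only a stand-in for illustration: the `levenshtein` helper is hypothetical and is not the gem's implementation, and unlike Damerau-Levenshtein it does not treat a transposition as a single edit.

```ruby
# Plain Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
# Hypothetical helper for illustration, not the damerau-levenshtein gem.
def levenshtein(a, b)
  # prev[j] holds the distance between the processed prefix of a and b[0, j]
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,            # delete ca
               curr[j - 1] + 1,        # insert cb
               prev[j - 1] + cost].min # substitute (or keep) the character
    end
    prev = curr
  end
  prev.last
end

levenshtein("kitten", "sitting") # => 3

# Every unordered pair is visited exactly once with Array#combination.
logs = ["Failed to process service event Error: 404 Not Found",
        "Failed to process service event Error: Resource not found in Storage service",
        "Throughput exceeded for table test-us-east-1-service-table."]
pairwise = logs.combination(2).map { |x, y| [x, y, levenshtein(x, y)] }
```

Each entry of `pairwise` is a `[message, other_message, distance]` triple, so messages that share a long common prefix end up with smaller distances to each other.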

Input data is in this form:
["Failed to process service event Error: 404 Not Found", "Failed to process service event Error: Resource not found in Storage service", "Throughput exceeded for table test-us-east-1-service-table."]

require 'set'
require 'damerau-levenshtein'

def get_edit_distances(arr)
  if arr.empty?
    return []
  end
  if arr.length == 1
    return [arr[0]]
  end
  dl = DamerauLevenshtein
  results = Set.new
  i = 0 #array position
  while i < arr.length
    j = i + 1 #element to compare arr[i] against

    while j < arr.length
      results.add({message: arr[i], distance: dl.distance(arr[i], arr[j], 1, 256)})

      #This is to make sure we have every element in the final results
      if j+1 == arr.length 
        results.add({message: arr[j], distance: dl.distance(arr[0], arr[j], 1, 256)})
        break
      end

      j += 1 #increment 
    end
    i += 1
  end
  final_results = results.to_a
  #sort in descending order by distance
  final_results.sort! {|a,b| b[:distance] <=> a[:distance]}
  #remove duplicates of messages now that everything is sorted
  final_results.uniq! {|m| m[:message]}
  #return array of messages
  final_results.map {|r| r[:message]}
end

The output of this code is an array of the messages, ordered by uniqueness:
["Throughput exceeded for table test-us-east-1-service-table.", "Failed to process service event Error: Resource not found in Storage service", "Failed to process service event Error: 404 Not Found"]

For an array of 928 elements (normally there will be ~10,000,000), I got an output of 11801 elements. A single message can appear with several different edit distances, and the set only prevents duplicates where both the message and the distance are the same.
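A small sketch of that deduplication behaviour, using made-up messages and distances rather than real data: a `Set` treats two hashes as the same element only when all their keys and values match, so one message can still appear several times with different distances until `uniq` collapses it.

```ruby
require 'set'

results = Set.new
results.add({ message: "A", distance: 3 })
results.add({ message: "A", distance: 3 }) # identical hash, so the set ignores it
results.add({ message: "A", distance: 7 }) # same message, different distance, so it is kept
results.add({ message: "B", distance: 5 })
results.size # => 3

# Sorting by descending distance and then calling uniq keeps the first
# (largest-distance) entry for each message.
ranked = results.sort_by { |r| -r[:distance] }.uniq { |r| r[:message] }
ranked.map { |r| r[:message] } # => ["A", "B"]
```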

Benchmark results for the whole loop:

                    user      system     total       real
Edit Dist Loop:  62.260000   0.110000  62.370000 ( 62.456783)

Question: Is there a better way to create a sorted array/set of unique elements, ordered by uniqueness?

1 Answer


Hopefully I understood your original problem correctly: "sorting an array of log messages by uniqueness", that is, finding the log messages that occur most rarely.

If that is the case, try this:

def sort_by_uniqueness(arr)
  a = {}
  arr.each do |entry|
    a[entry] = 0 unless a.key?(entry)
    a[entry] += 1 
  end
  a.sort_by { |k, v| v }.map(&:first)
end
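As a rough illustration of what frequency-based ordering produces, here is a sketch of the same idea using `Enumerable#tally` (available from Ruby 2.7). The log messages are hypothetical sample data.

```ruby
# Hypothetical sample data: some messages repeat, one occurs only once.
logs = ["timeout", "timeout", "timeout", "disk full",
        "timeout", "disk full", "bad token"]

counts = logs.tally # message => number of occurrences
rarest_first = counts.sort_by { |_message, count| count }.map(&:first)
# => ["bad token", "disk full", "timeout"]
```

Note that this ranks messages by how often the exact same string occurs. If every message in the array is distinct, all counts are 1 and the ordering carries no information.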
Eddy K
  • I have revised my question so that it is hopefully easier to read. I'm not sure what the code you provided does, since there are some syntax errors in it. I believe `def sort_by_uniqueness(arr) a = {} arr.each do |entry| a[entry] = 0 unless a.key?(entry) a[entry] += 1 end a.sort_by { |k, v| v }.map(&:first) end` is what you meant to submit, but as far as I can see the output is no different from the input. – Scytherswings Jan 05 '16 at 15:19