I'm trying to build a simple method that compares a last name against about 100 database entries and returns every entry whose letters match above a given percentage. My current approach is:

  1. Pull all 100 entries from the database into an array
  2. Iterate through them while performing the following action
  3. Split the last name into an array of letters
  4. Subtract that array from another array that contains the letters for the name I am trying to match which leaves only the letters that weren't matched.
  5. Take the size of the result and divide by the original size of the array from step 3 to get a percentage.
  6. If the percentage is above a predefined threshold, push that database object into a results array.

This works, but I feel like there must be some Ruby, regex, or Active Record method that does this more efficiently. I have googled quite a bit but can't find anything.

Wiktor Stribiżew
Nick D
  • Have you tried using the Levenshtein method? https://rubygems.org/gems/levenshtein/versions/0.2.2 - I think it is the more sophisticated version of what you are trying to do. – Todd Resudek Oct 17 '16 at 04:19
  • No. I did not know this existed. This is awesome, thank you. – Nick D Oct 17 '16 at 04:25
  • What DB are you using? – Pascal Oct 17 '16 at 06:07
  • @ToddResudek Levenshtein is arguably the most sophisticated method. Damerau-Levenshtein should give better results here. And [Jaro-Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) is even better. – Aleksei Matiushkin Oct 17 '16 at 06:58
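
For reference, the edit-distance idea raised in these comments can be sketched in plain Ruby without relying on any gem's API. The method names `levenshtein` and `similarity`, the normalization by the longer string's length, and the 0.75 cutoff are illustrative assumptions, not something taken from the gem linked above.

```ruby
# Dynamic-programming Levenshtein distance: the minimum number of
# single-character insertions, deletions, and substitutions needed
# to turn a into b (case-insensitive here, by assumption).
def levenshtein(a, b)
  a = a.downcase.chars
  b = b.downcase.chars
  # prev[j] is the distance between the processed prefix of a and b[0, j].
  prev = (0..b.size).to_a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1,      # deletion
               curr[j] + 1,          # insertion
               prev[j] + cost].min   # substitution or match
    end
    prev = curr
  end
  prev.last
end

# Hypothetical similarity score in 0.0..1.0, where 1.0 means identical.
def similarity(a, b)
  longest = [a.size, b.size].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b) / longest.to_f
end

names = ["Smith", "Smyth", "Jones"]
close = names.select { |n| similarity(n, "Smith") >= 0.75 }
```

With these sample names, `close` keeps "Smith" and "Smyth" and drops "Jones". For about 100 rows, running this in Ruby after loading the records should be fast enough.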

1 Answer

To comment on the merit of the measure you suggested would require speculation, which is out-of-bounds at SO. I therefore will merely demonstrate how you might implement your proposed approach.

Code

First define a helper method:

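class Array
  # Multiset difference: each element of other removes at most one
  # matching element of self (unlike Array#-, which removes all of them).
  def difference(other)
    counts = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
    reject { |e| counts[e] > 0 && counts[e] -= 1 }
  end
end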

In short, if

a = [3,1,2,3,4,3,2,2,4]
b = [2,3,4,4,3,4]

then

a - b           #=> [1]

whereas

a.difference(b) #=> [1, 3, 2, 2]

This method is elaborated in my answer to this SO question. I've found so many uses for it that I've proposed it be added to the Ruby Core.
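
One caveat: since Ruby 2.6, `Array#difference` exists in core with the same semantics as `-`, so reopening `Array` under that name replaces the built-in. Below is a sketch of the same counting idea as a plain method. The name `multiset_difference` is hypothetical, chosen only to avoid the clash.

```ruby
# Multiset difference: each element of `other` cancels at most one
# matching element of `arr`. Order of the survivors is preserved.
def multiset_difference(arr, other)
  # Count how many times each element appears in `other`.
  counts = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  arr.reject do |e|
    next false if counts[e].zero? # nothing left to cancel: keep e
    counts[e] -= 1                # use up one occurrence
    true                          # and drop e
  end
end

a = [3, 1, 2, 3, 4, 3, 2, 2, 4]
b = [2, 3, 4, 4, 3, 4]
multiset_difference(a, b) #=> [1, 3, 2, 2]
```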

The following method produces a hash whose keys are the elements of names (strings) and whose values are the fractions of the letters in the target string that are contained in each string in names.

def target_fractions(names, target)
  target_arr = target.downcase.scan(/[a-z]/)
  target_size = target_arr.size
  names.each_with_object({}) do |s,h|
    s_arr = s.downcase.scan(/[a-z]/)
    target_remaining = target_arr.difference(s_arr)
    h[s] = (target_size-target_remaining.size)/target_size.to_f
  end
end

Example

Suppose the target is

target = "Jimmy S. Bond"

and the names you are comparing are given by

names = ["Jill Dandy", "Boomer Asad", "Josefine Simbad"]

then

target_fractions(names, target)
  #=> {"Jill Dandy"=>0.5, "Boomer Asad"=>0.5, "Josefine Simbad"=>0.8} 

Explanation

For the above values of names and target,

target_arr = target.downcase.scan(/[a-z]/)
  #=> ["j", "i", "m", "m", "y", "s", "b", "o", "n", "d"] 
target_size = target_arr.size
  #=> 10

Now consider

s = "Jill Dandy"
h = {}

then

s_arr = s.downcase.scan(/[a-z]/)
  #=> ["j", "i", "l", "l", "d", "a", "n", "d", "y"]
target_remaining = target_arr.difference(s_arr)
  #=> ["m", "m", "s", "b", "o"]

h[s] = (target_size-target_remaining.size)/target_size.to_f
  #=> (10-5)/10.0 => 0.5
h #=> {"Jill Dandy"=>0.5}

The calculations are similar for Boomer and Josefine.
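
To cover the last step in the question (keeping only entries above a cutoff), the fractions can be filtered as in the sketch below. It is self-contained, so it repeats the counting logic as a plain method. The names `multiset_difference` and `matches_above` and the 0.6 threshold are assumptions made for illustration.

```ruby
# Multiset difference as a plain method (hypothetical name).
def multiset_difference(arr, other)
  counts = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  arr.reject { |e| counts[e] > 0 && counts[e] -= 1 }
end

# Fraction of the target's letters that each name accounts for.
def target_fractions(names, target)
  target_arr = target.downcase.scan(/[a-z]/)
  names.each_with_object({}) do |s, h|
    remaining = multiset_difference(target_arr, s.downcase.scan(/[a-z]/))
    h[s] = (target_arr.size - remaining.size) / target_arr.size.to_f
  end
end

# Hypothetical helper: keep the names scoring at or above `threshold`.
def matches_above(names, target, threshold)
  target_fractions(names, target).select { |_, frac| frac >= threshold }
end

names = ["Jill Dandy", "Boomer Asad", "Josefine Simbad"]
matches_above(names, "Jimmy S. Bond", 0.6)
  #=> {"Josefine Simbad"=>0.8}
```

With Active Record, `names` could come from something like `pluck(:last_name)` on the model, which is also an assumption about the schema.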

Cary Swoveland