I'm working on an entity matching problem where I have to check whether two records refer to the same business entity. Look at the two records below: in each one, the strings on either side of the double pipe (||) refer to the same entity. In the first record the two sides have "FAIRVILL" in common, and in the second record they have "WALMART" and "901" in common. Is there a string matching function that can perform this kind of comparison?

I tried Soundex and fuzzywuzzy in Python, but the results are not that promising. Any help is much appreciated.

FAIRVILLE NY DPS 7026||WALMART SFAIRVILLUTUSA
WALMART DEPOT 901||PRICEWALMART SLC DRY A0901
min2bro
  • I don't see a "grantsvill" in the first record. Also, this sounds like a problem that could be solved using a regex matcher. – Muntaser Ahmed Jan 16 '18 at 05:07
  • Question updated, it's Fairville. Regex won't help here since there are also spelling errors: in the second text the characters 'e' and 's' are missing from 'fairsville'. – min2bro Jan 16 '18 at 05:20
  • you may want to look into things like "edit distance", and other string distance algorithms – njzk2 Jan 16 '18 at 05:24
  • I already tried Levenshtein distance, fuzzywuzzy and Soundex; the results are not very promising. – min2bro Jan 16 '18 at 05:30
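
The kind of comparison described in the question (finding the fragment that both sides of a record share) can be sketched with the standard library's difflib. This is only a rough baseline under stated assumptions, not a full entity matcher: the helper name `common_blocks` and the `min_len=3` threshold are made up for illustration, and the two records are the ones from the question.

```python
from difflib import SequenceMatcher


def common_blocks(a, b, min_len=3):
    """Return the aligned substrings of length >= min_len shared by a and b."""
    a_up, b_up = a.upper(), b.upper()
    # autojunk=False stops difflib from ignoring frequently repeated characters.
    matcher = SequenceMatcher(None, a_up, b_up, autojunk=False)
    blocks = []
    for match in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel and is filtered out here.
        if match.size >= min_len:
            blocks.append(a_up[match.a:match.a + match.size].strip())
    return blocks


# The two sample records from the question, split on the double pipe.
records = [
    "FAIRVILLE NY DPS 7026||WALMART SFAIRVILLUTUSA",
    "WALMART DEPOT 901||PRICEWALMART SLC DRY A0901",
]
for record in records:
    left, right = record.split("||")
    print(common_blocks(left, right))
# prints ['FAIRVILL'] and then ['WALMART', '901']
```

Unlike a whole-string edit distance, this only looks for shared runs of characters, so the extra tokens around the common part do not drag the score down. The threshold would need tuning on real data, since short blocks such as "901" can also match by coincidence.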

1 Answer

reference

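from collections import Counter, defaultdict


class MissingLetterModel:
    # The class name, the constructor and its default values are assumptions
    # added so the two methods can run; the snippet showed only the methods.
    def __init__(self, order=2, smoothing_missed=0.3, smoothing_total=1.0):
        self.order = order
        self.smoothing_missed = smoothing_missed
        self.smoothing_total = smoothing_total

    def fit(self, sentence_pairs):
        """ Estimate the probability of each symbol being missed
        Parameters:
            sentence_pairs - list of (original phrase, abbreviation)
        In the abbreviation, all missed symbols are replaced with "-"
        """
        self.missed_counter_ = defaultdict(Counter)
        self.total_counter_ = defaultdict(Counter)
        for original, observed in sentence_pairs:
            # Skip the first `order` symbols: they have no full context.
            for i, (original_letter, observed_letter) \
                    in enumerate(zip(original[self.order:], observed[self.order:])):
                # The `order` symbols that precede original_letter.
                context = original[i:(i + self.order)]
                if observed_letter == '-':
                    self.missed_counter_[context][original_letter] += 1
                self.total_counter_[context][original_letter] += 1
        return self

    def predict_proba(self, context, last_letter):
        """ Estimate the probability of last_letter being missed after context"""
        if self.order:
            local = context[-self.order:]
        else:
            # context[-0:] would be the whole string, so handle order 0 apart.
            local = ''
        # Additive smoothing keeps unseen (context, letter) pairs away from 0/0.
        missed_freq = self.missed_counter_[local][last_letter] + self.smoothing_missed
        total_freq = self.total_counter_[local][last_letter] + self.smoothing_total
        return missed_freq / total_freq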
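
The fit docstring expects each abbreviation to be aligned with its original phrase, with every dropped symbol replaced by "-". One way such pairs might be prepared is sketched below. The helper `mask_missing` is hypothetical (it is not part of the code above), and it assumes the abbreviation only drops characters, never changes or reorders them, and that "-" does not occur in the original.

```python
def mask_missing(original, abbreviation):
    """Align abbreviation to original, writing '-' for each dropped symbol.

    Returns None when abbreviation is not a subsequence of original.
    """
    masked = []
    j = 0  # position of the next abbreviation symbol still to be matched
    for letter in original:
        if j < len(abbreviation) and letter == abbreviation[j]:
            masked.append(letter)
            j += 1
        else:
            masked.append('-')
    # Leftover abbreviation symbols mean it was not a pure deletion.
    return ''.join(masked) if j == len(abbreviation) else None


print(mask_missing("FAIRVILLE", "FAIRVILL"))  # FAIRVILL-
print(mask_missing("WALMART", "WLMRT"))       # W-LM-RT
```

Each (original, masked) pair produced this way has the same length on both sides, which is what the zip over the two strings in fit relies on.
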
Zhao
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. Please [Take the Tour](https://stackoverflow.com/tour), and be sure with your [answer link](https://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers/8259#8259) – Agilanbu Nov 28 '18 at 07:38