
Sometimes the data you get is not clean and contains variations of words that are misspelled or deliberately distorted. Can we find the words in a sentence that most closely resemble a given word?

For instance, I am looking for the word "Awesome", which appears in varied forms in sentences like:

"We had an awwweesssommmeeee dinner at sea resort"
"We had an awesomeeee dinner at sea resort"
"We had an awwesooomee dinner at sea resort"
etc.
Has QUIT--Anony-Mousse
Mindfreak
  • You have to think about accidentally selecting words that shouldn't match, like `"awful"`. There is no easy answer. Start with `agrep("awesome", x, max.distance=0.5, ignore.case=TRUE)` to see how Levenshtein distance works. – Pierre L Jun 14 '16 at 20:15
  • You are probably looking for http://datascience.stackexchange.com/ – Frank Jun 14 '16 at 20:17

2 Answers


Do you want to do this purely in SQL?

Otherwise you will need a fuzzy-matching string comparison function that you can call from SQL. The function would use some combination of algorithms such as Jaro-Winkler, Levenshtein, or n-grams, or phonetic matching such as Metaphone, Double Metaphone, Metaphone 3, or Soundex.

Depending on which version of SQL Server you are using, you could install and use the Data Quality components, which have custom CLR implementations of some of those algorithms, or the SSIS fuzzy-matching components, among other options.

I have personally coded C# .NET CLR functions to do this, but I am only dealing with names. Sentences get much more complicated, and you will probably want to split them into words/tokens, compare the parts, and then compare the whole.
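The token-by-token comparison described above can be sketched outside SQL as follows. This is only a minimal illustration in Java, not the CLR functions mentioned in this answer: the class name `FuzzyTokenMatchSketch`, the method names, and the threshold of 3 edits are all assumptions chosen for the example. It splits a sentence on whitespace and keeps the tokens whose Levenshtein distance to the target word is small enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and thresholds are illustrative assumptions.
class FuzzyTokenMatchSketch {

    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int levenshtein(final String a, final String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j += 1) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i += 1) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j += 1) {
                final int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            final int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the lowercased tokens within maxDistance edits of the target.
    static List<String> matches(final String sentence, final String target, final int maxDistance) {
        final String wanted = target.toLowerCase();
        final List<String> found = new ArrayList<>();
        for (final String token : sentence.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && levenshtein(token, wanted) <= maxDistance) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(final String... args) {
        System.out.println(matches("We had an awesomeeee dinner at sea resort", "awesome", 3));
    }
}
```

A fixed edit-distance threshold only goes so far: "awesomeeee" is 3 edits from "awesome" and "awful" is 5, but a heavily stretched form such as "awwweesssommmeeee" is 10 edits away, so it would need some normalization before comparison.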

Matt

As a quick solution, you could lowercase your documents, tokenize them on whitespace, and collapse each run of repeated characters in every term (legitimate double letters collapse too, so "dinner" is keyed as "diner"):

import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class CollapseConsecutiveCharsDemo {

    // Collapses each run of a repeated character to one, e.g. "awwesooomee" -> "awesome".
    public static String collapse(final String term) {
        final StringBuilder buffer = new StringBuilder();
        if (!term.isEmpty()) {
            char prev = term.charAt(0);
            buffer.append(prev);
            for (int i = 1; i < term.length(); i += 1) {
                final char curr = term.charAt(i);
                if (curr != prev) {
                    buffer.append(curr);
                    prev = curr;
                }
            }
        }
        return buffer.toString();
    }

    public static void main(final String... documents) {
        final Map<String, Set<String>> termVariations = new TreeMap<>();

        for (final String document : documents) {
            final Scanner scanner = new Scanner(document.toLowerCase());
            while (scanner.hasNext()) {
                final String expandedTerm = scanner.next();
                final String collapsedTerm = collapse(expandedTerm);
                // Group each original spelling under its collapsed form.
                termVariations
                    .computeIfAbsent(collapsedTerm, (key) -> new TreeSet<>())
                    .add(expandedTerm);
            }
        }

        for (final Map.Entry<String, Set<String>> entry : termVariations.entrySet()) {
            final String term = entry.getKey();
            final Set<String> variations = entry.getValue();
            System.out.printf("variations(\"%s\") = {%s}%n",
                term,
                variations.stream()
                    .map((variation) -> String.format("\"%s\"", variation))
                    .collect(Collectors.joining(", ")));
        }
    }
}

Example run:

% java CollapseConsecutiveCharsDemo "We had an awwweesssommmeeee dinner at sea resort" "We had an awesomeeee dinner at sea resort" "We had an awwesooomee dinner at sea resort"
variations("an") = {"an"}
variations("at") = {"at"}
variations("awesome") = {"awesomeeee", "awwesooomee", "awwweesssommmeeee"}
variations("diner") = {"dinner"}
variations("had") = {"had"}
variations("resort") = {"resort"}
variations("sea") = {"sea"}
variations("we") = {"we"}

For a more elaborate solution, you could tokenize your documents with the Stanford CoreNLP tokenizer, which handles punctuation correctly, and combine it with spelling correction, for example with liblevenshtein.
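To give a rough idea of how tokenizing and spelling correction fit together, here is a plain-Java stand-in that uses neither CoreNLP nor liblevenshtein. Everything in it is an assumption made for illustration: the class `NormalizeElongatedSketch`, the regex tokenizer, the small vocabulary, and the distance threshold. It collapses both the token and each vocabulary word, then picks the vocabulary word with the smallest Levenshtein distance.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: a stand-in for a real tokenizer and spelling corrector.
class NormalizeElongatedSketch {

    // Collapses each run of a repeated character to a single character.
    static String collapse(final String term) {
        final StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < term.length(); i += 1) {
            if (i == 0 || term.charAt(i) != term.charAt(i - 1)) {
                buffer.append(term.charAt(i));
            }
        }
        return buffer.toString();
    }

    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int levenshtein(final String a, final String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j += 1) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i += 1) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j += 1) {
                final int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            final int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest vocabulary word whose collapsed form is within
    // maxDistance edits of the collapsed token, or empty if none qualifies.
    static Optional<String> normalize(final String token, final List<String> vocabulary, final int maxDistance) {
        final String collapsedToken = collapse(token.toLowerCase());
        String best = null;
        int bestDistance = maxDistance + 1;
        for (final String word : vocabulary) {
            final int distance = levenshtein(collapsedToken, collapse(word.toLowerCase()));
            if (distance < bestDistance) {
                best = word;
                bestDistance = distance;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(final String... args) {
        final List<String> vocabulary = List.of("awesome", "awful", "dinner");
        // Splitting on anything that is not a letter or digit drops punctuation.
        for (final String token : "We had an awwweesssommmeeee dinner!".split("[^\\p{L}\\p{Nd}]+")) {
            System.out.println(token + " -> " + normalize(token, vocabulary, 1).orElse("(no match)"));
        }
    }
}
```

Collapsing the vocabulary as well as the token is what lets "dinner" still match itself, since both sides become "diner". A small threshold keeps "awful" from being mistaken for "awesome", although two different words can still collide when their collapsed forms are the same.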

Dylon