As a quick solution, you could lowercase your documents, tokenize them on whitespace, and collapse runs of consecutive, identical characters in each term:
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class CollapseConsecutiveCharsDemo {

    /** Collapses each run of consecutive, identical characters to a single character. */
    public static String collapse(final String term) {
        final StringBuilder buffer = new StringBuilder();
        if (!term.isEmpty()) {
            char prev = term.charAt(0);
            buffer.append(prev);
            for (int i = 1; i < term.length(); i += 1) {
                final char curr = term.charAt(i);
                // Only append the character if it differs from its predecessor.
                if (curr != prev) {
                    buffer.append(curr);
                    prev = curr;
                }
            }
        }
        return buffer.toString();
    }

    public static void main(final String... documents) {
        // Maps each collapsed term to the set of variations that produced it.
        final Map<String, Set<String>> termVariations = new TreeMap<>();
        for (final String document : documents) {
            try (final Scanner scanner = new Scanner(document.toLowerCase())) {
                while (scanner.hasNext()) {
                    final String expandedTerm = scanner.next();
                    final String collapsedTerm = collapse(expandedTerm);
                    termVariations
                        .computeIfAbsent(collapsedTerm, term -> new TreeSet<>())
                        .add(expandedTerm);
                }
            }
        }
        for (final Map.Entry<String, Set<String>> entry : termVariations.entrySet()) {
            final String term = entry.getKey();
            final Set<String> variations = entry.getValue();
            System.out.printf("variations(\"%s\") = {%s}%n",
                term,
                variations.stream()
                    .map(variation -> String.format("\"%s\"", variation))
                    .collect(Collectors.joining(", ")));
        }
    }
}
Example run:
% java CollapseConsecutiveCharsDemo "We had an awwweesssommmeeee dinner at sea resort" "We had an awesomeeee dinner at sea resort" "We had an awwesooomee dinner at sea resort"
variations("an") = {"an"}
variations("at") = {"at"}
variations("awesome") = {"awesomeeee", "awwesooomee", "awwweesssommmeeee"}
variations("diner") = {"dinner"}
variations("had") = {"had"}
variations("resort") = {"resort"}
variations("sea") = {"sea"}
variations("we") = {"we"}
For a more elaborate solution, you could tokenize your documents with the Stanford CoreNLP tokenizer, which handles punctuation correctly (whitespace tokenization would treat "resort." as a distinct term from "resort"), and combine it with spelling correction, such as with liblevenshtein. Spelling correction also recovers terms that collapsing mangles, like "dinner", which becomes "diner" in the output above.
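As a minimal sketch of the tokenization step, assuming Stanford CoreNLP's PTBTokenizer API (the TokenizeDemo wrapper below is illustrative, not a drop-in solution):

import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeDemo {
    public static void main(final String... documents) {
        for (final String document : documents) {
            // PTBTokenizer splits on punctuation as well as whitespace,
            // so "resort." yields the tokens "resort" and ".".
            final PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                new StringReader(document.toLowerCase()),
                new CoreLabelTokenFactory(),
                "");
            while (tokenizer.hasNext()) {
                final CoreLabel token = tokenizer.next();
                // Collapse each token as before; in a full solution you would
                // hand the collapsed token to a spelling corrector (e.g. a
                // liblevenshtein transducer) rather than print it.
                System.out.println(CollapseConsecutiveCharsDemo.collapse(token.word()));
            }
        }
    }
}

Each collapsed token would then be matched against a dictionary by the spelling corrector, which maps near-misses like "diner" back to terms such as "dinner" within a chosen edit distance.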