
Damerau-Levenshtein (DL) distance works like this:

"abcd", "aacd" => 1 DL distance
"abcd", "aadc" => 2 DL distance

I can use the pyxDamerauLevenshtein module in Python to determine the DL distance between two words. I would like to write a generator function that produces every word within a given DL distance of a given keyword. I only deal with DL distances of 1 or 2.

Is there any tool in Python that I can use to generate the words within a given DL distance of a word?

Kroy

2 Answers


Look at Peter Norvig's article, How to Write a Spelling Corrector.

It contains the exact code that you need:

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
skovorodkin
  • Thanks! I evaluated the generated result using the `damerau_levenshtein_distance` function from `pyxdameraulevenshtein`, which returns the DL distance. `edits1()` works fine. `edits2()` generated a really huge list that contains some output with a DL distance greater than 2, for example `abcd-boacd`. I think the `edits2` method is not working perfectly. Or am I wrong, and do these outputs belong within DL distance 2? – Kroy Oct 04 '16 at 18:29
  • It does. `boacd` →delete→ `bacd` →transpose→ `abcd`. – skovorodkin Oct 04 '16 at 18:33
  • Then can we say `damerau_levenshtein_distance` is not correct? – Kroy Oct 04 '16 at 18:48
  • From `damerau_levenshtein_distance` docstring: This implements the "optimal string alignment distance" algorithm, as described by Wikipedia here: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance. And a quote from Wikipedia's article: Adding transpositions adds significant complexity. – skovorodkin Oct 04 '16 at 18:54
  • So Peter Norvig's code generates all possible words in a given distance, but `damerau_levenshtein_distance` may have another "opinion" on those words' distances. If you need the resulting list to conform to that algorithm, you can use `damerau_levenshtein_distance` and just filter out words with distance > 2. – skovorodkin Oct 04 '16 at 18:58
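The filtering idea from the comments above can be sketched as follows. This is a minimal sketch, not code from the thread: `osa_distance` is a hypothetical pure-Python stand-in for the optimal string alignment distance (so the example does not depend on `pyxdameraulevenshtein`), and `words_within_osa` is an illustrative name. It generates candidates with Norvig-style edits and keeps only those whose OSA distance is within the limit.

```python
def edits1(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """Norvig-style single edits: delete, transpose, replace, insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def osa_distance(a, b):
    """Optimal string alignment distance (hypothetical helper, standard DP)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # A transposition only counts if the swapped pair is adjacent in both strings.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def words_within_osa(word, max_distance=2):
    """Apply edits1 max_distance times, then keep words whose OSA distance agrees."""
    candidates = {word}
    for _ in range(max_distance):
        candidates |= {e for w in candidates for e in edits1(w)}
    return {w for w in candidates if 0 < osa_distance(word, w) <= max_distance}
```

With this sketch, `boacd` is produced by two rounds of `edits1("abcd")` but is dropped by the filter, because its OSA distance from `abcd` is 3: the `b` and `a` being swapped are not adjacent in `boacd`.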

The `edits1` and `edits2` functions above return words within 1 or 2 edits of the input, not words that are exactly 1 or 2 edits away. I wrote these three functions; `all_strings_editx` returns the strings (in arbitrary order) that are exactly x edits away from the input string.

def all_strings_within_edit1(sequence, bases='ATCG'):
    """
    All edits that are one edit away from `sequence`
    using a dictionary of bases.

    Parameters
    ----------
    sequence: str
    bases: str

    Returns
    -------
    sequences: list of str

    """
    splits = [(sequence[:i], sequence[i:]) for i in range(len(sequence) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    # In the original code, transpose counts one edit distance
    # We count it as two edit distances, so it's not included here
    # transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in bases]
    inserts = [L + c + R for L, R in splits for c in bases]
    return deletes + replaces + inserts


def all_strings_within_editx(sequence, bases='ATCG', edit_distance=1):
    """
    Return all strings within a given edit distance of `sequence`

    Parameters
    ----------
    sequence: str
    bases: str
    edit_distance: int

    Returns
    -------
    sequences: set of str

    """
    if edit_distance == 0:
        return {sequence}
    elif edit_distance == 1:
        return set(all_strings_within_edit1(sequence, bases=bases))
    else:
        return set(
            e2 for e1 in all_strings_within_editx(
                sequence, bases=bases, edit_distance=edit_distance-1)
            for e2 in all_strings_within_edit1(e1, bases=bases)
        )
    

def all_strings_editx(sequence, bases='ATCG', edit_distance=1):
    """
    Return all strings exactly a given edit distance away from `sequence`

    Parameters
    ----------
    sequence: str
    bases: str
    edit_distance: int

    Returns
    -------
    result: generator of str

    """
    if edit_distance == 0:
        return [sequence]
    all_editx_minus1 = all_strings_within_editx(
        sequence, bases=bases, edit_distance=edit_distance-1)
    return (
        e2 for e1 in all_editx_minus1
        for e2 in all_strings_within_edit1(e1, bases=bases)
        if e2 not in all_editx_minus1
    )
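As a variation on the functions above, here is a minimal sketch that treats a transposition as a single edit (as in the DL distance the question asks about) and returns a set, so no string appears twice. The names `exactly_k_edits` and `alphabet` are hypothetical. It grows the result layer by layer and discards anything already reached in fewer edits, assuming the target strings only use letters from `alphabet` or from the input word.

```python
def edits1(word, alphabet):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def exactly_k_edits(word, k, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return the set of strings first reached after exactly k edits of `word`."""
    seen = {word}       # everything reachable in fewer edits so far
    frontier = {word}   # strings reached for the first time in the current round
    for _ in range(k):
        frontier = {e for w in frontier for e in edits1(w, alphabet)} - seen
        seen |= frontier
    return frontier


one_away = exactly_k_edits("abcd", 1)
two_away = exactly_k_edits("abcd", 2)
print(len(one_away), len(two_away))
```

Because each round subtracts everything already seen, the sets for different values of `k` never overlap; for example, `aacd` is only in `one_away` and `aadc` is only in `two_away`.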
david190810