
P.S. If there are different weights for addition, replacement, and deletion, is there any algorithm that could help me?

Or, what modifications does the Wagner–Fischer algorithm need in order to minimize the edit distance when the weights for addition/deletion and replacement are different?

dhruvsharma
  • You can modify the [Wagner-Fischer algorithm](http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm#Possible_improvements) to use linear space if you only care about the edit distance and not the actual sequence of edits. – Nemo Oct 06 '14 at 18:07
  • Edit distance space requirement is O(N). I don't see you being able to reduce it to less than that. – Sergey Kalinichenko Oct 06 '14 at 18:07
  • @Nemo Can you tell me other names for the Wagner–Fischer algorithm? – dhruvsharma Oct 06 '14 at 18:18
  • Note that it's possible to get the actual sequence of edits in linear space and the usual running time: http://web.engr.illinois.edu/~jeffe/teaching/algorithms/notes/06-sparsedynprog.pdf – David Eisenstat Oct 06 '14 at 18:24

3 Answers


The best option I know of so far is the Levenshtein distance; you can also have a look at this publication. Hope it helps :)
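
For reference, a minimal sketch of how separate weights fit into the Wagner–Fischer table. The function name `weighted_edit_distance`, its parameter names, and the default costs (2, 3, 5) are assumptions chosen for illustration; they are not taken from the publication mentioned above. The only changes from the unit-cost version are the weighted first row and column and the per-operation cost added in the recurrence.

```python
def weighted_edit_distance(source, target, c_sub=2, c_ins=3, c_del=5):
    """Full-table Wagner-Fischer with separate (illustrative) operation costs."""
    m, n = len(source), len(target)
    # d[i][j] = cheapest way to turn source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * c_del          # delete every character of source[:i]
    for j in range(1, n + 1):
        d[0][j] = j * c_ins          # insert every character of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if same else c_sub),  # match or substitute
                d[i][j - 1] + c_ins,                       # insert target[j-1]
                d[i - 1][j] + c_del,                       # delete source[i-1]
            )
    return d[m][n]


print(weighted_edit_distance("kitten", "sitting"))  # 7: two substitutions, one insertion
```

With unequal insertion and deletion costs the distance is no longer symmetric: the reverse direction, "sitting" to "kitten", needs a deletion instead of an insertion.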

Sameer Shemna

I don't know if you are aware, but since each row in the edit distance DP table depends only on the previous one, you can keep only the last two rows. This way you achieve O(n) space complexity, instead of the O(n^2) of the naïve implementation.

Example in Python (assuming cost 2 for replacement, 3 for addition and 5 for deletion):

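def levenshtein(s1, s2):
    # Weights: 2 for replacement, 3 for addition, 5 for deletion.
    A = [0]*(len(s2)+1)                    # row being filled
    B = [3*j for j in range(len(s2)+1)]    # previous row; row 0 is j additions
    for i, c1 in enumerate(s1, 1):
        A[0] = 5*i                         # column 0 is i deletions
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                A[j] = B[j-1]              # match: no cost
            else:
                A[j] = min(B[j-1]+2, A[j-1]+3, B[j]+5)
        A,B = B,A                          # the filled row becomes the previous row

    return B[len(s2)]

print(levenshtein('kitten', 'sitting'))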
Juan Lopes
  • Great! You can actually tweak it a little more and get away with replacing one of the integer arrays with a couple of integer variables. You can also guarantee the array is min(|s1|,|s2|) in size. This is worthwhile if one string is much larger than the other. – Gene Oct 06 '14 at 19:15
  • What if I have different weights for addition/deletion and replacement, and the total weight is limited? Can you suggest an approach? – dhruvsharma Oct 06 '14 at 19:33
  • My code already handles different weights for addition, deletion and replacement. And I can't see any problem with the total weight being limited, as the weight never decreases as the algorithm progresses. – Juan Lopes Oct 06 '14 at 19:47
  • Can you please explain how it handles the different weights? – dhruvsharma Oct 06 '14 at 20:31
  • This line: `min(B[j-1]+2, A[j-1]+3, B[j]+5)`, 2 is the weight for replacement, 3 for addition, 5 for deletion. – Juan Lopes Oct 06 '14 at 21:27
  • Might your initial weights be wrong? E.g., should A[0] be i*5, and something similar for the initial B, such as range(len(s2)+1)*3 (guessing at the syntax)? – philcolbourn Nov 14 '18 at 15:18
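
The suggestions in these comments can be combined into one hedged sketch: a single row plus a couple of scalar variables, a row sized to the shorter string, weighted initial values, and an optional cap on the total weight. The name `weighted_distance` and the `limit` parameter are hypothetical, and the default costs are the ones from the answer. Swapping the strings is only valid here because the substitution cost is a single constant; the insertion and deletion costs have to be swapped along with them.

```python
def weighted_distance(s1, s2, sub=2, ins=3, dele=5, limit=None):
    """Weighted edit distance from s1 to s2 using one row of min(len)+1 cells.

    Returns None when `limit` is given and the distance exceeds it.
    All names and defaults here are illustrative assumptions.
    """
    # Keep the shorter string along the row. Turning s2 into s1 costs the same
    # as turning s1 into s2 once insertion and deletion costs trade places.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
        ins, dele = dele, ins
    row = [j * ins for j in range(len(s2) + 1)]   # row 0: j insertions
    for i, c1 in enumerate(s1, 1):
        diag = row[0]                  # value of cell (i-1, 0)
        row[0] = i * dele              # column 0: i deletions
        for j, c2 in enumerate(s2, 1):
            up = row[j]                # value of cell (i-1, j), about to be overwritten
            if c1 == c2:
                row[j] = diag          # match: no cost
            else:
                row[j] = min(diag + sub, row[j - 1] + ins, up + dele)
            diag = up
        # Every alignment passes through this row and costs are non-negative,
        # so the final distance cannot be below the smallest entry of the row.
        if limit is not None and min(row) > limit:
            return None
    if limit is not None and row[-1] > limit:
        return None
    return row[-1]


print(weighted_distance("kitten", "sitting"))           # 7
print(weighted_distance("kitten", "sitting", limit=6))  # None
```

The early exit only saves time; the answer is the same as running to completion and comparing the result against the limit.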

I modified my implementation of Wagner–Fischer to take weights for insertion, deletion and substitution (`cins`, `cdel`, `csub`).

#!/bin/bash

set -f

# based on https://github.com/osteslag/Changeset/blob/master/Sources/Changeset.swift
# https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

declare A="the quick brown fox jumps  over a lazy dog"
declare B="the quick brown fox jumped        lazy dog and a brown and black dog"

declare RET

join(){
  local A="$1"
  RET="$A"
  RET="${RET//[?] [?]/ }"
  RET="${RET//[+] [+]/ }"
  RET="${RET//[-] [-]/ }"
  return 0
}

edits(){
  local  -a    A=("" ${1//[^a-zA-Z0-9]/ })  # remove all punctuation FIXME: could surround with space
  local  -a    B=("" ${2//[^a-zA-Z0-9]/ })
  local -i  rows=${#A[*]}
  local -i  cols=${#B[*]}
  local  -a pROW=("")
  local  -a cROW
  local -ia pdst=(0)
  local -ia cdst=(0)
  local -i  min r c
  local -i  ins cins=1
  local -i  del cdel=1
  local -i  sub csub=10

  # fill first row of insertions

  for((c=1; c<cols; c++)); do         # for each target +1
      pROW[c]="${pROW[c-1]} +\e[42m${B[c]}\e[m+"
    ((pdst[c]=c*cins))
  done

  ((rows==0)) && return 1

  for((r=1; r<rows; r++)); do

    # first column are deletions to get ""

      cROW[0]="${pROW[0]} -\e[41m${A[r]}\e[m-"
    ((cdst[0]=pdst[0]+cdel))

    ((cols>0)) && {

      #  X  0  T1  T2  T3
      #  0  0  i   i   i
      # S1  d  
      # S2  d

      #     c-1 c
      # r-1 SUB DEL
      # r   INS 

      for((c=1; c<cols; c++)); do
        if [[ "${A[r]}" = "${B[c]}" ]]; then  # source and target match - no operation
            cROW[c]="${pROW[c-1]} ${A[r]}"
          ((cdst[c]=   pdst[c-1]))
        else
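          # NOTE: min below is taken over the neighbouring distances before the
          # operation cost is added, so with unequal weights the chosen edit is
          # not guaranteed to be the cheapest one.
          ((ins=cdst[c-1], sub=pdst[c-1], del=pdst[c] ))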
          ((min= (del<=ins) ? ((del<=sub)?del:sub) : ins))
            if ((del==min)); then ((cdst[c]=min+cdel)); cROW[c]="${pROW[c  ]} -\e[41m${A[r]}\e[m-"
          elif ((ins==min)); then ((cdst[c]=min+cins)); cROW[c]="${cROW[c-1]} +\e[42m${B[c]}\e[m+"
          else                    ((cdst[c]=min+csub)); cROW[c]="${pROW[c-1]} ?\e[41m${A[r]}\e[m-\e[42m${B[c]}\e[m?"
          fi
          #((cdst[c]=min+1))
        fi
      done
    }
    pROW=("${cROW[@]}")
    pdst=( ${cdst[*]} )
  done

#  printf "%s "    "${pROW[@]}"    ; printf "\n"
#  printf "%s\n\n" "${pROW[@]: -1}"
  join "${pROW[*]: -1}"
  printf "%b\n\n" "$RET"
  return 0
}

edits "s e t t i n g" "k i t t e n"

edits "$A" "$B"
exit 0

Outputs:

$ ./pc-wf.bash
 -s e- +k i+ t t -i n- +e+ ?g-n?

 the quick brown fox -jumps over- +jumped lazy dog and+ a +brown and+ ?lazy-black? dog
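
As far as I can tell, the script above takes `min` over the neighbouring distances before the operation cost is added, which is why a substitution still shows up in the output even with `csub=10`. Below is a sketch of the same word-level idea in Python that compares the totals after adding each cost and then walks the table backwards to recover one edit script. The function name `word_edits` and the tuple format of the operations are assumptions for illustration; the default weights mirror the script.

```python
def word_edits(source, target, cins=1, cdel=1, csub=10):
    """Return (cost, ops) for turning the words of source into those of target."""
    a, b = source.split(), target.split()
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cdel      # first column: deletions only
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cins      # first row: insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = 0 if a[i - 1] == b[j - 1] else csub
            d[i][j] = min(d[i - 1][j - 1] + step,
                          d[i][j - 1] + cins,
                          d[i - 1][j] + cdel)
    # Walk back from the bottom-right corner to recover one optimal script.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        step = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else csub
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + step:
            ops.append(("keep" if step == 0 else "sub", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + cins:
            ops.append(("ins", None, b[j - 1]))
            j -= 1
        else:
            ops.append(("del", a[i - 1], None))
            i -= 1
    ops.reverse()
    return d[m][n], ops


cost, ops = word_edits("s e t t i n g", "k i t t e n")
print(cost)  # 7: with csub=10, a deletion plus an insertion (cost 2) always beats a substitution
```

When several scripts share the minimum cost, which one is returned depends on the order of the checks in the backward walk.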
philcolbourn