3

Here is what I am trying to do: I want to measure the Levenshtein distance between two strings, using bash. I found an implementation of the LD here.

Now, suppose that I have some toy data like so:

1    The brown fox jumped    The green fox jumped
0    The red fox jumped    The green fox jumped
1    The gray fox jumped    The green fox jumped

and let's say that this is stored in data.test.

Then I put it through a simple awk command which keeps only the rows which start with 1, like so:

awk -F '\t' '{if ($1>0) print $2 "\t" $3}' data.test

The first output from this simple command will then be:

The brown fox jumped    The green fox jumped

I now want to measure the Levenshtein distance between these two sentences, by piping this output directly to this function (lifted from the above link):

function levenshtein {
    if (( $# != 2 )); then
        echo "Usage: $0 word1 word2" >&2
    elif (( ${#1} < ${#2} )); then
        levenshtein "$2" "$1"
    else
        local str1len=${#1}
        local str2len=${#2}
        local d

        # Row stride of the flattened (str1len+1) x (str2len+1) matrix
        local width=$(( str1len + 1 ))

        for i in $( seq 0 $(( width*(str2len+1) - 1 )) ); do
            d[i]=0
        done

        for i in $( seq 0 $str1len );   do
            d[i]=$i
        done

        for j in $( seq 0 $str2len );   do
            d[j*width]=$j
        done

        for j in $( seq 1 $str2len ); do
            for i in $( seq 1 $str1len ); do
                [ "${1:i-1:1}" = "${2:j-1:1}" ] && local cost=0 || local cost=1
                del=$(( d[(i-1)+width*j]+1 ))
                ins=$(( d[i+width*(j-1)]+1 ))
                alt=$(( d[(i-1)+width*(j-1)]+cost ))
                d[i+width*j]=$( echo -e "$del\n$ins\n$alt" | sort -n | head -1 )
            done
        done
        echo ${d[str1len+width*str2len]}
    fi
}

I know this is possible, but I am stuck on two things: the function needs two arguments passed to it, and the arguments I am passing are sequences of words rather than single words.

I have tried using various versions of this suggestion, which advocates grabbing the input as such:

function levenshtein {
    # Grab input.
    declare input1=${1:-$(</dev/stdin)};
    declare input2=${2:-$(</dev/stdin)};
.
.
.
}

This is the part I cannot quite get to work.

Astrid
  • It'd be much faster in awk. A quick google produces https://rosettacode.org/wiki/Levenshtein_distance#AWK, https://awkology.wordpress.com/2012/01/23/levenshtein-distance/, and others. – Ed Morton May 09 '19 at 18:56
  • Can I pipe stuff into the awk version as demoed in the question? – Astrid May 09 '19 at 22:23
  • I don't see it being demoed in the question but yes, awk can read input from a pipe or a file. – Ed Morton May 10 '19 at 00:02
  • Oh fair, I meant to say the "demoed intent" of my question i.e. just piping the output to a function. – Astrid May 13 '19 at 08:13

3 Answers

7

You don't need awk at all:

while IFS=$'\t' read -r num first second; do
    [[ $num -gt 0 ]] || continue
    levenshtein "$first" "$second"
done < data.test

(True, awk is faster at processing a large file than bash, but if you are implementing the Levenshtein algorithm in bash in the first place, speed is probably not a concern.)
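
Since the question is specifically about piping awk's output into the function, the same read loop can also sit at the end of a pipeline. This is only a sketch: the compact levenshtein below is a stand-in so that the example runs on its own (any two-argument implementation could be swapped in), and filter_and_measure is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Stand-in implementation (two rolling rows instead of a full matrix).
levenshtein() {
  local s=$1 t=$2 i j cost m
  local -a prev cur
  for ((j = 0; j <= ${#t}; j++)); do prev[j]=$j; done
  for ((i = 1; i <= ${#s}; i++)); do
    cur=("$i")
    for ((j = 1; j <= ${#t}; j++)); do
      [[ ${s:i-1:1} == "${t:j-1:1}" ]] && cost=0 || cost=1
      m=$(( prev[j] + 1 ))                                      # deletion
      (( cur[j-1] + 1 < m )) && m=$(( cur[j-1] + 1 ))           # insertion
      (( prev[j-1] + cost < m )) && m=$(( prev[j-1] + cost ))   # substitution
      cur[j]=$m
    done
    prev=("${cur[@]}")
  done
  echo "${prev[${#t}]}"
}

# Hypothetical helper: reads flag<TAB>sentence<TAB>sentence lines on stdin,
# lets awk select the rows, then splits each remaining line on the tab.
filter_and_measure() {
  awk -F '\t' '$1 > 0 {print $2 "\t" $3}' |
  while IFS=$'\t' read -r first second; do
    levenshtein "$first" "$second"
  done
}

# The toy data from the question, generated inline instead of read from a file.
printf '%s\t%s\t%s\n' \
  1 'The brown fox jumped' 'The green fox jumped' \
  0 'The red fox jumped'   'The green fox jumped' \
  1 'The gray fox jumped'  'The green fox jumped' |
filter_and_measure
```

The key point is that read splits one tab-separated line into both arguments, so nothing has to be read from /dev/stdin twice.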


As an aside, here is a simpler (though minimally tested) implementation that avoids most of the index arithmetic by using an associative array with "tuples" as keys.

levenshtein () {
  if (( ${#1} < ${#2} )); then
    levenshtein "$2" "$1"
    return
  fi

  local str1len str2len cost m a b i j del ins alt
  local -A d

  str1len=${#1}
  str2len=${#2}

  # Base cases: turning a prefix into the empty string costs its length
  for ((i=0;i<=str1len;i++)); do
    d[$i,0]=$i
  done

  for ((j=0;j<=str2len;j++)); do
    d[0,$j]=$j
  done

  for ((j=1; j<=str2len; j++)); do
    for ((i=1; i<=str1len; i++)); do
      a=${1:i-1:1}
      b=${2:j-1:1}
      [ "$a" = "$b" ] && cost=0 || cost=1
      del=$(( ${d[$((i-1)),$j]} + 1 ))
      ins=$(( ${d[$i,$((j-1))]} + 1 ))
      alt=$(( ${d[$((i-1)),$((j-1))]} + cost ))

      # Compute the min without forking
      m=$del; ((ins < m)) && m=$ins; ((alt < m)) && m=$alt

      d[$i,$j]=$m
    done
  done
  echo "${d[$str1len,$str2len]}"
}
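
To make the "tuples as keys" idea concrete: bash has no real two-dimensional arrays, but an associative array will accept the string "row,col" as a key. A small illustration, assuming bash 4 or later; the grid, row and col names are made up for this example:

```shell
# Associative arrays need bash 4 or later.
declare -A grid

# Fill a 2x3 table; each key is just the literal string "row,col".
for ((row = 0; row < 2; row++)); do
  for ((col = 0; col < 3; col++)); do
    grid[$row,$col]=$(( row * 3 + col ))
  done
done

echo "${grid[1,2]}"    # cell at row 1, column 2
echo "${#grid[@]}"     # number of cells stored
```

Because the key is a plain string, no stride or offset arithmetic is needed to find a cell, which is what keeps the function above short.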
codeforester
chepner
  • Picking up on your comment regarding speed, what would you recommend for that? I am doing it in bash because I reckon it is a lot quicker than python, but that is basically the only reason. Would something else be faster? (also, great answer) – Astrid May 13 '19 at 08:11
  • Why do you think `bash` is faster than Python? `bash` is intended to run other programs, not do calculations itself. – chepner May 13 '19 at 12:33
  • Well, doing this stuff in Pandas say, over hundreds of thousands of sentence pairs (several gb of data in my case) is going to and has crashed my computer before. And I hear people advocating bash as a substitute for large operations like that. – Astrid May 13 '19 at 12:34
  • You are probably running out of memory; you aren't crashing because Python is "slower" than `bash`. You are either misinterpreting what these "people" are saying, or you are listening to grossly misinformed people. – chepner May 13 '19 at 12:36
1

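If you export the function with export -f levenshtein before calling awk, you can call it from awk line by line: awk -F '\t' '$1>0 {system("levenshtein \""$2"\" \""$3"\"")}'. Note that system() runs its command through sh, so this only works where sh is bash, and the quoting breaks if a field contains a double quote or another character that is special inside double quotes.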

xash
1

My upvote goes to Chepner's answer, but if for some reason you find yourself stuck in a place where you actually need to solve this, that's not too hard either.

# Awk script refactored slightly for aesthetics
pair=$(awk -F '\t' '$1>0 {print $2 "\t" $3}' data.test)
levenshtein "${pair%$'\t'*}" "${pair#*$'\t'}"

To unpack this slightly:

  • The two arguments to levenshtein are in double quotes.
  • Each argument consists of a parameter substitution:
    • ${variable%pattern} yields the value of variable with any suffix which matches pattern removed
    • ${variable#pattern} yields the value of variable with any prefix which matches pattern removed
    • These both match the shortest possible pattern. If you have a string with multiple fields, you might need the ## or %% variants which trim the longest applicable pattern from the front or the back of the value, respectively.
  • $'\t' is a C-style string which contains a tab
  • The pattern also contains a * in front of or behind the tab to remove everything before or after the tab, as required to obtain just the first or the second value from the tab-separated string.
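
The expansions described in the list above can be tried in isolation. A small sketch with made-up sample values; the tab is kept in a variable here (an assumption of this example, not part of the answer's command) so the quoting stays easy to read:

```shell
# Hypothetical sample value: two sentences separated by a single tab.
pair=$'The brown fox jumped\tThe green fox jumped'
tab=$'\t'

first=${pair%"$tab"*}     # drop the shortest suffix matching <tab>*
second=${pair#*"$tab"}    # drop the shortest prefix matching *<tab>
printf '[%s]\n' "$first" "$second"

# With more than two fields, the greedy variants pick out the outer fields.
triple=$'a\tb\tc'
head_field=${triple%%"$tab"*}   # longest suffix removed, leaves the first field
tail_field=${triple##*"$tab"}   # longest prefix removed, leaves the last field
printf '[%s]\n' "$head_field" "$tail_field"
```

The * has to stay outside the quoted part of the pattern; inside quotes it would be matched as a literal asterisk rather than as a wildcard.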
tripleee