Levenshtein Distance on two files taking too much time

Question

I am new to programming and am building a file similarity finder that measures how similar two files are. So far, I read each file into a string and then compute the Levenshtein distance between the two strings.

The problem is the running time. Without the Levenshtein distance, execution takes 206 ms, which is spent converting the files to strings. With the Levenshtein distance, it takes a whopping 19504 ms.

That is nearly 95 times the time taken to convert the files to strings, which makes it the bottleneck in my project.

Any help would be appreciated. I am comfortable in C, C++, and Python. If you can point me to any source, I would be grateful.

Here is the C++ code for the function I am using for calculating Levenshtein distance:

//LEVENSHTEIN
int levenshtein(std::string a, std::string b){
  int len_a = a.length();
  int len_b = b.length();
  int d[len_a + 1][len_b+1];

  for(int i = 0; i < len_a + 1; i++)
    d[i][0] = i;

  for(int j = 0; j < len_b + 1; j++)
    d[0][j] = j;

  for(int i = 1; i < len_a + 1; i++){
    for(int j = 1; j < len_b + 1; j++){
      if(a[i - 1] == b[j - 1]){
        d[i][j] = d[i - 1][j - 1];
      }
      else{
        d[i][j] = 1 + min(min(d[i][j-1],d[i-1][j]),d[i-1][j-1]);
      }
    }
  }

  int answer = d[len_a][len_b];

  return answer;
}

I only have to compare two files, not more. I read about using a trie with Levenshtein distance, but that is useful for comparing multiple strings against one source. Apart from that, I haven't had much luck.

I would accept the answer to be in reference to python as well. I am open to making the program in both languages — gourgan, Jun 28 '20 at 14:49
Did you compile your program with compiler optimizations enabled? — Jesper Juhl, Jun 28 '20 at 14:50
That's very accepting of you. However, we are not going to write entire pieces of code for you in any particular language. If your question has no python in it, remove the tag. (c is a different language as well, remove that tag too). — cigien, Jun 28 '20 at 14:50
I do not expect entire pieces of code. I would be happy to be pointed towards any source which is aimed towards C, C++ or Python. — gourgan, Jun 28 '20 at 14:52
Your `algorithm` looks more like Needleman-Wunsch, which is quadratic. [but: the tail end is missing] — wildplasser, Jun 28 '20 at 14:56
How long are the strings? Your algorithm is `O(a.length() * b.length())`, which means the running time grows roughly in proportion to the length of `a` times the length of `b`. — MikeCAT, Jun 28 '20 at 14:56
That makes the question off-topic, since you are asking for recommendations for external resources. — cigien, Jun 28 '20 at 14:57
The algorithm you use is not the most efficient (only two rows are necessary, not the full matrix). But in any case Levenshtein distance cannot be computed faster than `O(n^2)`, so don't expect any significant improvement. There are ways to calculate [an approximate value](https://en.wikipedia.org/wiki/Levenshtein_distance#Approximation) in roughly linear time - see if that's good enough. — Igor Tandetnik, Jun 28 '20 at 15:00
[This program](https://godbolt.org/z/eE87Db) compiles, runs and prints output in a couple seconds, basically too fast for me to measure. There's no way it takes 19 seconds on inputs of 200 and 300 characters long. — Igor Tandetnik, Jun 28 '20 at 15:08
I am using Atom with the gpp-compiler package; I don't know much about compilers. — gourgan, Jun 28 '20 at 15:11
@gourgan By default most compilers will generate unoptimized (debug) binaries that make debugging easy but are slow. When measuring performance (or shipping your final product) you always want to turn on optimization (aka do a "release build"). How to do that differs from compiler to compiler. For `clang` or `gcc`, adding `-O2` to the compiler's command line is a good start. — Jesper Juhl, Jun 28 '20 at 15:42
@gourgan `int d[len_a + 1][len_b+1];` -- This is not valid C++. C++ requires array sizes to be denoted by compile-time constants, not runtime variables. Instead `std::vector<std::vector<int>> d(len_a + 1, std::vector<int>(len_b + 1));` -- Also, if those strings were longer than 200 or 300, you risk blowing out the stack memory using that non-standard C++ syntax. So what you posted is not Python, C++, or even C, since C doesn't have `std::string`. — PaulMcKenzie, Jun 28 '20 at 17:33
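
The suggestions in the comments above can be combined into one sketch: store the table in a heap-allocated std::vector instead of a variable-length array, pass the strings by const reference so they are not copied, and build with optimizations enabled (for gcc or clang, add -O2 to the command line). The name levenshteinMatrix is made up for illustration; the algorithm is otherwise the same full-matrix one as in the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical name. Same full-matrix algorithm as in the question,
// but the table lives in std::vector (heap) instead of a VLA (stack).
std::size_t levenshteinMatrix(const std::string& a, const std::string& b) {
    const std::size_t lenA = a.size();
    const std::size_t lenB = b.size();

    // (lenA + 1) x (lenB + 1) table of edit distances between prefixes
    std::vector<std::vector<std::size_t>> d(
        lenA + 1, std::vector<std::size_t>(lenB + 1));

    // Distance from a prefix to the empty string is the prefix length
    for (std::size_t i = 0; i <= lenA; ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= lenB; ++j) d[0][j] = j;

    for (std::size_t i = 1; i <= lenA; ++i) {
        for (std::size_t j = 1; j <= lenB; ++j) {
            if (a[i - 1] == b[j - 1]) {
                // Characters match: no extra edit needed
                d[i][j] = d[i - 1][j - 1];
            } else {
                // Insertion, deletion, or substitution, whichever is cheapest
                d[i][j] = 1 + std::min({d[i][j - 1], d[i - 1][j], d[i - 1][j - 1]});
            }
        }
    }
    return d[lenA][lenB];
}
```

Called as levenshteinMatrix("kitten", "sitting"), this returns 3. The table still needs memory proportional to the product of the two lengths, so for large files the two-row approach mentioned by Igor Tandetnik is the better fit.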

score 1 · Answer 1 · answered Jun 28 '20 at 18:34

I will show you a C++ solution. The language used is C++17. Compiler is MS Visual Studio Community 2019. Compiled in Release mode with all optimizations on.

I created two files with 1000 words each using a "Lorem ipsum" generator. Each file is ~6 kB in size.

The result is available in a blink of an eye.

I am using a slightly modified Levenshtein function with more readable variable names. I do not use a VLA (Variable Length Array), because that is not valid in C++. I use a std::vector instead, which also offers more functionality.

In main, we can see the driver code. First, we open the 2 input files and check whether they could be opened. If not, we show an error message and quit the program.

Then we read the 2 text files into 2 strings, using the std::string range constructor and std::istreambuf_iterator. I do not know of any simpler way to read a complete text file into a std::string.

Then we print the resulting Levenshtein distance.

Please see the code below:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator>

// Distance between 2 strings
size_t levensthein(const std::string& string1, const std::string& string2)
{
    // First get the string lengths
    const size_t lengthString1{ string1.size() };
    const size_t lengthString2{ string2.size() };

    // If one of the strings is empty, return the length of the other
    // This results in 0 if both strings are empty
    if (lengthString1 == 0) return lengthString2;
    if (lengthString2 == 0) return lengthString1;

    // Initialize substitution cost vector
    std::vector<size_t> substitutionCost(lengthString2 + 1);
    std::iota(substitutionCost.begin(), substitutionCost.end(), 0);

    // Calculate substitution cost
    for (size_t indexString1{}; indexString1 < lengthString1; ++indexString1) {
        substitutionCost[0] = indexString1 + 1;
        size_t corner{ indexString1 };

        for (size_t indexString2{}; indexString2 < lengthString2; ++indexString2) {
            size_t upper{ substitutionCost[indexString2 + 1] };
            if (string1[indexString1] == string2[indexString2]) {
                substitutionCost[indexString2 + 1] = corner;
            }
            else {
                const size_t temp = std::min(upper, corner);
                substitutionCost[indexString2 + 1] = std::min(substitutionCost[indexString2], temp) + 1;
            }
            corner = upper;
        }
    }
    return substitutionCost[lengthString2];
}

// Put in your filenames here
const std::string fileName1{ "text1.txt" };
const std::string fileName2{ "text2.txt" };

int main() {

    // Open first file and check, if it could be opened
    if (std::ifstream file1Stream{ fileName1 }; file1Stream) {

        // Open second file and check, if it could be opened
        if (std::ifstream file2Stream{ fileName2 }; file2Stream) {

            // Both files are open now, read them into strings
            std::string stringFile1(std::istreambuf_iterator<char>(file1Stream), {});
            std::string stringFile2(std::istreambuf_iterator<char>(file2Stream), {});

            // Show Levenshtein distance on screen
            std::cout << "Levenshtein distance is: " << levensthein(stringFile1, stringFile2) << '\n';
        }
        else {
            std::cerr << "\n*** Error. Could not open input file '" << fileName2 << "'\n";
        }
    }
    else {
        std::cerr << "\n*** Error. Could not open input file '" << fileName1 << "'\n";
    }
    return 0;
}
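
Since the question asks how similar the two files are, the raw distance may be easier to interpret as a score between 0 and 1. The sketch below shows one common normalization, which is an assumption on my part and not part of the answer above: divide the distance by the length of the longer string. The names editDistance and similarity are hypothetical; editDistance is a compact single-row variant included only so the block stands on its own.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: Levenshtein distance using a single row of the table.
std::size_t editDistance(const std::string& s1, const std::string& s2) {
    std::vector<std::size_t> row(s2.size() + 1);
    std::iota(row.begin(), row.end(), std::size_t{0});  // distances from ""

    for (std::size_t i = 0; i < s1.size(); ++i) {
        std::size_t corner = row[0];  // diagonal neighbour from the previous row
        row[0] = i + 1;
        for (std::size_t j = 0; j < s2.size(); ++j) {
            const std::size_t upper = row[j + 1];  // same column, previous row
            row[j + 1] = (s1[i] == s2[j])
                ? corner
                : 1 + std::min({upper, corner, row[j]});
            corner = upper;
        }
    }
    return row[s2.size()];
}

// Hypothetical helper: 1.0 means identical, 0.0 means every position differs.
// Assumed normalization: distance divided by the longer string's length.
double similarity(const std::string& s1, const std::string& s2) {
    const std::size_t longest = std::max(s1.size(), s2.size());
    if (longest == 0) return 1.0;  // two empty strings are identical
    return 1.0 - static_cast<double>(editDistance(s1, s2))
               / static_cast<double>(longest);
}
```

With the strings read in main, similarity(stringFile1, stringFile2) would give the score. The distance can never exceed the length of the longer string, so the result always stays within [0, 1].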

score 0 · Answer 2 · answered Jun 28 '20 at 14:55

There's a package called nltk. Check it out.

from nltk import distance
print(distance.edit_distance('aa', 'ab'))

Output:

1

Balaji Ambresh
