I have written a very simple C# program that uses a NuGet package to read in two CSV files, fuzzy match them, and output a new CSV file with all the matches. The problem is that I need the program to read a file of up to 700k and compare it against one of 100k, and I haven't been able to find a way to speed up the process. Is there any way I can do this? I will even use another language if need be.

You can ignore all the commented-out code; it's just left over from testing. Sorry, I'm a newer programmer.

The ReadCSV function reads in a CSV file. The rest of the code is from inside another function, where I pass in the string arrays and run them through the fuzzy match.

static string[] ReadCSV(string path)
{
    List<string> name = new List<string>();
    List<string> address = new List<string>();
    List<string> city = new List<string>();
    List<string> state = new List<string>();
    List<string> zip = new List<string>();

    using (var reader = new StreamReader(path))
    {
        reader.ReadLine();
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            var values = line.Split(',');

            name.Add(values[0] +", "+ values[1]);
            //address.Add(values[1]);
            //city.Add(values[2]);
            //state.Add(values[3]);
            //zip.Add(values[4]);

        }
    }

    string[] name1 = name.ToArray();

    return name1;
    //foreach (var item in name)
    //{
    //    Console.WriteLine(item.ToString());
    //}
}


    StringBuilder csvcontent = new StringBuilder();
    string csvpath = @"C:\Users\bigel\Documents\outputtest.csv";
    csvcontent.AppendLine("Name,Address,Match");

    //Console.WriteLine("Levenshtein Edit Distance:");
    int x = 1;
    foreach (var name in string1)
    {
        for (int i = 0; i < length; i++)
        {
            int leven = match[i].LevenshteinDistance(name);
            //Console.WriteLine(match[i] + "\t{0} against {1}", leven, name);
            if (leven <= 7)
            {
                output[i] = input[i] + ",match";
                csvcontent.AppendLine(output[i]);

                //Console.WriteLine(match[i] + " " + leven + " against " + name + " is a Match");
                //Console.WriteLine(output[i]);
            }
            else
            {
                if (i == 500)
                {

                    Console.WriteLine(x);
                    x++;

                }
            }
        }

    }
    File.AppendAllText(csvpath, csvcontent.ToString());
jbigs89
  • How are you defining a 'fuzzy match'? Can you show your code? – stuartd Mar 06 '20 at 21:55
  • There is no code to speed up. Is this a game for us to beat your imaginary code? – TheGeneral Mar 06 '20 at 22:19
  • Updated with code – jbigs89 Mar 06 '20 at 22:56
  • Since there is no answer yet, a hint: try adding parallelization. Assuming that the call to the fuzzy matching is the most expensive operation, try converting the topmost foreach loop into [PLINQ](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/parallel-linq-plinq). Depending on how it's written, be wary that locks may be required to combine the results into a common structure. First, though, you may want to profile the code (using the Stopwatch class) to see how long it takes to parse vs. match vs. write the file. – crokusek Mar 07 '20 at 19:01
  • Is your LevenshteinDistance calculator using the [Damerau–Levenshtein algorithm](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) that allows for adjacent transpositions or the restricted edit distance one? The former adds significant complexity, and thus processing time. – stuartd Mar 08 '20 at 15:30
  • As Wikipedia says _"Adding transpositions adds significant complexity. The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second one presents no such restriction"_ - which one you need to use depends on your use case. – stuartd Mar 08 '20 at 15:30
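
The suggestions in the comments could be combined roughly as follows. This is only a sketch under stated assumptions: `FuzzySketch`, `BoundedDistance` and `FindMatches` are hypothetical names, the distance function is a hand-written plain Levenshtein (no transpositions) standing in for whatever the NuGet package does, and the cutoff is passed in so that the `leven <= 7` check from the question can stop a comparison as soon as the limit is exceeded. The outer loop uses `Parallel.For` with a `ConcurrentBag` so no explicit locking is needed. It assumes a C# version with tuple support (C# 7 or later).

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class FuzzySketch
{
    // Plain Levenshtein distance that gives up early.
    // Returns the exact distance if it is <= max, otherwise max + 1.
    public static int BoundedDistance(string a, string b, int max)
    {
        // The distance can never be smaller than the difference in length.
        if (Math.Abs(a.Length - b.Length) > max) return max + 1;

        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                if (curr[j] < rowMin) rowMin = curr[j];
            }
            // The final distance is at least the smallest value in any row,
            // so once a whole row exceeds max the result must exceed it too.
            if (rowMin > max) return max + 1;

            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return Math.Min(prev[b.Length], max + 1);
    }

    // Compares every entry of 'left' with every entry of 'right' in parallel
    // and collects the pairs that are within 'max' edits of each other.
    public static (string Left, string Right)[] FindMatches(
        string[] left, string[] right, int max)
    {
        var matches = new ConcurrentBag<(string Left, string Right)>();
        Parallel.For(0, left.Length, i =>
        {
            foreach (string candidate in right)
            {
                if (BoundedDistance(left[i], candidate, max) <= max)
                    matches.Add((left[i], candidate));
            }
        });
        return matches.ToArray();
    }
}
```

With the arrays from the question, a call might look like `FuzzySketch.FindMatches(string1, match, 7)`. The order of the returned pairs is not deterministic because of the parallel loop, so sort them before writing the CSV if order matters. This still compares every pair; it only makes each comparison cheaper and spreads the work across cores.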

0 Answers