
The following code splits each line into words, stores the first word of every line in one list, the second word of every line in another list, and so on. It then selects the most frequent word from each list as the correct word.

Module Module1

Sub Main()
    Dim correctLine As String = ""
    Dim line1 As String = "Canda has more than ones official language"
    Dim line2 As String = "Canada has more than one oficial languages"
    Dim line3 As String = "Canada has nore than one official lnguage"
    Dim line4 As String = "Canada has nore than one offical language"

    Dim wordsOfLine1() As String = line1.Split(" ")
    Dim wordsOfLine2() As String = line2.Split(" ")
    Dim wordsOfLine3() As String = line3.Split(" ")
    Dim wordsOfLine4() As String = line4.Split(" ")


    For i As Integer = 0 To wordsOfLine1.Length - 1
        Dim wordAllLinesTemp As New List(Of String)(New String() {wordsOfLine1(i), wordsOfLine2(i), wordsOfLine3(i), wordsOfLine4(i)})
        Dim counts = From n In wordAllLinesTemp
                     Group n By n Into Group
                     Order By Group.Count() Descending
                     Select Group.First
        correctLine = correctLine & counts.First & " "
    Next
    correctLine = correctLine.Remove(correctLine.Length - 1)
    Console.WriteLine(correctLine)
    Console.ReadKey()

End Sub
End Module

My question: how can I make it work with lines that have different numbers of words? Each line here has exactly 7 words, and the For loop relies on that length (Length - 1). Suppose, for example, that line 3 contains only 5 words.

myahia
  • Sounds good. What is your question? – Nico Schertler Feb 28 '18 at 13:11
  • This is a tough problem. You need a matching algorithm between the words that preserves order and minimizes something like the [edit distance](https://en.wikipedia.org/wiki/Edit_distance), i.e. match `Canda` to `Canada` and so on. I did [something similar with time codes](https://nicoschertler.wordpress.com/2014/06/13/matching-error-prone-sequences-of-numbers-e-g-time-codes-to-each-other/). You could probably adapt the approach by exchanging the distance measure (and you don't need the shift optimization). – Nico Schertler Feb 28 '18 at 17:35
  • Thanks, Nico Schertler, for your reply. I know about the edit distance problem, but my question here is only about how to make the loop `For i As Integer = 0 To wordsOfLine1.Length - 1` work with lines that have different numbers of words. As you can see, in the above example each line contains exactly 7 words, so the line inside the loop, `Dim wordAllLinesTemp As New List(Of String)(New String() {wordsOfLine1(i), wordsOfLine2(i), wordsOfLine3(i), wordsOfLine4(i)})`, works perfectly, but if any line has a different number of words this line will raise an exception. – myahia Feb 28 '18 at 17:56
  • I know. This is why you need a matching of some sort. – Nico Schertler Feb 28 '18 at 18:46

1 Answer


EDIT: Accidentally had correctIndex where shortest should have been.

From what I can tell, you are trying to see which line is the closest to the correctLine.

You can get the Levenshtein distance using the following code:

Public Function LevDist(ByVal s As String,
                        ByVal t As String) As Integer
    Dim n As Integer = s.Length
    Dim m As Integer = t.Length
    ' VB array bounds are inclusive upper bounds, so (n, m) gives an (n + 1) x (m + 1) table.
    Dim d(n, m) As Integer

    If n = 0 Then
        Return m
    End If

    If m = 0 Then
        Return n
    End If

    Dim i As Integer
    Dim j As Integer

    For i = 0 To n
        d(i, 0) = i
    Next

    For j = 0 To m
        d(0, j) = j
    Next

    For i = 1 To n
        For j = 1 To m

            Dim cost As Integer
            If t(j - 1) = s(i - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1),
                               d(i - 1, j - 1) + cost)
        Next
    Next

    Return d(n, m)
End Function
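As a quick sanity check of the distance on single words from the sample lines, here is a minimal, self-contained sketch. `LevDistTwoRow` and the module name are hypothetical; the function is a two-row variant of the same recurrence (it keeps only the previous and current rows of the table), not the function above.

```vbnet
Imports System

Module LevDistCheck

    ' Hypothetical two-row variant of the Levenshtein distance: only the
    ' previous and current rows of the table are kept in memory.
    Function LevDistTwoRow(s As String, t As String) As Integer
        Dim prev(t.Length) As Integer
        Dim curr(t.Length) As Integer

        ' Distance from the empty prefix of s to each prefix of t.
        For j As Integer = 0 To t.Length
            prev(j) = j
        Next

        For i As Integer = 1 To s.Length
            curr(0) = i
            For j As Integer = 1 To t.Length
                Dim cost As Integer = If(s(i - 1) = t(j - 1), 0, 1)
                curr(j) = Math.Min(Math.Min(prev(j) + 1, curr(j - 1) + 1),
                                   prev(j - 1) + cost)
            Next
            ' Swap rows so curr becomes prev for the next character of s.
            Dim tmp = prev
            prev = curr
            curr = tmp
        Next

        Return prev(t.Length)
    End Function

    Sub Main()
        Console.WriteLine(LevDistTwoRow("Canda", "Canada"))      ' one insertion
        Console.WriteLine(LevDistTwoRow("nore", "more"))         ' one substitution
        Console.WriteLine(LevDistTwoRow("oficial", "official"))  ' one insertion
    End Sub

End Module
```

Each of the three calls prints 1, since every misspelling in these pairs is a single-character edit away from the intended word.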

LevDist would then be used to figure out which line is closest:

    ' Reference text to compare against. While it is left empty, LevDist
    ' simply returns each line's length, so the shortest line is selected.
    Dim correctLine As String = ""
    Dim line1 As String = "Canda has more than ones official language"
    Dim line2 As String = "Canada has more than one oficial languages"
    Dim line3 As String = "Canada has nore than one official lnguage"
    Dim line4 As String = "Canada has nore than one offical language"
    Dim lineArray As New List(Of String)
    Dim countArray As New List(Of Integer)

    lineArray.Add(line1)
    lineArray.Add(line2)
    lineArray.Add(line3)
    lineArray.Add(line4)

    ' Edit distance from every candidate line to correctLine.
    For i = 0 To lineArray.Count - 1
        countArray.Add(LevDist(lineArray(i), correctLine))
    Next

    ' Pick the index with the smallest distance (the last one wins on ties).
    Dim shortest As Integer = Integer.MaxValue
    Dim correctIndex As Integer = 0
    For i = 0 To countArray.Count - 1
        If countArray(i) <= shortest Then
            correctIndex = i
            shortest = countArray(i)
        End If
    Next
    Console.WriteLine(lineArray(correctIndex))
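
For the narrower question of the loop throwing when the lines have different word counts, one possible sketch is to loop up to the longest line and let only the lines that actually have a word at position i vote. `VoteByPosition` and the module name are hypothetical, and the sample shortens line 3 to 5 words as the question suggests. It assumes Option Infer On. Note that this does not realign words: if a word is missing from the middle of a line, everything after it votes in the wrong position, which is the case the matching approach from the comments is meant for.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module PositionVoteSketch

    ' Hypothetical helper: majority vote per word position, skipping
    ' lines that have no word at that position.
    Function VoteByPosition(lines As IEnumerable(Of String)) As String
        ' Split every line once; drop empty entries caused by repeated spaces.
        Dim wordLists = lines.
            Select(Function(l) l.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)).
            ToList()
        If wordLists.Count = 0 Then Return ""

        Dim maxLen As Integer = wordLists.Max(Function(words) words.Length)
        Dim result As New List(Of String)

        For i As Integer = 0 To maxLen - 1
            Dim pos As Integer = i ' copy so the lambdas do not capture the loop variable
            ' Only lines long enough to have a word at this position vote.
            Dim winner = wordLists.
                Where(Function(words) pos < words.Length).
                Select(Function(words) words(pos)).
                GroupBy(Function(word) word).
                OrderByDescending(Function(g) g.Count()).
                First().Key
            result.Add(winner)
        Next

        Return String.Join(" ", result)
    End Function

    Sub Main()
        Dim lines = {
            "Canda has more than ones official language",
            "Canada has more than one oficial languages",
            "Canada has nore than one",
            "Canada has nore than one offical language"
        }
        Console.WriteLine(VoteByPosition(lines))
    End Sub

End Module
```

With this input the sketch prints "Canada has more than one official language". Ties (for example "more" versus "nore", two votes each) go to the word that appears first, because the LINQ grouping and ordering preserve input order.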
SWB
  • I have run your code and it always gives the first line as the result. I added the `New` keyword to the declarations of lineArray and countArray. – myahia Feb 28 '18 at 20:43