I'm creating a Windows form application that allows the user to specify a text file as a data source, dynamically creates the form controls based on the number of columns in the file, and allows the user to input search parameters which will be used to search the file when a search button is clicked. Any results will be written to a new text file.

The files that will be searched by this program are often quite large (up to 12 GB). My current search method (read a line, search it, add it to the results file if it's a hit) works perfectly well for reasonably sized files (a few MBs or so). With my "large" test file (~2.5 GB), it takes about 12 minutes to search the file.

So my question is: What would be the best way to improve performance? After much searching and reading, I know that I have the following options:

  • Async methods
  • Tasks
  • TPL dataflow
  • Some combination of these methodologies

Since the logic of my program is more of a stream, I'm leaning towards dataflow, but I'm unsure how to implement it properly, or whether there may be a better solution. Below is the code for the Click event handler of the search button and the functions associated with the search.

    'Searches the loaded file
    Private Sub searchBtn_Click(sender As Object, e As EventArgs) Handles searchBtn.Click
        Dim strFileName As String
        Dim didWork As DialogResult
        Dim searchHits As Integer
        Dim watch As Stopwatch = Stopwatch.StartNew()

        'Prompts user to enter title of file to be created
        exportFD.Title = "Save as. . ."
        exportFD.Filter = "Text Files(*.txt)|*.txt" 'Limits user to only saving as .txt file
        didWork = exportFD.ShowDialog() 'Keep the dialog result so a Cancel click can be detected

        If didWork = DialogResult.Cancel Then 'Handles if Cancel Button is clicked
            Return
        Else
            strFileName = exportFD.FileName
            Dim writer As New IO.StreamWriter(strFileName, False) 
            Dim reader As New IO.StreamReader(filepath)
            Dim currentLine As String

            'Skip first line of SOURCE text file for search, but use it to write column headers to file
            currentLine = reader.ReadLine()
            Dim columnLine = currentLine.Split(vbTab)

            'First: Insert column names into NEW text file
            For col As Integer = 0 To colCount - 1
                writer.Write(columnLine(col) & vbTab)
            Next
            writer.Write(vbNewLine)

            'Search whole file, line by line
            Do While reader.Peek() >= 0 'Peek returns -1 once there is nothing left to read
                'next line
                currentLine = reader.ReadLine()

                'new function:
                If validChromosome(currentLine) Then
                    writer.WriteLine(currentLine)
                    searchHits += 1
                End If
            Loop

            'Close out writer and reader and tell user file was saved
            writer.Close()
            reader.Close()
            searchTxtB.Text = searchHits.ToString()
            watch.Stop()
            MsgBox("Searched in: " + watch.Elapsed.ToString() + " and saved to: " + strFileName)
        End If

    End Sub

    'This function searches through the current line and checks if it follows what the user has searched for
    Private Function validChromosome(chromString As String) As Boolean

        'Split line by delimiter
        Dim readRow() As String = Split(chromString, vbTab)
        validChromosome = True 'Start off as true

        Dim rowLength As Integer = readRow.Length - 1

        'Iterate through string tokens and compare 
        For token As Integer = 0 To rowLength
            Try
                Dim currentGroupBox As GroupBox = criteriaPanel.Controls.Item(token)
                Dim checkedParameter As CheckBox = currentGroupBox.Controls("CheckBox")

                'User wants to search this parameter
                If checkedParameter.Checked = True Then
                    Dim numericRadio As RadioButton = currentGroupBox.Controls("NumericRadio")

                    'Searching by number
                    If numericRadio.Checked = True Then
                        Dim value As Decimal
                        Dim lowerBox As NumericUpDown = currentGroupBox.Controls("NumericBoxLower")
                        Dim upperBox As NumericUpDown = currentGroupBox.Controls("NumericBoxUpper")

                        Dim lowerInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveLowerCheckBox")
                        Dim upperInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveUpperCheckBox")

                        'Try to convert the text to a decimal. 
                        If Not Decimal.TryParse(readRow(token), value) Then
                            validChromosome = False
                            Exit For
                        End If

                        'Not within the given range user inputted for numeric search
                        If Not withinRange(value, lowerBox.Value, upperBox.Value, lowerInclusiveCheck.Checked, upperInclusiveCheck.Checked) Then
                            validChromosome = False
                            Exit For
                        End If

                    Else 'Searching by text
                        Dim textBox As TextBox = currentGroupBox.Controls("TextBox")

                        'If the comparison failed, then this chromosome is not valid. Break out of loop and return false.
                        If Not [String].Equals(readRow(token), textBox.Text.ToString(), StringComparison.OrdinalIgnoreCase) Then

                            validChromosome = False
                            Exit For

                        End If
                    End If

                End If


            Catch ex As Exception

                'Simple error checking.
                MsgBox(ex.ToString)
                validChromosome = False
                Exit For

            End Try
        Next

    End Function

    'Function to check whether a value lies between two bounds, each optionally inclusive
    Private Function withinRange(value As Decimal, lower As Decimal, upper As Decimal, inclusiveLower As Boolean, inclusiveUpper As Boolean) As Boolean
        withinRange = False
        Dim lowerCheck As Boolean = False
        Dim upperCheck As Boolean = False

        If inclusiveLower Then
            lowerCheck = value >= lower
        Else
            lowerCheck = value > lower
        End If

        If inclusiveUpper Then
            upperCheck = value <= upper
        Else
            upperCheck = value < upper
        End If

        withinRange = lowerCheck And upperCheck

    End Function

My current theory is that I should create a TransformBlock that contains my file read method and fills a small buffer (~10 lines). That buffer would be passed to another TransformBlock that searches the lines and puts the results in a list, which would then be passed to a third TransformBlock to be written to the export file.

It is quite likely that my search function (validChromosome) is not very efficient, so any suggestions for improvements there would also be welcome. This is my first program, and I know that VB.NET likely isn't the best language for text file manipulation, but I'm being forced to use it. Thanks in advance for any help, and please let me know if any more information is needed.

T.S.
Jared Andrews
  • You'll get a bigger performance benefit by replacing string splits with regular expressions. They don't create temporary strings thus avoid wasting CPU and memory (a huge problem when working with large files), reduce garbage collections, they are thread safe so you can use static fields to store them and they are much faster than splitting and parsing the strings themselves. Otherwise, parallelism will simply create temporary objects faster, wasting memory and causing more garbage collections – Panagiotis Kanavos Jan 12 '15 at 12:50
  • Yes, I have done just that, as well as implementing a form of the suggestion @i3arnon made below. My speed has improved by ~80%, so thanks for all of the suggestions. – Jared Andrews Jan 12 '15 at 20:29

1 Answer

TPL Dataflow seems like a good fit, especially since it easily supports async.

I would keep the reading sequential, since hard drives mostly don't perform well with concurrent reads. That means the reader doesn't need a block of its own: simply read buffers in a while loop and post them to the TDF block. You can then have a TransformBlock that searches each buffer and passes the results to the next block, which saves them to a file.

The TransformBlock can run in parallel, so you should set an appropriate MaxDegreeOfParallelism (probably Environment.ProcessorCount).
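
As a rough illustration of that layout, here is a minimal sketch rather than a drop-in implementation. It assumes the System.Threading.Tasks.Dataflow NuGet package is referenced. The names SearchPipeline, SearchFileAsync, isMatch and batchSize are hypothetical: isMatch stands in for validChromosome, and the batch size of 1000 is a guess that should be tuned by measuring. The BoundedCapacity setting is also an assumption on my part, added so the reader cannot run far ahead of the search and buffer a multi-gigabyte file in memory.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow 'From the System.Threading.Tasks.Dataflow NuGet package

Module SearchPipeline
    'Hypothetical helper: isMatch stands in for validChromosome, batchSize is an assumed tuning knob
    Public Async Function SearchFileAsync(sourcePath As String,
                                          exportPath As String,
                                          isMatch As Func(Of String, Boolean),
                                          Optional batchSize As Integer = 1000) As Task(Of Integer)
        Dim hits As Integer = 0

        Using reader As New StreamReader(sourcePath), writer As New StreamWriter(exportPath, False)
            'Copy the header line straight to the export file before the search starts
            Dim header As String = Await reader.ReadLineAsync()
            If header IsNot Nothing Then writer.WriteLine(header)

            'Search block: filters each batch of lines, several batches at a time
            Dim searchBlock As New TransformBlock(Of List(Of String), List(Of String))(
                Function(batch) batch.Where(isMatch).ToList(),
                New ExecutionDataflowBlockOptions With {
                    .MaxDegreeOfParallelism = Environment.ProcessorCount,
                    .BoundedCapacity = Environment.ProcessorCount * 2})

            'Write block: processes one batch at a time by default,
            'so the writer and the hit counter need no locking
            Dim writeBlock As New ActionBlock(Of List(Of String))(
                Sub(matches)
                    For Each line In matches
                        writer.WriteLine(line)
                    Next
                    hits += matches.Count
                End Sub)

            searchBlock.LinkTo(writeBlock, New DataflowLinkOptions With {.PropagateCompletion = True})

            'Sequential read loop: fill a batch, then hand it to the search block.
            'SendAsync waits asynchronously while the search block is at capacity.
            Dim buffer As New List(Of String)(batchSize)
            Dim currentLine As String = Await reader.ReadLineAsync()
            While currentLine IsNot Nothing
                buffer.Add(currentLine)
                If buffer.Count = batchSize Then
                    Await searchBlock.SendAsync(buffer)
                    buffer = New List(Of String)(batchSize)
                End If
                currentLine = Await reader.ReadLineAsync()
            End While
            If buffer.Count > 0 Then Await searchBlock.SendAsync(buffer)

            'Signal that no more batches are coming, then wait for the last write
            searchBlock.Complete()
            Await writeBlock.Completion
        End Using

        Return hits
    End Function
End Module
```

From an Async click handler this could be called as `searchHits = Await SearchFileAsync(filepath, strFileName, AddressOf someMatcher)`, where someMatcher is a placeholder. Because the search block runs on thread-pool threads, the predicate should not read form controls directly; copying the search criteria into plain variables before starting the pipeline avoids cross-thread access to the UI. A TransformBlock emits its results in input order by default, so the hits should appear in the export file in the same order as in the source.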

i3arnon
  • That makes sense. I suppose for the `StreamReader`, I could just set it to wait when the max number of buffers are created and resume when one is passed to the block doing the searching. Thanks for the help. This should cut the time down by about 75%, and while that's great, do you know of anything to increase performance even further? You said TPL supports `async`, so how would that be implemented as well? – Jared Andrews Jan 09 '15 at 23:21
  • @JaredAndrews If that's an issue the `TransformBlock` can have a `BoundedCapacity` and the reader posts with `await block.SendAsync(buffer)`. If the capacity is reached the reader would asynchronously wait until the block clears up. – i3arnon Jan 09 '15 at 23:24