I have a "container" file containing data. Its size is roughly 100 MB. Inside the container there are several "dataids" that mark the beginning of something.

Now I need to get the index (byte offset) of a given dataid. (Example dataid: '4CFE7197-0029-006B-1AD4-000000000012')

I have tried several approaches, but at this moment reading all bytes at once ("ReadAll") is the fastest.

ReadAll -> average of 0.6 seconds

Using oReader As New BinaryReader(File.Open(sContainerPath, FileMode.Open, FileAccess.Read))
    Dim iLength As Integer = CInt(oReader.BaseStream.Length)
    Dim oValue As Byte() = Nothing
    oValue = oReader.ReadBytes(iLength)
    Dim enc As New System.Text.ASCIIEncoding
    Dim sFileContent As String = enc.GetString(oValue)

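    ' Escape the id so it is matched literally, not as a regex pattern.
    Dim r As Regex = New Regex(Regex.Escape(sDataId))
    Dim oMatch As Match = r.Match(sFileContent)
    ' Check Success: Index is 0 both for "no match" and for a match at offset 0.
    If oMatch.Success Then
        Return oMatch.Index
    End If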
End Using

ReadByteByByte -> average of 1.4 seconds

Using oReader As BinaryReader = New BinaryReader(File.Open(sContainerPath, FileMode.Open, FileAccess.Read))
    Dim valueSearch As StringSearch = New StringSearch(sDataId)

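    ' Stream.ReadByte returns -1 at end of file; BinaryReader.ReadByte throws instead.
    Dim nextByte As Integer
    While (InlineAssignHelper(nextByte, oReader.BaseStream.ReadByte()) >= 0)
        index += 1
        If valueSearch.Found(CByte(nextByte)) Then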
            Return index - iDataIdLength
        End If
    End While
End Using



Public Class StringSearch
    Private ReadOnly oValue() As Byte
    Private iValueIndex As Integer = -1

    Public Sub New(value As String)
        Dim oEncoding As New System.Text.ASCIIEncoding
        Me.oValue = oEncoding.GetBytes(value)
    End Sub

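    Public Function Found(oNextByte As Byte) As Boolean
        If oValue(iValueIndex + 1) <> oNextByte Then
            ' Mismatch: restart, but let this byte begin a new match.
            ' (A pattern that overlaps itself would need a KMP failure table.)
            iValueIndex = -1
            If oValue(0) <> oNextByte Then Return False
        End If

        iValueIndex += 1
        Return iValueIndex + 1 = oValue.Length
    End Function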
End Class

Public Function InlineAssignHelper(Of T)(ByRef target As T, ByVal value As T) As T
    target = value
    Return value
End Function

I find it hard to believe that there is no faster way. 0.6 seconds for a 100MB file is not an acceptable time.

Another approach that I tried is to split the file into chunks of X bytes (100, 1000, ...), but that was a lot slower.
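
For reference, a minimal sketch of what a chunked search over the raw bytes could look like. It assumes a much larger buffer than 100 or 1000 bytes (64 KB here) and carries the last "pattern length minus one" bytes over to the next chunk, so a dataid that straddles two chunks is still found. The names `ChunkedSearch`, `FindInFile` and `iBufferSize` are hypothetical and not part of the code above, and no timing is claimed for it.

```vb
Imports System.IO
Imports System.Text

Public Module ChunkedSearch
    ' Hypothetical helper: returns the byte offset of the first occurrence
    ' of sDataId in the file, or -1 if it is not present.
    Public Function FindInFile(sPath As String, sDataId As String,
                               Optional iBufferSize As Integer = 65536) As Long
        Dim oPattern As Byte() = Encoding.ASCII.GetBytes(sDataId)
        If oPattern.Length = 0 Then Return -1

        Dim iOverlap As Integer = oPattern.Length - 1
        Dim oBuffer(iOverlap + iBufferSize - 1) As Byte
        Dim iCarried As Integer = 0   ' bytes kept from the previous chunk
        Dim lBase As Long = 0         ' file offset of oBuffer(0)

        Using oStream As New FileStream(sPath, FileMode.Open, FileAccess.Read,
                                        FileShare.Read, iBufferSize,
                                        FileOptions.SequentialScan)
            While True
                Dim iRead As Integer = oStream.Read(oBuffer, iCarried, iBufferSize)
                If iRead <= 0 Then Exit While
                Dim iValid As Integer = iCarried + iRead

                ' Naive scan of the valid part of the buffer.
                For i As Integer = 0 To iValid - oPattern.Length
                    Dim j As Integer = 0
                    While j < oPattern.Length AndAlso oBuffer(i + j) = oPattern(j)
                        j += 1
                    End While
                    If j = oPattern.Length Then Return lBase + i
                Next

                ' Keep the tail so a match straddling two chunks is not lost.
                Dim iKeep As Integer = Math.Min(iOverlap, iValid)
                Array.Copy(oBuffer, iValid - iKeep, oBuffer, 0, iKeep)
                lBase += iValid - iKeep
                iCarried = iKeep
            End While
        End Using

        Return -1
    End Function
End Module
```

Called as `ChunkedSearch.FindInFile(sContainerPath, sDataId)`, this never converts the file to a string and never holds more than one chunk in memory.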

Any help on an approach I can try?

Stinus
  • Have you tried `File.ReadAllText`/`File.ReadAllLines` and then `r.Match(sFileContent).Index`? Also, if you want just the first index, why not use `IndexOf`? – varocarbas Nov 22 '13 at 16:27
  • The file also has binary content. That's why I'm using the BinaryReader. The regex match is a lot faster than IndexOf – Stinus Nov 22 '13 at 17:40
  • Then File.ReadAllBytes. I usually advise against the File.ReadAll alternatives, but if the file size is completely under control and speed is that important, it seems better not to make so many calls or declare so many variables. Regarding "The regex match is a lot faster than IndexOf": is that just your impression, do you have a reference, or are you saying this because you ran tests under your conditions? – varocarbas Nov 22 '13 at 17:46
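
One way to settle the regex versus `IndexOf` question from the comments is to time both on the same string. The sketch below is only an assumption about how such a test could be set up; `SearchTiming` and `CompareSearches` are hypothetical names. It passes `StringComparison.Ordinal` explicitly, because `String.IndexOf(String)` without a comparison argument performs a culture-sensitive search, which may account for part of the difference observed.

```vb
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Public Module SearchTiming
    ' Hypothetical helper: times both lookups on the same string and
    ' returns the two indexes so they can be checked against each other.
    Public Function CompareSearches(sFileContent As String, sDataId As String) As Integer()
        Dim oWatch As Stopwatch = Stopwatch.StartNew()
        Dim oMatch As Match = Regex.Match(sFileContent, Regex.Escape(sDataId))
        Dim iRegexIndex As Integer = If(oMatch.Success, oMatch.Index, -1)
        Console.WriteLine("Regex:   {0} ms", oWatch.ElapsedMilliseconds)

        oWatch.Restart()
        ' Ordinal comparison: plain character-by-character matching.
        Dim iOrdinalIndex As Integer = sFileContent.IndexOf(sDataId, StringComparison.Ordinal)
        Console.WriteLine("IndexOf: {0} ms", oWatch.ElapsedMilliseconds)

        Return New Integer() {iRegexIndex, iOrdinalIndex}
    End Function
End Module
```

Both indexes should agree; a single run is noisy, so repeating the call a few times gives a fairer comparison.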

0 Answers