Scenario: I wrote a program that calculates the hash of a file and compares it against a list of hashes that I keep in a text file (about 1 million entries and growing). What is the best way to make the comparison as fast as possible?

I'm using this function:

Public HashList As New List(Of String)

Private Sub LoadHash()
   For Each hash As String In IO.File.ReadAllLines("C:\test\hash.txt")
      HashList.Add(hash)
   Next
End Sub

Private Function CheckFile(ByVal filename As String) As Boolean
   If HashList.Contains(MD5(filename)) Then
      Return True
   End If

   Return False
End Function

Any suggestions for improving this code? Are there better methods?

croxy
  • Looking through a file with millions of hashes is not something that's easily optimized as it requires a lot of I/O and will be harder to fit in RAM as the file grows. Consider switching to a database instead. – Visual Vincent Dec 19 '18 at 17:14
  • I found another question in Python where the OP has benchmarked different methods of locating a string in a large text file. It might help you get some perspective on what you're up against :) https://stackoverflow.com/q/6219141 – Visual Vincent Dec 19 '18 at 17:22
  • I don't know which one would be best, but there are better collections to use. For example, if you stored the hashes in a dictionary, the search would be O(1) instead of the O(n) of a list. Take a look at the [other types of collections](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic?view=netframework-4.7.2). Maybe a HashSet – the_lotus Dec 19 '18 at 17:27
  • Is this an occasional one-off check? If so, you could have several files of the hashes with names like `"hash" & Math.Floor(Math.Log10(filesize + 1)) & ".txt"`. That way, there would be less data to read, which would be faster, and fewer items to look through, which would also be faster. – Andrew Morton Dec 19 '18 at 23:19

1 Answer

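Try a collection type better suited to lookups, such as a HashSet. HashSet(Of String).Contains is O(1) on average, whereas List(Of String).Contains scans the whole list in O(n). .NET has many collection types, and each has its use.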

Public HashList As New HashSet(Of String)
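
As a rough sketch of how the rest of the question's code could look with that change (not a definitive implementation): the module name, the `path` parameter, and `IsKnownHash` are hypothetical names introduced here, and the case-insensitive comparer assumes the text file may mix upper- and lower-case hex digits. The hash of the file being checked is passed in as a string, since the question's `MD5` function is not shown.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module HashLookupSketch
    ' Assumption: hashes are hex strings whose letter case may vary,
    ' so the set compares them case-insensitively.
    Public HashList As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

    ' Streams the file line by line (File.ReadLines) rather than
    ' building a full array first (File.ReadAllLines).
    Public Sub LoadHash(path As String)
        For Each line As String In File.ReadLines(path)
            Dim hash As String = line.Trim()
            ' Skip blank lines; HashSet.Add silently ignores duplicates.
            If hash.Length > 0 Then HashList.Add(hash)
        Next
    End Sub

    ' fileHash is the already-computed hash of the file being checked.
    Public Function IsKnownHash(fileHash As String) As Boolean
        Return HashList.Contains(fileHash)
    End Function
End Module
```

With this layout the file is read once at startup, and each later check is a single set lookup. The whole list still has to fit in memory, so the database suggestion from the comments may become the better fit as the file grows.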
the_lotus