I have been developing an antivirus in VB.NET. The virus scanner works fine, but I am looking for ways to speed up scanning, because large files take forever.

The scanner detects viruses by matching binary signatures (converted to hex). I suspect I don't have to search the whole file to decide whether it is infected; there may be a specific location and a specific number of bytes that I should scan instead. If anyone can help with this, please do.

Thanks in advance.

BTW, the virus signatures come from the hex collection of the ClamAV antivirus.

Paul Sasik
Seif Shawkat
  • Get a faster disk. In lieu of that, make it unobtrusive to the user. Definitely don't nag with "look I'm scanning!" animations. – Hans Passant Mar 14 '11 at 21:19
  • @HansPassant: Actually, till now it just consists of a couple of labels and 2 buttons. What I'm trying to speed up is the scanning process (which searches all of the bytes in the file). – Seif Shawkat Mar 14 '11 at 21:23
  • @Seif: It sounds like you need to know exactly where in a file to match signatures. Doesn't ClamAV provide this info? – Paul Sasik Mar 14 '11 at 21:25
  • @PaulSasik: That's what I thought, but after uncompressing the main.cvd file and opening the main.db file, all I found was the name and the hex signature – Seif Shawkat Mar 14 '11 at 21:28
  • You're missing the point. To read files off a hard disk faster, you'll need a hamster that spins it faster. They cost money. *Never* let a user see how long it takes to read a terabyte of data. – Hans Passant Mar 14 '11 at 21:30
  • @HansPassant: I'm trying to make a successful program that scans fast. The thing that needs changing is the engine (maybe the signatures too, although I'm not sure). I don't want to cheat the user by telling them the scan is finished when it's still at the beginning of the file. Plus, reading the file takes much less time than scanning it. – Seif Shawkat Mar 14 '11 at 21:42
  • @HP: He's trying to figure out if he can just scan a part of a file, a known address, to match a virus signature rather than an entire file. i think... @Seif: You might be better off asking the clamAV people directly whether or not this info exists. – Paul Sasik Mar 14 '11 at 21:43
  • @PaulSasik: Exactly! I'll try getting answers from them and post back with updates. – Seif Shawkat Mar 14 '11 at 21:45
  • Is it taking a long time because you are checking for hundreds or thousands of possible codes? If so, maybe the optimization you need is for scanning for multiple codes rather than for scanning large portions of a file. – BlueMonkMN Mar 15 '11 at 20:32
  • @BlueMonkMN: Even if I'm scanning for only one signature, it takes a long time to scan 5,000,000 bytes (about 5 MB). And I don't think I have to scan the whole 5,000,000 bytes... – Seif Shawkat Mar 16 '11 at 00:15
  • BTW, till now I didn't get a response from clamAv... Maybe I just have to wait a bit longer? – Seif Shawkat Mar 16 '11 at 00:17
  • Because some viruses are polymorphic, you may have to scan the entire file for certain signatures. Also, I suspect many viruses can infect a variety of files in ways that don't allow you to scan just one particular offset. I think you will usually have to scan the whole file. – BlueMonkMN Mar 16 '11 at 13:12

2 Answers

Well, it all depends: what is your definition of a virus signature?
I suggest you parse the executable and scan only the code section.
However, a polymorphic virus keeps its malicious code in the data section in encrypted form, so I am not entirely sure this is enough.
Are you using some kind of n-gram technique, or just mining frequent hex codes?
Scan time is a very important issue!
I once wrote a command-line scanner that could check a file in less than a second, in fact tons of files per second.
The technique was frequent opcode mining.

Rahul Gautam
Grijesh Chauhan
  • The technique I'm using (or was using since I discontinued this project long ago) was a search-and-find technique. I first converted the file-to-be-scanned to hex and then searched for hex strings (obtained from an array of hex strings or signatures) in the converted hex code. I was able to scan small files (a couple of KBs) in less than a second, however it took more time to search through larger files. – Seif Shawkat Oct 07 '12 at 12:30
  • @SeifShawkat: Using an AVL tree to represent your file in memory may be useful to you. – Grijesh Chauhan Oct 08 '12 at 03:43

Perhaps your pattern scan is inefficient. I can scan for a pattern in a 7 MB file in about 1/20th of a second using code like this. Note, if you really want to use code like this, you have to make a correction. You can't always set MatchedLength back to 0 when you realize that you aren't looking at a match, but it does work for this particular pattern. You have to pre-process the pattern so you know what to reset to when you don't find a match, but that will not add significant time to the algorithm. I could make the effort to correctly complete the algorithm, but I won't do that now if your question is just about performance. I'm just demonstrating that it is possible to scan large files quickly if you do it correctly.

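Imports System
Imports System.Diagnostics

Module PatternScan
  Sub Main(ByVal args As String())
    If args.Length < 1 Then Return
    Dim startTime As Long = Stopwatch.GetTimestamp()
    ' The signature to look for, as raw bytes
    Dim pattern As Byte() = System.Text.Encoding.UTF8.GetBytes("SFMB")
    Dim bufferSize As Integer = 4096
    ' Read the file sequentially in fixed-size chunks instead of loading it whole
    Using reader As New System.IO.FileStream(args(0), IO.FileMode.Open, _
       Security.AccessControl.FileSystemRights.Read, IO.FileShare.Read, bufferSize, IO.FileOptions.SequentialScan)
       Dim buffer(bufferSize - 1) As Byte
       Dim readLength As Integer = reader.Read(buffer, 0, bufferSize)
       ' Number of pattern bytes matched so far; carried across buffer reads
       Dim matchedLength As Integer = 0
       Dim searchPos As Integer = 0
       ' Long, so that offsets in files larger than 2 GB do not overflow
       Dim fileOffset As Long = 0
       Do While readLength > 0
          For searchPos = 0 To readLength - 1
             If pattern(matchedLength) = buffer(searchPos) Then
                matchedLength += 1
             Else
                ' Mismatch: restart, but re-test this byte as a possible new start.
                ' This simple reset is only valid when the first pattern byte does not
                ' recur inside the pattern (true for "SFMB"); the general case needs a
                ' pre-processed fallback table.
                matchedLength = If(pattern(0) = buffer(searchPos), 1, 0)
             End If
             If matchedLength = pattern.Length Then
                Console.WriteLine("Found pattern at position {0}", fileOffset + searchPos - matchedLength + 1)
                matchedLength = 0
             End If
          Next
          fileOffset += readLength
          readLength = reader.Read(buffer, 0, bufferSize)
       Loop
    End Using
    Dim endTime As Long = Stopwatch.GetTimestamp()
    Console.WriteLine("Search took {0} seconds", (endTime - startTime) / Stopwatch.Frequency)
  End Sub
End Module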
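
The correction mentioned above, pre-processing the pattern so the scan knows what to reset to after a mismatch, is what the Knuth-Morris-Pratt algorithm does. Here is a minimal sketch of that idea. It is not part of the original code; the names `KmpSketch`, `BuildFallback` and `FindAll` are made up for illustration, and it searches an in-memory byte array rather than a file stream:

```vbnet
Imports System
Imports System.Collections.Generic

Module KmpSketch
    ' fallback(i) is the length of the longest proper prefix of
    ' pattern(0..i) that is also a suffix of pattern(0..i).
    Function BuildFallback(ByVal pattern As Byte()) As Integer()
        Dim fallback(pattern.Length - 1) As Integer
        Dim k As Integer = 0
        For i As Integer = 1 To pattern.Length - 1
            Do While k > 0 AndAlso pattern(i) <> pattern(k)
                k = fallback(k - 1)
            Loop
            If pattern(i) = pattern(k) Then k += 1
            fallback(i) = k
        Next
        Return fallback
    End Function

    ' Returns the start offset of every occurrence of pattern in data.
    Function FindAll(ByVal data As Byte(), ByVal pattern As Byte()) As List(Of Integer)
        Dim hits As New List(Of Integer)
        If pattern.Length = 0 Then Return hits
        Dim fallback As Integer() = BuildFallback(pattern)
        Dim matched As Integer = 0
        For pos As Integer = 0 To data.Length - 1
            ' On a mismatch, fall back to the longest prefix that can still match
            Do While matched > 0 AndAlso data(pos) <> pattern(matched)
                matched = fallback(matched - 1)
            Loop
            If data(pos) = pattern(matched) Then matched += 1
            If matched = pattern.Length Then
                hits.Add(pos - matched + 1)
                ' Keep going so that overlapping occurrences are also reported
                matched = fallback(matched - 1)
            End If
        Next
        Return hits
    End Function
End Module
```

Because the only state carried from one byte to the next is `matched`, the same loop should also work chunk by chunk over a file buffer, as long as `matched` is kept between reads.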

EDIT

Here are some thoughts about how you could match multiple patterns at once. This is just off the top of my head and I have not tried to compile the code:

Create a class to contain information about the status of a pattern:

Class PatternInfo
   Public pattern As Byte()
   Public matchedBytes As Integer
End Class

Declare a variable to track all the patterns that you need to check and index them by the first byte of the pattern for quick lookup:

Dim patternIndex As New Dictionary(Of Byte, IEnumerable(Of PatternInfo))

Check all the patterns that are currently a potential match to see if the next byte also matches on these patterns; if not, stop looking at that pattern at that position:

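Dim activePatterns As New LinkedList(Of PatternInfo)
Dim newPatterns As IEnumerable(Of PatternInfo) = Nothing

' Iterate over a snapshot (ToArray needs Imports System.Linq) so entries can be removed while looping
For Each activePattern In activePatterns.ToArray()
   If activePattern.pattern(activePattern.matchedBytes) = buffer(searchPos) Then
      activePattern.matchedBytes += 1
      If activePattern.matchedBytes >= activePattern.pattern.Length Then
         Console.WriteLine("Found pattern at position {0}", searchPos - activePattern.matchedBytes + 1)
         ' A completed match must leave the list, or the next byte would index past the pattern
         activePatterns.Remove(activePattern)
      End If
   Else
      activePatterns.Remove(activePattern)
   End If
Next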

See if the current byte looks like the beginning of a new pattern that you would be searching for; if so, add it to the list of active patterns:

If patternIndex.TryGetValue(buffer(searchPos), newPatterns) Then
   For Each newPattern In newPatterns
      ' LinkedList has no public Add method; AddLast appends the new candidate
      activePatterns.AddLast(New PatternInfo() With { _
         .pattern = newPattern.pattern, .matchedBytes = 1 })
   Next
End If
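
Putting the fragments above together, a self-contained version might look like the following. This is only a sketch under assumptions: the names `Signature`, `Candidate`, `BuildIndex` and `Scan` are hypothetical, the data is an in-memory byte array, and hits are reported as `name@offset` strings purely to keep the example short.

```vbnet
Imports System
Imports System.Collections.Generic

' A named byte pattern (hypothetical type for this sketch).
Public Class Signature
    Public Name As String
    Public Bytes As Byte()
End Class

' An in-progress match: which signature, and how many bytes matched so far.
Public Class Candidate
    Public Sig As Signature
    Public Matched As Integer
End Class

Public Module MultiPatternSketch
    ' Groups signatures by their first byte for quick lookup.
    Function BuildIndex(ByVal sigs As IEnumerable(Of Signature)) As Dictionary(Of Byte, List(Of Signature))
        Dim index As New Dictionary(Of Byte, List(Of Signature))
        For Each s As Signature In sigs
            If s.Bytes.Length = 0 Then Continue For
            Dim bucket As List(Of Signature) = Nothing
            If Not index.TryGetValue(s.Bytes(0), bucket) Then
                bucket = New List(Of Signature)
                index(s.Bytes(0)) = bucket
            End If
            bucket.Add(s)
        Next
        Return index
    End Function

    ' Scans data once, tracking every signature that could still match.
    Function Scan(ByVal data As Byte(), ByVal index As Dictionary(Of Byte, List(Of Signature))) As List(Of String)
        Dim hits As New List(Of String)
        Dim active As New List(Of Candidate)
        For pos As Integer = 0 To data.Length - 1
            Dim current As Byte = data(pos)
            ' Advance or drop existing candidates (backwards, so RemoveAt is safe)
            For i As Integer = active.Count - 1 To 0 Step -1
                Dim c As Candidate = active(i)
                If c.Sig.Bytes(c.Matched) <> current Then
                    active.RemoveAt(i)
                Else
                    c.Matched += 1
                    If c.Matched = c.Sig.Bytes.Length Then
                        hits.Add(String.Format("{0}@{1}", c.Sig.Name, pos - c.Matched + 1))
                        active.RemoveAt(i)
                    End If
                End If
            Next
            ' Start a new candidate for every signature beginning with this byte
            Dim starters As List(Of Signature) = Nothing
            If index.TryGetValue(current, starters) Then
                For Each s As Signature In starters
                    If s.Bytes.Length = 1 Then
                        hits.Add(String.Format("{0}@{1}", s.Name, pos))
                    Else
                        active.Add(New Candidate With {.Sig = s, .Matched = 1})
                    End If
                Next
            End If
        Next
        Return hits
    End Function
End Module
```

With tens of thousands of signatures, many of them may share a first byte, so the active list can still grow large. The Aho-Corasick algorithm is the usual way to handle that case, at the cost of a more involved data structure.
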
BlueMonkMN
  • You could improve the efficiency even more by using even better algorithms such as Boyer-Moore: http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm – BlueMonkMN Mar 16 '11 at 13:19
  • That is slower than the code I have... ` Public Function FindSequence(ByVal list() As Byte, ByVal value() As Byte)As Integer Dim startIndex As Integer = Array.IndexOf(list, value(0)) Do Until startIndex = -1 OrElse list.Length - startIndex < value.Length Dim runLength As Integer = 0 For index As Integer = 0 To value.Length - 1 If value(index) <> list(startIndex + index) Then Exit For runLength += 1 Next If runLength = value.Length Then Return startIndex startIndex = Array.IndexOf(list, value(0), startIndex + runLength) Loop Return -1 End Function ` – Seif Shawkat Mar 16 '11 at 16:43
  • @SeifShawkat Did you try it? I don't know how it could be slower, it does less looping than your code. And it runs in a fraction of a second. Does yours? – BlueMonkMN Mar 16 '11 at 17:13
  • Well, actually you were a bit right in the first place, it's a bit slow because it's searching for 28,784 virus signatures... – Seif Shawkat Mar 16 '11 at 18:01
  • @BlueMonkMN: Mine took less than 1 second to scan a 13 byte file (with all the signatures), yours somehow took up to 4 seconds. – Seif Shawkat Mar 16 '11 at 18:03
  • There must have been something lost in translation. It took less than 1 second when I tried this on a 7 MB file. – BlueMonkMN Mar 16 '11 at 20:41
  • Yes, but that's only for one signature. Did you try 20,000 signatures? :p – Seif Shawkat Mar 17 '11 at 05:14
  • OK, I missed one of your comments about multiple signatures. Now I understand. Perhaps what you need is some way to efficiently scan for all signatures at once. Are you scanning each signature separately? I think you should scan all signatures at once because then, for example, if you know that none of your signatures start with "G" then you can skip checking any of your signatures when you see a "G". You could create a tree structure to help you know which signatures to check based on the character you are scanning. – BlueMonkMN Mar 17 '11 at 11:07
  • But how could I do that? The thing is that I have to scan the whole file? – Seif Shawkat Mar 17 '11 at 22:32
  • I can't be certain that you need to scan the whole file, but I suspect that you might need to scan most of the file, and if that's the case, you need to do it efficiently. I think you could do this by creating a structure to optimize your scan. Start with a Dictionary object that maps each character to an array of pattern objects that start with that character. When you scan a character from the file, look up that dictionary entry. If it doesn't exist, skip to the next character in the file. If there is an entry, add it to a list of patterns that may match. I can comment more if necessary... – BlueMonkMN Mar 18 '11 at 14:49
  • See where I marked "EDIT" in my answer. I added some details about how you might more optimally search multiple patterns. – BlueMonkMN Mar 18 '11 at 18:41
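
The Boyer-Moore suggestion in the comments above can be illustrated with its simpler Horspool variant, which uses only a per-byte shift table. The sketch below is that simplified variant, not full Boyer-Moore. It is not taken from the thread; the names `HorspoolSketch` and `FindFirst` are hypothetical, and it searches for a single pattern in an in-memory byte array:

```vbnet
Imports System

Public Module HorspoolSketch
    ' Returns the offset of the first occurrence of pattern in data, or -1.
    Function FindFirst(ByVal data As Byte(), ByVal pattern As Byte()) As Integer
        Dim m As Integer = pattern.Length
        If m = 0 OrElse data.Length < m Then Return -1

        ' shift(b): how far the window may slide when its last byte is b.
        Dim shift(255) As Integer
        For b As Integer = 0 To 255
            shift(b) = m
        Next
        For i As Integer = 0 To m - 2
            shift(pattern(i)) = m - 1 - i
        Next

        Dim start As Integer = 0
        Do While start <= data.Length - m
            ' Compare the window against the pattern from right to left
            Dim j As Integer = m - 1
            Do While j >= 0 AndAlso data(start + j) = pattern(j)
                j -= 1
            Loop
            If j < 0 Then Return start
            ' Slide by the table entry for the window's last byte
            start += shift(data(start + m - 1))
        Loop
        Return -1
    End Function
End Module
```

For longer signatures the window can often skip several bytes at a time, which is where the speed-up over a byte-by-byte comparison would come from.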