
Does it make sense that a VB6 program that processes ~50,000 XML files runs about 3x faster (in terms of files processed per second) when the files are each about 30KB in size than when they are each about 4KB in size? If it does make sense, how can I speed up processing of the smaller files?

The program reads each file, computes an MD5 hash value for the file and calls a SQL Server stored procedure to see if a version of the file having the same hash value is already stored in a database. If the file's computed hash value is already stored in the database the file isn't processed further; the program just repeats with the next file.
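The question doesn't show how the hash is computed, but for context, one common way to produce an MD5 file hash from VB6 is through the .NET Framework's COM-visible crypto classes. This is a hedged sketch, not the program's actual code — the ADODB.Stream read, the `ComputeHash_2` overload name, and the file path are all assumptions (and require the .NET Framework to be installed):

```vb
Dim oMD5 As Object
Dim oStream As Object
Dim arrBytes() As Byte
Dim i As Long
Dim sHash As String

' .NET's MD5 provider is exposed to COM when the .NET Framework is installed
Set oMD5 = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")

' Read the whole file into a byte array
Set oStream = CreateObject("ADODB.Stream")
oStream.Type = 1            ' adTypeBinary
oStream.Open
oStream.LoadFromFile "C:\These are my XML Files\sample.xml"
arrBytes = oStream.Read
oStream.Close

' ComputeHash_2 is the byte-array overload as seen through COM interop
arrBytes = oMD5.ComputeHash_2((arrBytes))

' Convert the 16 hash bytes to a hex string to pass to the stored procedure
For i = 0 To UBound(arrBytes)
    sHash = sHash & Right$("0" & Hex$(arrBytes(i)), 2)
Next i
```

A side benefit of this shape is that the file's bytes are already in memory after hashing, so a second disk read for parsing could in principle be avoided.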

I've been testing on batches of ~50,000 XML files that have already been processed into the database, so the program just loops: get the next XML file → hash it → call the SProc → repeat.

I expected the program to run faster on similar-sized batches of smaller files, but it's significantly slower.

The program runs on a 64-bit Windows 10 workstation. The files are stored in a single directory on the workstation's C:\ drive (which is an SSD). SQL Server runs on a VM under Windows Server.

EDIT: I think I have found the bottleneck, but I don't know how to solve it. The relevant piece of code is below. The bottleneck is caused by the xDOC.async = False statement. If I remove it I get an immediate 20x speed improvement, BUT removing it causes document-load failure errors since the code apparently can't handle asynchronous file loading. Can this be sped up?

Dim objFSO As Object
Dim objFolder As Object
Dim objFile As Object
Dim xDOC As MSXML2.DOMDocument
Dim xPE As MSXML2.IXMLDOMParseError

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFolder = objFSO.GetFolder("C:\These are my XML Files")

For Each objFile In objFolder.Files

    Set xDOC = New DOMDocument
    xDOC.async = False    ' <-- THIS LINE IS THE PROBLEM

    If xDOC.Load(objFile.Path) Then

        ' process the file

    Else
        Set xPE = xDOC.parseError
        With xPE
            ' set up objFile.Name failed-to-load error message
        End With
        ' log error details
        Set xPE = Nothing
    End If

    Set xDOC = Nothing

Next objFile
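For what it's worth, here is a sketch of what asynchronous loading would require: with async = True, Load returns before parsing finishes, so the document (and its parseError) must not be inspected until readyState reaches COMPLETED — which is why simply deleting the async = False line produces load-failure errors. This is a hedged illustration of the standard MSXML async pattern, not code from the program above:

```vb
Set xDOC = New DOMDocument
xDOC.async = True
xDOC.Load objFile.Path          ' returns immediately; parsing continues in the background

' The document is only safe to inspect once parsing completes (readyState 4 = COMPLETED)
Do While xDOC.readyState <> 4
    DoEvents                    ' yield so the background parse can progress
Loop

If xDOC.parseError.errorCode <> 0 Then
    ' handle the parse failure as before
End If
```

Note that waiting in a loop like this gives back the synchronous behavior, so by itself it would not recover the apparent 20x speedup.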
– BRW
  • Where is the time difference? In the reading, hashing or db call? – Alex K. Apr 30 '20 at 08:20
  • The system has a better chance to predict file access patterns the larger the file is. An XML parser will generally access the file in sequential order, a pattern the system is well prepared to optimize. By the time new data is requested it has been read into memory already with high probability. As far as I know, the system cannot optimize disk access across different files. For lots of small files you're going to observe the worst case performance. – IInspectable Apr 30 '20 at 08:51
  • The program is probably not parsing the file if it is doing an MD5 hash. A 4K file should be faster to read and run a hash on than a 30K file. If there are the same number of files then I don't think it would run slower. Now if it was the same amount of data, but split into many more smaller files then I may agree. Anyway, BRW you will need to show some code, because otherwise we are all just guessing. – tcarvin Apr 30 '20 at 13:30
  • The program parses the XML, but only if the SProc's hash comparison indicates that the same version of the XML file isn't already stored in the database; otherwise there's no need to re-parse the same version of the file. I'll do some further tests as suggested by Alex K and report back. – BRW Apr 30 '20 at 18:00
  • In regards to async, you didn't really see a 20x improvement, you are allowing your code to continue **before** the XML doc is fully parsed (it will happen in background). That's not apples to apples. – tcarvin May 01 '20 at 12:45
  • @tcarvin I take your point. I am also beginning to doubt that I have really found the bottleneck. For all I know the culprit is something else entirely, like missing or improperly configured database indexes. Maybe I should just delete this question (since it may just be misleading people), do a whole lot of further testing and experimentation, and then perhaps post a different question unless I manage to figure things out on my own. I certainly appreciate everyone's comments thus far. – BRW May 02 '20 at 03:37

0 Answers