need a faster way to match these data sets

Question

I have a set of Excel sheets, each set up as follows:

ID | imageName
--------------
1    abc.jpg
2    def.bmp
3    abc.jpg
4    xyz123.jpg

This sheet corresponds to a folder with contents like:

abc.pdf
ghijkl.pdf
def.pdf
def.xls
x-abc.pdf

I'm trying to generate a report that matches the instance of each imageName with the lowest ID with the PDFs that match it, and also identifies unmatched imageName in the sheet and unmatched PDFs in the folder. A filename with an "x-" prefix is equivalent to one without the prefix, so the report for this data set would be as follows:

ID  imageName   filename
-----------------------
1   abc.jpg     abc.pdf
1   abc.jpg     x-abc.pdf
2   def.bmp     def.pdf
4   xyz123.jpg 
                ghijkl.pdf

My current solution is as follows:

'sheetObj is the imageName set, folderName is the path to the file folder
sub makeReport(sheetObj as worksheet,folderName as string)

dim fso as new FileSystemObject
dim imageDict as Dictionary
dim fileArray as variant
dim ctr as long


'initializes fileArray for storing filename/imageName pairs
redim fileArray(1,0) 

'returns a Dictionary where key is imageName and value is lowest ID for that imageName
set imageDict=lowestDict(sheetObj)

'checks all files in folder and populates fileArray with their imageName matches
for each file in fso.getfolder(folderName).files
 fileFound=false
 'gets extension and checks if it's ".pdf"
 if isPDF(file.name) then 
  for each key in imageDict.keys
   'checks to see if base names are equal, accounting for "x-" prefix
   if equalNames(file.name,key) then 
    'adds a record to fileArray mapping filename to imageName
    addToFileArray fileArray,file.path,key  
    fileFound=true
   end if
  next
  'checks to see if filename did not match any dictionary entries
  if fileFound=false then 
   addToFileArray fileArray,file.path,""
  end if
 end if
next

'outputs report of imageDict entries and their matches (if any)
for each key in imageDict.keys
 fileFound=false
 'checks for all fileArray matches to this imageName
 for ctr=0 to ubound(fileArray,2)
  if fileArray(0,ctr)=key then
   fileFound=true
   'writes the data for this match to the worksheet
   outputToExcel sheetObj,key,imageDict(key),fileArray(0,ctr)
  end if
 next
 'checks to see if no fileArray match was found
 if fileFound=false then
  outputToExcel sheetObj,key,imageDict(key),""
 end if
next

'outputs unmatched fileArray entries
for ctr=0 to ubound(fileArray,2)
  if fileArray(1,ctr)="" then
   outputToExcel sheetObj,"","",fileArray(0,ctr)
  end if
next

This program outputs the report successfully, but it's very slow. Because of the nested For loops, as the number of imageName entries and files grows, the time to process them grows exponentially.

Is there a better way to check for matches in these sets? It might be faster if I make fileArray into a Dictionary, but a dictionary can't have duplicate keys, and this data structure needs to have duplicate entries in its fields, as a filename may match multiple imageNames and vice versa.

Actually, it grows *geometrically*, not exponentially. – RBarryYoung Jul 19 '12 at 14:59 — RBarryYoung, Jul 19 '12 at 14:59

score 0 · Answer 1 · edited May 23 '17 at 10:09

this should find the first one pretty quickly. you can do whatever you want at the inside of that last if statement. It uses an ADO recordset which should be faster than nested for loops

Sub match()
Dim sheetName As String: sheetName = "Sheet1"
Dim rst As New ADODB.Recordset
Dim cnx As New ADODB.Connection
Dim cmd As New ADODB.Command

    'setup the connection
    '[HDR=Yes] means the Field names are in the first row
    With cnx
        .Provider = "Microsoft.Jet.OLEDB.4.0"
        .ConnectionString = "Data Source='" & ThisWorkbook.FullName & "'; " & "Extended Properties='Excel 8.0;HDR=Yes;IMEX=1'"
        .Open
    End With

    'setup the command
    Set cmd.ActiveConnection = cnx
    cmd.CommandType = adCmdText
    cmd.CommandText = "SELECT * FROM [" & sheetName & "$]"
    rst.CursorLocation = adUseClient
    rst.CursorType = adOpenDynamic
    rst.LockType = adLockOptimistic

    'open the connection
    rst.Open cmd

    Dim fso As FileSystemObject: Set fso = New FileSystemObject
    Dim filesInFolder As files, f As File
    Set filesInFolder = fso.GetFolder("C:\Users\Bradley\Downloads").files

    For Each f In filesInFolder
        rst.MoveFirst
        rst.Find "imageName = '" & f.Name & "'", , adSearchForward
        If Not rst.EOF Then
            Debug.Print rst("imagename") & "::" & rst("ID") '<-- Do what you need to do here
        End If
    Next f

End Sub

FYI: I referenced this post

score 0 · Answer 2 · answered Jul 18 '12 at 23:43

Another way.

Sub Sample()
    Dim ws As Worksheet, wstemp As Worksheet
    Dim FileAr() As String
    Dim n As Long, wsLRow As Long

    Set ws = Sheets("Sheet1") '<~~ Which has imageNames   
    wsLRow = ws.Range("A" & ws.Rows.Count).End(xlUp).Row

    n = 0

    strFile = Dir("C:\Temp\*.*")

    Do While strFile <> ""
        n = n + 1
        ReDim Preserve FileAr(n)

        If Mid(strFile, Len(strFile) - 3, 1) = "." Then
            FileAr(n) = Mid(strFile, 1, Len(strFile) - 4)
        ElseIf Mid(strFile, Len(strFile) - 4, 1) = "." Then
            FileAr(n) = Mid(strFile, 1, Len(strFile) - 5)
        Else
            FileAr(n) = strFile
        End If

        strFile = Dir
    Loop

    Set wstemp = Worksheets.Add
    wstemp.Range("A1").Resize(UBound(FileAr) + 1, 1).Value = Application.Transpose(FileAr)

    ws.Range("B1:B" & wsLRow).Formula = "=IF(ISERROR(VLOOKUP(A1," & wstemp.Name & _
                                        "!A:A,1,0)),"""",VLOOKUP(A1," & wstemp.Name & "!A:A,1,0))"

    ws.Range("B1:B" & wsLRow).Value = ws.Range("B1:B" & wsLRow).Value

    Application.DisplayAlerts = False
    wstemp.Delete
    Application.DisplayAlerts = True
End Sub

Good thought, avoiding the nested loops makes the run time `O(ImageNames) + O(Files)` rather than `O(ImageNames * Files)` which will be a big improvement for large data sets. But misses a couple of points in the OP. Some image names are repeated in the files (eg `abc.jpg` and `x-abc.jpg`) and that row needs to be repeated in the output. Not all files need be considered (eg `def.xls`. Unmatched files get added on the end of the report (eg `ghijkl.pdf`). — chris neilsen, Jul 19 '12 at 00:09

score 0 · Answer 3 · answered Jul 19 '12 at 16:53

Thanks for the responses.

I ended up solving this by making an array of the filenames in folderName, using the WinAPI FindFirstFile and FindNextFile functions to go through the folder, because it's over a network so iterating through the collection returned by fso.getfolder(foldername).files was too slow.

I then made a filename/basename dictionary from the filename array, as:

key         | value
-----------------------
abc.pdf     | abc
x-lmnop.pdf | lmnop
x-abc.pdf   | abc

From this dictionary I made a reverse dictionary fileConcat that concatenated keys from duplicate basenames, as:

key         | value
-----------------------
abc         | abc.pdf,x-abc.pdf
lmnop       | lmnop.pdf

I was then able to match the basename for each imageDict key to a key in fileConcat, and then iterate through an array of the concatenated values generated by:

split(fileConcat(key))

where key is the basename of the imageDict key.

As @chrisneilsen commented, eliminating the nested For loops reduces the growth rate to O(ImageNames)+O(Files), and the function now performs at a satisfactory speed.

need a faster way to match these data sets

3 Answers3