I have drafted a PowerShell script that searches for a string among a large number of Word files. The script is working fine, but I have around 1 GB of data to search through and it is taking around 15 minutes.

Can anyone suggest changes I could make so it runs faster?

Set-StrictMode -Version latest
$path     = "c:\Tester1"
$output   = "c:\Scripts\ResultMatch1.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "Roaming"
$charactersAround = 30
$results = @()

Function getStringMatch
{

For ($i=1; $i -le 4; $i++) {
$j="D"+$i 
$finalpath=$path+"\"+$j
$files    = Get-Childitem $finalpath -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }    
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
    $document = $application.documents.open($file.FullName,$false,$true)
    $range = $document.content

    If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
         $properties = @{
            File = $file.FullName
            Match = $findtext
            TextAround = $Matches[0] 
         }
         $results += New-Object -TypeName PsCustomObject -Property $properties
       $document.close()  
    }


}

}


If($results){
    $results | Export-Csv $output -NoTypeInformation
}

$application.quit()

}

getStringMatch

import-csv $output
henrycarteruk
  • This question might be more suitable here: http://codereview.stackexchange.com/ – Mark Wragg Apr 07 '17 at 11:07
  • Have you considered enabling content indexing on that folder? I believe you can then query the windows search index from PowerShell. See: http://www.wikihow.com/Make-Windows-7-Search-File-Contents and https://www.petri.com/how-to-query-the-windows-search-index-using-sql-and-powershell – Eiríkur Fannar Torfason Apr 07 '17 at 11:49
  • Consider [using the OpenXML SDK](http://stackoverflow.com/questions/33291871/optimize-word-document-keyword-search/33292003#33292003) rather than opening the documents with Word – Mathias R. Jessen Apr 07 '17 at 12:27
  • Also, what is the outer `for` loop about? Why do you run the comparison 4 times? – Mathias R. Jessen Apr 07 '17 at 12:37

1 Answer

As mentioned in the comments, consider using the OpenXML SDK library (the newest version of the SDK is also available on GitHub): reading the files directly carries far less overhead than spinning up an instance of Word. Note that the SDK only reads Open XML files (*.docx), not legacy binary *.doc files.

Below I've turned your current function into a more generic one, using the SDK and with no dependencies on the caller/parent scope:

function Get-WordStringMatch
{
    param(
        [Parameter(Mandatory,ValueFromPipeline)]
        [System.IO.FileInfo[]]$Files,
        [string]$FindText,
        [int]$CharactersAround
    )

    begin {
        # import the OpenXML library
        Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll' |Out-Null

        # make a "shorthand" reference to the word document type
        $WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

        # construct the regex pattern
        $Pattern = ".{$CharactersAround}$([regex]::Escape($FindText)).{$CharactersAround}"
    }

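    process {
        # loop through all the files (the SDK can only open *.docx, not binary *.doc)
        foreach ($File in $Files)
        {
            $Document = $DocumentReader = $null
            try {
                # open document read-only, wrap the main part's stream in a StreamReader
                $Document       = $WordDoc::Open($File.FullName, $false)
                $DocumentReader = New-Object System.IO.StreamReader $Document.MainDocumentPart.GetStream()

                # read the entire main document part (raw XML, markup included)
                if($DocumentReader.ReadToEnd() -match $Pattern)
                {
                    # got a match? output our custom object
                    New-Object psobject -Property @{
                        File = $File.FullName
                        Match = $FindText
                        TextAround = $Matches[0]
                    }
                }
            }
            catch {
                # e.g. a *.doc or corrupt file: report it and move on to the next one
                Write-Warning "Skipping $($File.FullName): $($_.Exception.Message)"
            }
            finally {
                # clean up after every file, not just the last one
                if($DocumentReader){ $DocumentReader.Dispose() }
                if($Document){ $Document.Dispose() }
            }
        }
    }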
}
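
One detail of the pattern above is worth illustrating: `.{N}` demands exactly N characters on each side, so a hit closer than N characters to the start or end of the text is not reported. The sketch below uses a made-up sample string and search term (both are assumptions, not from the original script) to compare the fixed-width pattern with a more lenient `.{0,N}` variant:

```powershell
# Hypothetical sample text and search term, for illustration only
$text     = 'Roaming profiles are stored on the server; see the Roaming (v2) notes.'
$findText = 'Roaming (v2)'
$around   = 10

# [regex]::Escape makes the parentheses match literally instead of forming a group
$escaped = [regex]::Escape($findText)

# Fixed-width context, as in the function above: exactly $around characters on both sides
$strict  = ".{$around}$escaped.{$around}"
# Assumed variant: accept anywhere between 0 and $around characters of context
$lenient = ".{0,$around}$escaped.{0,$around}"

$strictHit  = $text -match $strict    # only 7 characters follow the term, so no match
$lenientHit = $text -match $lenient   # matches and fills $Matches
$context    = $Matches[0]             # '; see the Roaming (v2) notes.'
```

Whether the lenient form is preferable depends on whether matches near the document boundaries matter for your report.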

Now that you have a nice function that supports pipeline input, all you need to do is gather your documents and pipe them to it!

# variables
$path     = "c:\Tester1"
$output   = "c:\Scripts\ResultMatch1.csv"
$findtext = "Roaming"
$charactersAround = 30

# gather the files
$files = 1..4|ForEach-Object {
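    # $_ is the current number (1..4) coming down the pipeline
    $finalpath = Join-Path $path "D$_"
    # Extension includes the leading dot and no wildcard, e.g. '.docx'
    Get-Childitem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and (@('.docx','.doc') -contains $_.Extension) }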
}

# run them through our new function
$results = $files |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround

# got any results? export it all to CSV
if($results){
    $results |Export-Csv -Path $output -NoTypeInformation
}

Since all of our components now support pipelining, you could do it all in one go:

1..4|ForEach-Object {
    $finalpath = Join-Path $path "D$_"
    Get-Childitem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and (@('.docx','.doc') -contains $_.Extension) }
} |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround |Export-Csv -Path $output -NoTypeInformation
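
To see why the `begin`/`process` split matters here, the following minimal sketch uses a hypothetical `Get-Doubled` function (not part of the script above) with the same parameter shape. The `begin` block runs once, which is where the expensive `Add-Type` call lives in `Get-WordStringMatch`, while `process` runs for each piped item:

```powershell
# Hypothetical stand-in for Get-WordStringMatch, illustrating block semantics only
function Get-Doubled {
    param(
        [Parameter(Mandatory,ValueFromPipeline)]
        [int[]]$Number
    )
    # runs once, before the first item arrives (one-time setup goes here)
    begin   { $factor = 2 }
    # runs once per piped item; the foreach also covers -Number 1,2,3 passed directly
    process { foreach ($n in $Number) { $n * $factor } }
}

$piped  = 1..4 | Get-Doubled        # items arrive one at a time
$direct = Get-Doubled -Number 1,2,3 # whole array arrives in a single process call
```

Because results are emitted as they are produced, the downstream `Export-Csv` can start writing before the last file has been searched.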
Mathias R. Jessen