2

SCRIPT PURPOSE

The idea behind the script is to recursively extract the text from a large amount of documents and update a field in an Azure SQL database with the extracted text. Basically we are moving away from Windows Search of document contents to an SQL full text search to improve the speed.

ISSUE

When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows. Here is the section of the script that processes the files:

foreach ($list in (Get-ChildItem ( join-path $PSScriptRoot "\FileLists\*" ) -include *.txt )) {

    ## Word object
    $word = New-Object -ComObject word.application
    $word.Visible = $false
    $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
    $word.DisplayAlerts = 0


    Write-Output ""
    Write-Output "################# Parsing $list"
    Write-Output ""

    $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
              VALUES "

    foreach ($file in (Get-Content $list)) {
        if ($file -like "*-*" -and $file -notlike "*~*") {

            Write-Output "Processing: $($file)"
            Try {
                $doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")

                if ($doc) {
                    $fileName = [io.path]::GetFileNameWithoutExtension($file)
                    $fileName = $filename + ".txt"
                    $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
                    $doc.Close()


                    $4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
                    $text = Get-Content -raw "$env:TEMP\$fileName"
                    $text = $text.replace("'", "''")

                    $query += "
                    ('$text', $4ID),"
                    Remove-Item -Force "$env:TEMP\$fileName"

                    <# Upload to azure #>
                    $query = $query.Substring(0,$query.Length-1)
                    $query += ";"

                    Invoke-Sqlcmd @params -Query $Query -ErrorAction "SilentlyContinue"

                    $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
                      VALUES "
                }
            }
            Catch {
                Write-Host "$($file) failed to process" -ForegroundColor RED;
                continue
            }
        }
    }
    Remove-Item -Force $list.FullName


    Write-Output ""
    Write-Output "Uploading to azure"
    Write-Output ""


    <# Upload to azure #>
    Invoke-Sqlcmd @params -Query $setQuery -ErrorAction "SilentlyContinue"


    $word.Quit()
    TASKKILL /f /PID WINWORD.EXE
}

Basically it parses through a folder of .txt files that contain x amount of document paths, creates a T-SQL update statement and runs against an Azure SQL database after each file is fully parsed. The files are generated with the following:

if (!($continue)) {
    if ($pdf){
        $files = (Get-ChildItem -force -recurse $documentFolder -include *.pdf).fullname
    }
    else {
        $files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
    }


    $files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
    $i=0; Get-Content $documentFile -ReadCount $interval | %{$i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")}
}

The $interval variable defines how many files are set to be extracted for each given upload to azure. Initially i had the word object being created outside the loop and never closed until the end. Unfortunately this doesn't seem to work as every time the script hits a file it cannot open, every file that follows will fail, until it reaches the end of the inner foreach loop foreach ($file in (Get-Content $list)) {.

This means that to get the expected outcome i have to run this with an interval of 1 which takes far too long.

Owain Esau
  • 1,876
  • 2
  • 21
  • 34
  • As an aside, I would change `TASKKILL /f /PID WINWORD.EXE` into `[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null; [System.GC]::Collect(); [System.GC]::WaitForPendingFinalizers()` to release the Com object from memory – Theo Dec 19 '18 at 12:16

2 Answers2

1

This is a shot in the dark

But to me it sounds like the reason its failing is because the Word Com object is now prompting you for some action due since it cannot open the file so all following items in the loop also fail. This might explain why it works if you set the $Interval to 1 because when its 1 it is closing and opening the Com object every time and that takes forever (I did this with excel).

What you can do is in your catch statement, close and open a new Word Com object which should lets you continue on with the loop (but it will be a bit slower if it needs to open the Com object a lot).

If you want to debug the problem even more, set the Com object to be visible, and slowly loop through your program without interacting with Word. This will show you what is happening with Word and if there are any prompts that are causing the application to hang.

Of course, if you want to run it at full speed, you will need to detect which documents you can't open before hand or you could multithread it by opening several Word Com objects which will allow you to load several documents at a time.

Shadowzee
  • 537
  • 3
  • 15
  • Thankyou for this, the idea of setting word to visible is a good one ill give this a test. – Owain Esau Dec 19 '18 at 05:11
  • The main issue was the encoding of some of the files, i had to change the document open call to: `$doc = $word.Documents.Open($file, $false, $false, $false, "ttt", $def, $def, $def, $def, $def, 50001 )` issue fixed after this change. Thanks for the help – Owain Esau Jan 07 '19 at 03:14
0

As for...

ISSUE

When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows.

... then test for this as noted here...

How to check if a word file has a password?

$filename = "C:\path\to\your.doc"

$wd = New-Object -COM "Word.Application"

try {
  $doc = $wd.Documents.Open($filename, $null, $null, $null, "")
} catch {
  Write-Host "$filename is password-protected!"
}

... and skip the file to avoid the failure of the remaining files.

postanote
  • 15,138
  • 2
  • 14
  • 25
  • I am already doing that : `Try { $doc = $word.Documents.OpenNoRepairDialog($file.path, $false, $false, $false, "ttt") Catch { Write-Host "$($file) failed to process" -ForegroundColor RED;` – Owain Esau Dec 19 '18 at 01:56
  • also the password protected file was just an example, there are other reasons as to why the document wouldnt process. – Owain Esau Dec 19 '18 at 02:02
  • Allllrighty, then, so, outside of password and file lock because it's in use, or hitting a location where you don't have permissions, what other blocking scenarios would you encounter? Thus trapping all options that could be encountered. – postanote Dec 19 '18 at 05:48