SCRIPT PURPOSE
The idea behind the script is to recursively extract the text from a large amount of documents and update a field in an Azure SQL database with the extracted text. Basically we are moving away from Windows Search of document contents to an SQL full text search to improve the speed.
ISSUE
When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows. Here is the section of the script that processes the files:
foreach ($list in (Get-ChildItem ( join-path $PSScriptRoot "\FileLists\*" ) -include *.txt )) {
## Word object
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
$word.DisplayAlerts = 0
Write-Output ""
Write-Output "################# Parsing $list"
Write-Output ""
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
foreach ($file in (Get-Content $list)) {
if ($file -like "*-*" -and $file -notlike "*~*") {
Write-Output "Processing: $($file)"
Try {
$doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")
if ($doc) {
$fileName = [io.path]::GetFileNameWithoutExtension($file)
$fileName = $filename + ".txt"
$doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
$doc.Close()
$4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
$text = Get-Content -raw "$env:TEMP\$fileName"
$text = $text.replace("'", "''")
$query += "
('$text', $4ID),"
Remove-Item -Force "$env:TEMP\$fileName"
<# Upload to azure #>
$query = $query.Substring(0,$query.Length-1)
$query += ";"
Invoke-Sqlcmd @params -Query $Query -ErrorAction "SilentlyContinue"
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
}
}
Catch {
Write-Host "$($file) failed to process" -ForegroundColor RED;
continue
}
}
}
Remove-Item -Force $list.FullName
Write-Output ""
Write-Output "Uploading to azure"
Write-Output ""
<# Upload to azure #>
Invoke-Sqlcmd @params -Query $setQuery -ErrorAction "SilentlyContinue"
$word.Quit()
TASKKILL /f /PID WINWORD.EXE
}
Basically it parses through a folder of .txt files that contain x amount of document paths, creates a T-SQL update statement and runs against an Azure SQL database after each file is fully parsed. The files are generated with the following:
if (!($continue)) {
if ($pdf){
$files = (Get-ChildItem -force -recurse $documentFolder -include *.pdf).fullname
}
else {
$files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
}
$files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
$i=0; Get-Content $documentFile -ReadCount $interval | %{$i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")}
}
The $interval variable defines how many files are set to be extracted for each given upload to azure. Initially i had the word object being created outside the loop and never closed until the end. Unfortunately this doesn't seem to work as every time the script hits a file it cannot open, every file that follows will fail, until it reaches the end of the inner foreach loop foreach ($file in (Get-Content $list)) {
.
This means that to get the expected outcome i have to run this with an interval of 1 which takes far too long.