I'm working on a PowerShell script to convert a docx to HTML, and also to change the encoding of the HTML, because by default it saves it as windows-1252.

I need this because I later send the saved HTML as the body of an email, which is also sent by PowerShell. As I am Spanish, I need accents and tildes to show up (right now they appear as ?).

I tried the SaveAs method with all the parameters, but I couldn't get it to work.

This is my script:

$MSWord = New-Object -ComObject Word.Application
$MSWord.Documents.Open(“C:\Users\USER\Videos\CAMBIO_TURNO.docx”)
$MSWord.Visible = $false

# Save HTML
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], “wdFormatHTML”);
$path = “C:\Users\USER\Videos\CAMBIO_TURNO.html”
$MSWord.ActiveDocument.SaveAs([ref]$path, [ref]$saveFormat)

# Close File

$MSWord.ActiveDocument.Close()
$MSWord.Quit()

Then, to send it to me, I use this other code on PowerShell:

$OutputEncoding = [System.Text.Encoding]::UTF8

$body = [IO.File]::ReadAllText(“C:\Users\USER\Videos\CAMBIO_TURNO.html”)

Send-MailMessage -To “EMAIL@EMAIL” -From “EMAIL@EMAIL” -Subject “CAMBIO” -Body $body -Encoding $OutputEncoding -BodyAsHtml -Attachments “C:\Users\USER\Videos\CAMBIO_TURNO.xlsx” -Dno onSuccess, onFailure -SmtpServer smtp.gmail.com -Credential EMAIL@EMAIL

SECOND UPDATE

(I read the question this one is marked as a duplicate of, "Word Document.SaveAs ignores encoding, when calling through OLE, from Ruby or VBS", but it didn't solve my problem: that Word configuration doesn't work.)

Here is what I tried after saving my document with the web options as utf-8:

# Define the output encoding for the console. It doesn't seem to work: I typed ñ and ó and they appear as ?? because the hexadecimal values aren't converted to the right charset
$OutputEncoding= New-Object -typename System.Text.ASCIIEncoding

# Open word to add input into the signature file
$MSWord = New-Object -ComObject word.application
$MSWord.Documents.Open('C:\Users\USER\Videos\CAMBIO_TURNO.docx')

 # Save HTML
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], 'wdFormatFilteredHTML');

$path = 'C:\Users\USER\Videos\CAMBIO_TURNO.html'

$default = [Type]::Missing
$MSWord.ActiveDocument.SaveAs2([ref]$path, [ref]$saveFormat, [ref]$default, [ref]$default, [ref]$default, [ref]$default, [ref]$default, [ref]$default, [ref]$default, [ref]$default, [ref]$default, [ref]28591)

# Close File
$MSWord.ActiveDocument.Close()
$MSWord.Quit()

$HTMLw = Get-Content -Path 'C:\Users\USER\Videos\CAMBIO_TURNO.html' -Encoding ASCII -Force
$HTMLw -replace 'charset=windows-1252','charset=ISO-8859-1' | Set-Content -Path 'C:\Users\USER\Videos\CAMBIO_TURNO.html' -Encoding ASCII -Force
Martijn Pieters
Patricia

1 Answer

For one thing, you should avoid typographic quotes (“ ”). Always use straight quotes in code (").

With that said, the problem you're facing is most likely that passing a string with the name of a symbolic constant doesn't work. Either use the numeric value of the constant or define a constant yourself:

New-Variable -Name wdFormatHTML -Value 8 -Option Constant
$MSWord.ActiveDocument.SaveAs($path, $wdFormatHTML)

Alternatively you should be able to resolve the constants via the Interop API, but I don't have an Office installation at hand right now, so I can't test.
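
As an untested sketch of that Interop lookup: the snippet below tries to load the Word interop assembly and read the value from the `WdSaveFormat` enum, and falls back to the numeric value 10 (`wdFormatFilteredHTML`) when the assembly isn't available. It assumes the assembly is registered under the name `Microsoft.Office.Interop.Word`, which depends on how Office was installed.

```powershell
# Numeric value of wdFormatFilteredHTML, used when the interop assembly is missing.
$wdFormatFilteredHTML = 10

try {
    # Assumption: the Word primary interop assembly is installed under this name.
    Add-Type -AssemblyName Microsoft.Office.Interop.Word -ErrorAction Stop

    # -as [type] yields $null instead of throwing when the type can't be resolved.
    $enumType = 'Microsoft.Office.Interop.Word.WdSaveFormat' -as [type]
    if ($enumType) {
        $wdFormatFilteredHTML = [int][Enum]::Parse($enumType, 'wdFormatFilteredHTML')
    }
} catch {
    # No interop assembly on this machine; keep the numeric fallback.
}
```

Either way `$wdFormatFilteredHTML` ends up as a plain integer that can be passed to `SaveAs`.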

You also didn't specify the desired encoding of the output file when saving.

New-Variable -Name wdFormatHTML -Value 8 -Option Constant
$default = [Type]::Missing
$MSWord.ActiveDocument.SaveAs($path, $wdFormatHTML, $default, $default, $default, $default, $default, $default, $default, $default, $default, 65001)
Ansgar Wiechers
  • Wow! I'm gonna try it! Thank you so much. I tried doing the variable thing, but I did it wrong I guess. I wrote `$def = [ref]::missing` and it didn't work, obviously. – Patricia Nov 22 '17 at 14:36
  • Hi! I've tried what you sent me, but it doesn't convert the HTML document to UTF-8 or Latin-1 (when typing the code number). This is the head of the HTML: ` ` and the encoding that Notepad++ shows for the HTML is also "Windows-1252". Thanks for your time, though – Patricia Nov 22 '17 at 15:58
  • Try saving the file as `wdFormatFilteredHTML` (numeric value 10). If that also doesn't adjust the meta tag accordingly you probably need to change the value in the exported HTML. – Ansgar Wiechers Nov 22 '17 at 16:11
  • Hello, First: thank you for all your help. I tried this `New-Variable -Name wdFormatHTML -Value 10 -Option Constant` but it always throws the same error (even if I delete the "-Option Constant" part ). ERROR: `can't overwrite wdFormatHTML because is read-only or constant`. So, then, I tried with this: `$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");` but at the end, the encoding and charset is always w1252. – Patricia Nov 23 '17 at 10:37
  • You can't re-define a constant. It wouldn't be all that constant if you could, would it? It doesn't have to be a constant anyway. Simply define a variable `$wdFormatFilteredHTML = 10` and use that if you want to be able to play around with the value. – Ansgar Wiechers Nov 23 '17 at 16:20
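
If Word keeps writing windows-1252 whatever is passed to `SaveAs`, changing the value in the exported HTML could be done as a separate step after the export. The following is only a sketch under a few assumptions: `Convert-HtmlToUtf8` is a made-up helper name, the file on disk really is windows-1252, and the meta tag contains the literal text `charset=windows-1252`. It decodes the file with code page 1252, updates the charset declaration, and writes the text back as UTF-8 so the bytes and the meta tag agree.

```powershell
# Hypothetical helper (not part of Word or PowerShell): re-encode an exported
# HTML file from windows-1252 to UTF-8 and update its charset declaration.
function Convert-HtmlToUtf8 {
    param(
        # Use an absolute path: .NET file APIs don't follow the PowerShell location.
        [Parameter(Mandatory = $true)] [string] $Path
    )

    # Assumption: the export is encoded as windows-1252 (code page 1252).
    $cp1252 = [System.Text.Encoding]::GetEncoding(1252)
    $html = [System.IO.File]::ReadAllText($Path, $cp1252)

    # Keep the meta tag in step with the bytes written below.
    $html = $html -replace 'charset=windows-1252', 'charset=utf-8'

    # UTF8Encoding($false) writes UTF-8 without a byte order mark.
    $utf8 = New-Object System.Text.UTF8Encoding($false)
    [System.IO.File]::WriteAllText($Path, $html, $utf8)
}
```

It would be called as `Convert-HtmlToUtf8 -Path 'C:\Users\USER\Videos\CAMBIO_TURNO.html'` once Word has been closed. Unlike reading the file with `-Encoding ASCII`, decoding it as windows-1252 first keeps characters such as ñ and ó intact.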