I have the following part of a script that converts a Word document (previously converted from a PDF) to a text file. It is normally a function within a larger script, but a standalone Sub is enough for the purposes of this question.

Sub GetTextFromWord()

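    Dim fso As FileSystemObject          ' needs a reference to Microsoft Scripting Runtime
    Dim fileStream As TextStream
    Dim oWd As Object, oDoc As Object    ' late-bound Word objects
    Dim filePath As String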
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set oDoc = oWd.Documents.Open("C:\temp\PDFs\XFA006HH - Granular Sulphamic acid - Univar - 19-05-2021.pdf.doc")

    filePath = "C:\temp\PDFs\" & "TEST" & ".txt"  'filename
    Debug.Print filePath
    'open text stream as unicode
    Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                
    fileStream.Write oDoc.Range.Text
    fileStream.Close
    oDoc.Close

    oWd.Quit

End Sub

The generated TEST file is fine, except that it lacks the subsection numbers that would normally be present.

I also generated the text file manually from the open Word document (File > Export > Change File Type > Plain Text > Save), with Windows (Default) selected, Insert line breaks unticked, and Allow character substitution ticked.

The generated text file is as desired.

When I record a macro in Word for the same steps, I get the following script:

Sub Macro2()

' Macro2 Macro
'
'
    ActiveDocument.SaveAs2 FileName:= _
        "XFA006HH - Granular Sulphamic acid - Univar - 19-05-2021.pdf.txt", _
        FileFormat:=wdFormatText, LockComments:=False, Password:="", _
        AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
        EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
        :=False, SaveAsAOCELetter:=False, Encoding:=1252, InsertLineBreaks:=False _
        , AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0
End Sub

I would like to modify the first script to incorporate these parameters, mainly InsertLineBreaks:=False and AllowSubstitutions:=True; I am unsure whether the others are needed to reproduce the text file exactly. Ideally, I would incorporate as many as feasible so that I can experiment and see their effect on the generated file. Parameters such as LockComments:=False and Password:="" are not required.

How can I incorporate these parameters into the first script?

fso.CreateTextFile doesn't appear to give such options so I wonder if I need to rethink this.

Link to Doc file:

https://1drv.ms/u/s!AsrLaUgt0KCLhiPc1u_vlYjFfsev?e=nlFn76

Update:

(Screenshot; image not preserved.)

Nick
  • `CreateTextFile` does not accept such parameters; unfortunately, that is all it is able to do... You should proceed the other way round: use `SaveAs2` **instead of the VBScript object method**, which is faster but has some limitations... – FaneDuru Sep 09 '22 at 11:15

1 Answer

Please try the following updated code. It replaces the VBScript object method with the `SaveAs2` call you recorded and tested:

Sub GetTextFromWord()
    ' Word constants declared locally: with late binding (no reference to the
    ' Word object library) they would otherwise be undefined.
    Const wdFormatText As Long = 2, wdCRLF As Long = 0

    Dim oWd As Object, oDoc As Object
    Dim filePath As String

    Set oWd = CreateObject("Word.Application")
    Set oDoc = oWd.Documents.Open("C:\temp\PDFs\XFA006HH - Granular Sulphamic acid - Univar - 19-05-2021.pdf.doc")

    filePath = "C:\temp\PDFs\" & "TEST" & ".txt"  'filename
    Debug.Print filePath

    ' Save as plain text with the options taken from the recorded macro
    oDoc.SaveAs2 FileName:=filePath, _
        FileFormat:=wdFormatText, LockComments:=False, Password:="", _
        AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
        EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, _
        SaveFormsData:=False, SaveAsAOCELetter:=False, Encoding:=1252, _
        InsertLineBreaks:=False, AllowSubstitutions:=True, _
        LineEnding:=wdCRLF, CompatibilityMode:=0

    oDoc.Close False   ' close without saving changes to the document
    oWd.Quit
End Sub
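
If special characters come out wrong with code page 1252, a UTF-8 variant might look like the sketch below. It is only a sketch and has not been run against the original document. The procedure name `SaveWordDocAsUtf8Text`, its two parameters, the constant name `encodingUtf8` and the shortened argument list are illustrative assumptions; the `SaveAs2` arguments left out are optional. `Option Explicit` is added so that a missing constant such as `wdFormatText` is reported as a compile error instead of silently being treated as an empty variable.

```vba
Option Explicit

' Word enumeration values, declared locally because the Word objects are
' late-bound (no reference to the Word object library is assumed).
Private Const wdFormatText As Long = 2
Private Const wdCRLF As Long = 0
Private Const encodingUtf8 As Long = 65001   ' hypothetical name; 65001 is the UTF-8 code page

' Hypothetical helper: opens docPath in Word and saves it as plain text to txtPath.
Sub SaveWordDocAsUtf8Text(ByVal docPath As String, ByVal txtPath As String)
    Dim oWd As Object, oDoc As Object

    Set oWd = CreateObject("Word.Application")
    Set oDoc = oWd.Documents.Open(docPath)

    ' Only the arguments that affect the text output are passed here
    oDoc.SaveAs2 FileName:=txtPath, FileFormat:=wdFormatText, _
        Encoding:=encodingUtf8, InsertLineBreaks:=False, _
        AllowSubstitutions:=True, LineEnding:=wdCRLF

    oDoc.Close False   ' close without saving changes to the document
    oWd.Quit
End Sub
```

It could be called, for example, as `SaveWordDocAsUtf8Text "C:\temp\PDFs\XFA006HH - Granular Sulphamic acid - Univar - 19-05-2021.pdf.doc", "C:\temp\PDFs\TEST.txt"`.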
FaneDuru
  • Hi, this generates the text file, but the text is completely encoded and unreadable. Not sure why this is happening, as you have used the parameters exactly as given – Nick Sep 09 '22 at 11:42
  • @KJ Hm Nope still happening – Nick Sep 09 '22 at 11:51
  • Please see the update. [Content_Types].xml at the top? – Nick Sep 09 '22 at 11:53
  • @Nick Do you have a reference to the Word object library? Please try (in Excel), in a separate testing sub: `Debug.Print wdFormatText`. What does it return in the Immediate Window? Is it 2? – FaneDuru Sep 09 '22 at 11:59
  • @FaneDuru Not getting anything returned for `Debug.Print wdFormatText` in the immediate window – Nick Sep 09 '22 at 12:08
  • @Nick This is good... I will adapt the code to 'inform' it about the values of the constants used. The code works in my environment and returns what it should, but I have a reference to the `Word` object library, so it 'understood' the values of the constants... I will do it in a minute. Please try the updated code; refresh this page to be sure that you are using it. It only has one new code line: `Const wdFormatText as Long = 2, wdCRLF as Long = 0`... If special characters appear to be transformed into something strange, try using `65001` (UTF-8) for the `Encoding` parameter. – FaneDuru Sep 09 '22 at 12:15
  • Appears to be working as required. Thank you! – Nick Sep 09 '22 at 12:36
  • @KJ I don't think this is an issue? I'm still interested in using Poppler, but this has proven very useful for my purpose – Nick Sep 09 '22 at 14:02
  • Oh gosh, I must have uploaded the wrong one; still working though – Nick Sep 09 '22 at 14:21