Extracting whole Sentences from PDFs (as best as possible) - Plain Text From PDF without inserting line breaks

Question

I believe I have finally come up with a way to extract plain text from PDFs without inserted line breaks, whilst retaining intended carriage returns, using VBA, Acrobat and Word combined.

Previous answers using either Word or Acrobat on its own ran into issues. Word would occasionally omit text that it interpreted as images, and Acrobat sometimes could not handle PDFs with complex structure and generated a blank text file.

Having tinkered with Word, I realise that it has an option to generate plain text without inserting line breaks (InsertLineBreaks:=False in the code further down). Importantly, the text generated retains intended carriage returns.

Acrobat also does this automatically when generating a plain text file; however, given the issue with unstructured PDFs, I think Word is the better bet. It is also likely more controllable from VBA.

By combining the two in VBA, I believe I have avoided many of these issues. The text files generated are much closer to what I have been after for the past few days, i.e. sentences are not broken by line breaks.

The VBA code below works as follows:

1. Convert all PDFs contained within a folder to Word documents (using Acrobat, to ensure no part of the PDF is omitted).
2. Use Word to convert those documents to plain text.

Update (21/12/22): the code below uses FileFormat:=wdFormatText, which may be more straightforward.

Sub ConvertDocumentsToTxt()
'Updated by Extendoffice 20181123
    Dim xIndex As Long
    Dim xFolder As Variant
    Dim xFileStr As String
    Dim xFilePath As String
    Dim xDlg As FileDialog
    Dim xActPath As String
    Dim xDoc As Document
    Application.ScreenUpdating = False
    Set xDlg = Application.FileDialog(msoFileDialogFolderPicker)
    If xDlg.Show <> -1 Then Exit Sub
    xFolder = xDlg.SelectedItems(1)
    xFileStr = Dir(xFolder & "\*.doc")
    xActPath = ActiveDocument.FullName 'full path of the active document, so the check below can actually skip it
    While xFileStr <> ""
        xFilePath = xFolder & "\" & xFileStr
        If xFilePath <> xActPath Then
            Set xDoc = Documents.Open(xFilePath, AddToRecentFiles:=False, Visible:=False)
            xIndex = InStrRev(xFilePath, ".")
            Debug.Print Left(xFilePath, xIndex - 1) & ".txt"
            xDoc.SaveAs Left(xFilePath, xIndex - 1) & ".txt", FileFormat:=wdFormatText, AddToRecentFiles:=False
            xDoc.Close True
        End If
        xFileStr = Dir()
    Wend
    Application.ScreenUpdating = True
End Sub
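
The extension swap used in the macro above (InStrRev to find the last dot, then Left) could be factored into a small helper. The following is a minimal sketch, not part of the original code: ReplaceExtension is a hypothetical name, and it assumes Windows-style paths that use backslashes.

```vba
'Hypothetical helper (not from the original post): swaps the extension of a file path.
'Assumes Windows-style paths with "\" as the separator.
Function ReplaceExtension(ByVal filePath As String, ByVal newExt As String) As String
    Dim dotPos As Long
    Dim slashPos As Long

    dotPos = InStrRev(filePath, ".")
    slashPos = InStrRev(filePath, "\")

    'Only treat the dot as an extension separator if it comes after the last backslash
    If dotPos > slashPos Then
        ReplaceExtension = Left$(filePath, dotPos - 1) & "." & newExt
    Else
        'No extension in the file name itself, so just append one
        ReplaceExtension = filePath & "." & newExt
    End If
End Function
```

For example, ReplaceExtension("C:\temp\PDFs\report.pdf", "txt") gives C:\temp\PDFs\report.txt, whereas simply appending ".txt" would give report.pdf.txt.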

So far (now updated and improved; essentially the same as my submitted answer), I have created the following working VBA script, which performs these two steps:

References set in the VBA project: Acrobat and Microsoft Scripting Runtime.

Sub LoopThroughFiles()

    Dim strFile As String
    Dim pdfPath As String
    Dim fileRoot As String
    Dim n As String
    Dim success As Boolean

    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    'Only pick up PDFs, so the .doc and .txt files written to the same folder are skipped
    strFile = Dir(fileRoot & "*.pdf")

    Do While Len(strFile) > 0

        n = strFile
        pdfPath = fileRoot & strFile
        Debug.Print pdfPath

        'Convert to Word document
        success = ConvertPdf2(pdfPath, pdfPath & ".doc")

        'Convert to plain text (only if Acrobat managed to open the PDF)
        If success Then
            Debug.Print pdfPath & ".doc"
            GetTextFromWord pdfPath & ".doc", n
        End If

        strFile = Dir 'next PDF in the folder
    Loop
End Sub

'Returns True if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns False if it fails
    If success Then

        'Give the PC some time to go through the data; without this pause it can freeze.
        'Application.Wait is an Excel method, so this assumes the macro is run from Excel.
        Application.Wait (Now + TimeValue("0:00:02"))

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

'Returns True if the Word document was opened and saved as plain text
Function GetTextFromWord(DocStr As String, n) As Boolean

    Dim filePath As String
    Dim oWd As Object, oDoc As Object, fileRoot As String
    Const wdFormatText As Long = 2, wdCRLF As Long = 0

    Set oWd = CreateObject("Word.Application")

    fileRoot = "C:\temp\PDFs" 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    Set oDoc = Nothing
    On Error Resume Next 'ignore error if no document...
    Set oDoc = oWd.Documents.Open(DocStr)
    On Error GoTo 0      'stop ignoring errors

    Debug.Print n
    If Not oDoc Is Nothing Then
        filePath = fileRoot & n & ".txt"  'filename
        Debug.Print filePath

        'InsertLineBreaks:=False is what keeps sentences whole; Encoding 1252 is Western (Windows)
        oDoc.SaveAs2 Filename:=filePath, _
            FileFormat:=wdFormatText, LockComments:=False, Password:="", _
            AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
            EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, _
            SaveFormsData:=False, SaveAsAOCELetter:=False, Encoding:=1252, _
            InsertLineBreaks:=False, AllowSubstitutions:=True, _
            LineEnding:=wdCRLF, CompatibilityMode:=0

        oDoc.Close False
        GetTextFromWord = True
    End If
    oWd.Quit

End Function

Please note I am not good at all with VBA; much of this is stitching together answers previously provided and trying to get it to loop through. I am hoping someone with VBA experience can review this and really make it more robust.

It does work, albeit quite slowly, generating the doc files and then the text files.

I hope someone familiar with VBA can help me make this solution more robust.

The files can be downloaded here: https://1drv.ms/u/s!AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=2b6DNg

Place them in a PDFs folder under C:\temp (the path used in the code), and the code should run okay.

Please let me know if you require any more information. I think this is it after a week of questions. Just the code needs tidying up.

Finally, if anyone who comes across this knows of a program that can generate plain text without inserting line breaks while retaining carriage returns, please let me know. Acrobat would be the solution, and does work in most cases, but it has to generate tags for some PDFs, which failed in my case. I would be very interested in an existing program that can batch-convert PDFs in this way.

The fundamental problem with your approach is that it strips out all other formatting. For a way to remove line breaks while preserving formatting in converted PDFs, see: https://www.msofficeforums.com/word/29880-cleaning-text-pasted-websites-mails-pdfs-etc.html. The macro there even cleans up unnecessary hyphenation. — macropod, Sep 07 '22 at 09:28
I do not want to remove all line breaks, only those inserted because a sentence reached the end of a page. The script above is not perfect, but it is the closest I have come to achieving this. If you have Acrobat and save a PDF as plain text, and then as accessible text, and compare the two outputs, you will hopefully see what I am trying to achieve. But yes, your description of removing INSERTED line breaks whilst preserving the rest of the formatting is correct. – Nick, Sep 07 '22 at 10:55
I gave what you suggested a go and it appears to just generate a bulk of text altogether. Not sure if I am missing something but this is sort of what I am trying to avoid — Nick, Sep 07 '22 at 11:57
Hey, I was just adding an update of a similar problem I came across. To be honest I have moved on from this though it is still very useful the solutions provided. It is word that identified the relative x&y to identify the carriage returns. — Nick, Dec 21 '22 at 20:33

Nick · Accepted Answer · 2022-09-10T14:42:25.383

Improved answer that exposes Word's save parameters

Change Encoding:=1252 to Encoding:=65001 (UTF-8) for unusual characters (applied below):

Sub LoopThroughFiles()

    Dim strFile As String
    Dim pdfPath As String
    Dim fileRoot As String
    Dim n As String
    Dim success As Boolean

    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    'Only pick up PDFs, so the .doc and .txt files written to the same folder are skipped
    strFile = Dir(fileRoot & "*.pdf")

    Do While Len(strFile) > 0

        n = strFile
        pdfPath = fileRoot & strFile
        Debug.Print pdfPath

        'Convert to Word document
        success = ConvertPdf2(pdfPath, pdfPath & ".doc")

        'Convert to plain text (only if Acrobat managed to open the PDF)
        If success Then
            Debug.Print pdfPath & ".doc"
            GetTextFromWord pdfPath & ".doc", n
        End If

        strFile = Dir 'next PDF in the folder
    Loop
End Sub

'Returns True if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns False if it fails
    If success Then

        'Give the PC some time to go through the data; without this pause it can freeze.
        'Application.Wait is an Excel method, so this assumes the macro is run from Excel.
        Application.Wait (Now + TimeValue("0:00:02"))

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

'Returns True if the Word document was opened and saved as plain text
Function GetTextFromWord(DocStr As String, n) As Boolean

    Dim filePath As String
    Dim oWd As Object, oDoc As Object, fileRoot As String
    Const wdFormatText As Long = 2, wdCRLF As Long = 0

    Set oWd = CreateObject("Word.Application")

    fileRoot = "C:\temp\PDFs" 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    Set oDoc = Nothing
    On Error Resume Next 'ignore error if no document...
    Set oDoc = oWd.Documents.Open(DocStr)
    On Error GoTo 0      'stop ignoring errors

    Debug.Print n
    If Not oDoc Is Nothing Then
        filePath = fileRoot & n & ".txt"  'filename
        Debug.Print filePath

        'InsertLineBreaks:=False is what keeps sentences whole; Encoding 65001 is UTF-8
        oDoc.SaveAs2 Filename:=filePath, _
            FileFormat:=wdFormatText, LockComments:=False, Password:="", _
            AddToRecentFiles:=False, WritePassword:="", ReadOnlyRecommended:=False, _
            EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, _
            SaveFormsData:=False, SaveAsAOCELetter:=False, Encoding:=65001, _
            InsertLineBreaks:=False, AllowSubstitutions:=True, _
            LineEnding:=wdCRLF, CompatibilityMode:=0

        oDoc.Close False
        GetTextFromWord = True
    End If
    oWd.Quit

End Function

I think using VBA that's it tbh! I guess it could be tidied up a bit better, but apart from that the text file results I'm seeing are probably the best I can achieve. It nicely keeps sentences together like 99% of the time. The occasional line needs tweaking in the generated text file prior to further use with Power Query, but this is no big issue. I appreciate all your help. – Nick, Sep 09 '22 at 20:05
Will likely turn my attention to using command prompt tools to see if this can be achieved. however there's something about acrobats interpretation when generating the word doc that enables much of this as for the most part sentences are identified as whole when switching on the paragraph marks. — Nick, Sep 09 '22 at 20:08
I gave it a go but must've done something wrong as a result was just a large block of text. — Nick, Sep 09 '22 at 20:11
Yes, its just interesting that apart from Acrobat there doesn't appear to be software that can handle this. The solution you have helped me with is the best I think can be done. Maybe an AI solution in the future. But with that, you could just ask it to interpret your questions for the PDF. — Nick, Sep 09 '22 at 20:40
Feel free to modify my posts if you think appropriate. I am unsure if you saw my very first post on this topic. Fundamentally I want a dynamic way of searching any PDF sentence by sentence. I have to accept this there is unlikely any perfect solution given all variables. However conversion to word appears to have reasonable interpretation from SDS documents and I assume other PDFs which are simpler than this. With the exception of tables in a PDF I can’t think of a more complex scenario that the above so I am confident this will work for most scenarios with reasonable success. — Nick, Sep 10 '22 at 21:44
I think you agree that the text files generated are actually quite decent? They have exceeded my expectations, considering that a PDF doesn't really have structure. The conversion gives it structure and makes automation of PDF searching, in Excel's Power Query, much more powerful than it was a week ago, when sentences were getting jumbled and words were not separated correctly. – Nick, Sep 10 '22 at 21:47
I suppose the title was modified to attract the attention of anyone who may come across a similar issue. Line feed is correct, as this is what Word says. Carriage return may not be correct, but the effect is the same via this method, so it may be approximate enough for similar searched questions. – Nick, Sep 10 '22 at 22:05
Ah I see what you are saying now. I do still think for the purpose of the question, this is useful, but I take your point. Interesting you say pdftotext can do this, I didn't find a way. May have been too preoccupied with this as it always felt on the cusp of being achieved. — Nick, Sep 10 '22 at 23:42

score 0 · Answer 2 · edited Sep 17 '22 at 08:37


Try using the below:

strTemp = Replace(FromString, vbNewLine, " ") 'CRLF first, so each break becomes a single space
strTemp = Replace(strTemp, vbCr, " ")
strTemp = Replace(strTemp, vbLf, " ")

I use the free tool xpdf reader to convert a PDF.
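
Replacing every line break with a space, as above, collapses the whole document into one block of text. If the goal is to join only the wrapped lines inside a paragraph, a variant along the following lines may be closer. This is a minimal sketch under a loud assumption: a blank line marks an intended paragraph break, which will not hold for every PDF conversion. JoinWrappedLines is a hypothetical name, not an existing function.

```vba
'Hypothetical helper (not from the original posts): joins wrapped lines into one line per paragraph.
'ASSUMPTION: a blank line marks an intended paragraph break.
Function JoinWrappedLines(ByVal FromString As String) As String
    Dim parts() As String
    Dim result As String
    Dim current As String
    Dim s As String
    Dim i As Long

    'Normalise CRLF and lone CR to LF so the text can be split in one pass
    FromString = Replace(FromString, vbCrLf, vbLf)
    FromString = Replace(FromString, vbCr, vbLf)
    parts = Split(FromString, vbLf)

    For i = LBound(parts) To UBound(parts)
        s = Trim$(parts(i))
        If Len(s) = 0 Then
            'Blank line: close the paragraph collected so far
            If Len(current) > 0 Then
                result = result & current & vbCrLf
                current = ""
            End If
        ElseIf Len(current) = 0 Then
            current = s                 'first line of a new paragraph
        Else
            current = current & " " & s 'wrapped line: join with a single space
        End If
    Next i

    'Flush the last paragraph if the text did not end with a blank line
    If Len(current) > 0 Then result = result & current & vbCrLf
    JoinWrappedLines = result
End Function
```

Called on text where a sentence is split over two lines and followed by a blank line, this returns the sentence on a single line and starts the next paragraph on a new line. It does not try to repair hyphenation.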

edited Sep 17 '22 at 08:37 by Iva · answered Sep 07 '22 at 07:22 by Daniel Sanders

can xpdf reader avoid inserting line breaks and keep carriage returns? – Nick Sep 07 '22 at 08:30
The issue with the script currently is not really the text file generated (albeit some section numbers, like 1.2. Section X, are being omitted). It's more that it is a bit of a botched job, and I'm wondering if it can be written better. – Nick Sep 07 '22 at 08:32
@KJ I have downloaded poppler on mac. Do you mind if I message you directly to work this out? – Nick Sep 08 '22 at 23:39
