PDF to plain text, Some difficult pages were encountered Adobe Acrobat XI

Question

Basic Problem: For this PDF: https://1drv.ms/u/s!AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=xSu2ZR

I am unable to save this PDF as plain text, either manually in Adobe Acrobat XI Standard or with the batch conversion script below. The generated file is blank.

Full problem: As part of my attempts to batch convert PDFs to text, I have run into a strange error where Acrobat XI shows a "Some difficult pages were encountered" message.

Disappointingly, clicking OK generates a blank text file.

The following script loops through PDF files and converts them to text files using Acrobat. It works fine for most PDFs, except ones with figures like the PDF linked above.

Sub LoopThroughFiles()
    Dim StrFile As String
    Dim pdfPath As String
    Dim fileRoot As String
    Dim success As Boolean

    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    On Error Resume Next 'keep going if one file fails

    StrFile = Dir(fileRoot & "*.pdf") 'only pick up PDFs, not the .txt files written below
    Do While Len(StrFile) > 0
        Debug.Print StrFile
        pdfPath = fileRoot & StrFile
        Debug.Print pdfPath

        success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".txt")

        StrFile = Dir 'next file
    Loop
End Sub


'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then

        'Give the PC some time to go through the data; without the pause it can freeze
        Application.Wait (Now + TimeValue("0:00:02"))

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

The error appears to come from the line jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text". If I instead use jsObj.SaveAs textPath, "com.adobe.acrobat.accesstext", the text file is generated, but for my needs it is important that the generated file is in the plain-text format.

The reason for this can be seen below in a different PDF. These are the different types of text files generated:

Plain text (each sentence extends in the horizontal direction on a single line; this is what is required):

Access text (creates more of a body of text; it separates sentences with carriage returns, which is problematic):

I reckon this is a lost cause for these sorts of PDFs, which is disappointing, as many of the PDFs I need to convert are in this format. I seem to have been plagued with issues trying to solve this one.

Anyway, I just wondered whether it is possible to disable the popup message, and whether that would allow the plain-text write to occur.

Otherwise I can't think of much else.

Tbh I agree. It's just that the person who gave me the PDFs appears to be quite cautious with them and doesn't want them shared. I suspect they may not be public and may be internal. — Nick, Sep 01 '22 at 13:19
I may just ask my company to allow the extraction method you proposed. @KJ do you think it would come out as plain text? — Nick, Sep 01 '22 at 13:20
I've requested that Xpdf be installed, so hopefully your answer will prove useful. — Nick, Sep 01 '22 at 16:11
Your _Problematic PDF_ unfortunately appears not to be downloadable anymore. — mkl, Sep 04 '22 at 13:49
@mkl sorry about that, just updated the link for you. Just tried it signed out and it worked so should work for you too. let me know — Nick, Sep 04 '22 at 13:52
Ok, I could download it but didn't find anything special in it at first glance. The problem is not unknown, though, see e.g. [this Adobe community support forum thread](https://community.adobe.com/t5/acrobat-sdk-discussions/saving-pdf-as-plain-text/td-p/10120478). Apparently Acrobat internally tries to make the file a tagged file before plain text extraction, and for PDFs like yours the auto-tagging functionality fails because it cannot recognize the meaning of structures. — mkl, Sep 05 '22 at 10:15
Do you think it would be possible to switch off the auto-tagging functionality? As a workaround, I think it may be possible to use the access text and apply transformations so that it becomes equivalent to the plain text. — Nick, Sep 05 '22 at 11:24
*"Do you think it would be possible to switch off the auto-tagging functionality?"* - I have no idea how to try that. — mkl, Sep 05 '22 at 13:49
@mkl It turns out you can switch it off in preferences; however, this doesn't change the output. — Nick, Sep 05 '22 at 15:00
That's unfortunate. Sorry, I've no idea except switching the product, either updating Acrobat (and hoping the save-as functionality for plain-text has improved) or using a completely different product for text extraction. — mkl, Sep 05 '22 at 15:33

K J · Answer 1 · 2022-09-07T02:52:33.100

It looks like the issue is specific to your Acrobat version 11, since this "works for me", although I am using the older Reader 9. However, its export as plain text is going to be what you get from pdftotext, e.g. left-aligned single lines. I am unsure whether a 10 Pro or a 20## version might be good enough, or when Adobe started massaging the natural PDF output into something richer.

Reader 9 export as plain text

Opening the file in other viewers works well enough to save it as Word or WordPad.

Or edit the PDF before saving as DOCX or converting to text.

Using pdftotext will result in a layout reflecting the true output of characters on the page (I call that plain text). However, your desire is to remove single line feeds (and possibly end-of-line hyphens), so that can be done by any find-and-replace text processing after extraction. Here I outline a possible method.

txt2par.cmd

@echo off
if not exist "%~dpn1.txt" goto help

REM because of method we need to append an extra new line to input (some cases may need two?)
echo/&echo Preparing files
echo/>temp_nl.txt&copy /b "%~dpn1.txt"+temp_nl.txt temp_out.txt >nul:

REM tool will not replace files in binary mode unless it sees there is a dummy backup to use !
echo temp_nl.txt >temp_out.txt.bak

echo/&echo Processing ...&echo/
REM 1st pass ensure binary line feeds are converted to some plain text
fart.exe  -q -b --binary --c-style  temp_out.txt "\x0D\x0A" "<NL>" >nul: 2>&1

REM 2nd pass ensure double "<NL><NL>" are converted back to single new line
fart.exe  --c-style  temp_out.txt "<NL><NL>" "\x0D\x0A\x0D\x0A"

echo/&echo de-hyphenating line ends&echo/
REM 3rd pass replace an end-of-line hyphen with a space (caution: that may not always be desirable)
fart.exe  --c-style  temp_out.txt "\x2D<NL>" "\x20"

REM 4th pass collapse long runs of spaces (repeated, because each pass only shortens a run)
fart.exe  --c-style  temp_out.txt "\x20\x20\x20\x20\x20\x20\x20\x20" "\x20\x20\x20\x20"
fart.exe  --c-style  temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
fart.exe  --c-style  temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
fart.exe  --c-style  temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
fart.exe  --c-style  temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
fart.exe  --c-style  temp_out.txt "\x20\x20\x20\x20" "\x20\x20"
fart.exe  --c-style  temp_out.txt "\x20\x20\x20" "\x20\x20"
REM then reduce the leading space after a remaining line marker to a single space
fart.exe  --c-style  temp_out.txt "<NL>\x20\x20" "<NL>\x20"



REM 5th pass ensure remaining line markers are converted to single space
fart.exe  --c-style  temp_out.txt "<NL>" "\x20"

echo/
echo Done
pause
goto :eof
:help
echo/
echo Input must be a filename.txt accepts drag and drop
echo/
echo Usage txt2par filename.txt
echo/
echo Will convert single line feeds to space and
echo convert double line feeds to single line gap
echo/
pause

That may be good enough for some sources; however, it needs more consideration for your complex templated layouts, possibly by handling runs of two or more spaces differently (easiest done in a more powerful string editor, or else you end up jumping through an unknown number of hoops).
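For comparison, the same unwrapping idea can be sketched in a few lines of Python instead of repeated find-and-replace passes. This is only an illustration, not the script above: the function name `unwrap_paragraphs` and its `dehyphenate` flag are made up here, blank lines are assumed to mark paragraph breaks, and (unlike the 3rd pass above, which substitutes a space) the optional de-hyphenation glues the two word halves together.

```python
import re


def unwrap_paragraphs(text: str, dehyphenate: bool = False) -> str:
    """Join hard-wrapped lines so each paragraph ends up on one line."""
    # Normalise Windows/old-Mac line endings so only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank lines (possibly holding spaces) separate paragraphs.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text)
    result = []
    for para in paragraphs:
        if dehyphenate:
            # Glue "para-\ngraph" back into "paragraph".
            # Caution: this also removes genuine hyphens at line ends.
            para = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", para)
        # Collapse every remaining whitespace run, line breaks included,
        # into a single space.
        para = " ".join(para.split())
        if para:
            result.append(para)
    # Keep one empty line between paragraphs.
    return "\n\n".join(result)


if __name__ == "__main__":
    sample = "First line\r\nof a para-\r\ngraph.\r\n\r\nSecond   paragraph\r\nhere.\r\n"
    print(unwrap_paragraphs(sample, dehyphenate=True))
```

Because all whitespace runs are collapsed in one step, there is no need to guess how many passes a long run of spaces requires; the trade-off is that any column alignment from a templated layout is lost.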

Thanks for this. I have previously modified my script to open the PDF in Acrobat, save it as a Word doc, open the Word doc and save it as plain text, but some text is still omitted and sentences are occasionally split. Perhaps the way plain text is generated in Acrobat is unique. Interesting that something is generated in Reader 9. Will look into this. — Nick, Sep 05 '22 at 07:38
I also found that pdftotext generates the text files in this way. What I need is for every line to be a whole sentence. A line can even contain multiple sentences, but they must be complete, as per the example images. — Nick, Sep 05 '22 at 07:41

Nick · Accepted Answer · 2022-09-09T20:19:33.313

From: "Plain Text From PDF without inserting line breaks but retaining carriage returns using VBA". This is a working solution, but it requires improvement.

Change Encoding:=1252 to Encoding:=65001 (UTF-8) for unusual characters.

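Sub LoopThroughFiles()

    Dim StrFile As String
    Dim pdfPath As String
    Dim fileRoot As String
    Dim n As String
    Dim success As Boolean, success2 As Boolean

    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    On Error Resume Next 'keep going if one file fails

    StrFile = Dir(fileRoot & "*.pdf") 'only pick up PDFs, not the .doc/.txt files written below
    Do While Len(StrFile) > 0

        Debug.Print StrFile
        n = StrFile 'remember the name for the .txt output
        pdfPath = fileRoot & StrFile
        Debug.Print pdfPath

        'Convert to WordDoc
        success = False
        success = ConvertPdf2(pdfPath, pdfPath & ".doc")

        'Convert to PlainText (Word is started and closed inside GetTextFromWord)
        If success Then
            Debug.Print pdfPath & ".doc"
            success2 = GetTextFromWord(pdfPath & ".doc", n)
        End If

        StrFile = Dir 'next file
    Loop
End Sub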

'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then

        'Give the PC some time to go through the data; without the pause it can freeze
        Application.Wait (Now + TimeValue("0:00:02"))

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

'returns true if the Word document could be opened and saved as text
Function GetTextFromWord(DocStr As String, n) As Boolean

    Dim filePath As String
    Dim oWd As Object, oDoc As Object, fileRoot As String
    Const wdFormatText As Long = 2, wdCRLF As Long = 0

    Set oWd = CreateObject("word.application")

    fileRoot = "C:\temp\PDFs" 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    Set oDoc = Nothing
    On Error Resume Next 'ignore error if no document...
    Set oDoc = oWd.Documents.Open(DocStr)
    On Error GoTo 0      'stop ignoring errors

    Debug.Print n
    If Not oDoc Is Nothing Then
        filePath = fileRoot & n & ".txt"  'filename
        Debug.Print filePath

        oDoc.SaveAs2 Filename:=filePath, _
            FileFormat:=wdFormatText, LockComments:=False, Password:="", _
            AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
            EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
            :=False, SaveAsAOCELetter:=False, Encoding:=1252, InsertLineBreaks:=False _
            , AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0

        oDoc.Close False
        GetTextFromWord = True 'report success
    End If
    oWd.Quit

End Function
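
On the Encoding note at the top of this answer: 1252 is the Windows Western European code page and 65001 is UTF-8. As a rough illustration only (Python rather than VBA, with a made-up helper `can_encode` and made-up sample strings), this shows why some characters cannot be stored as-is under code page 1252 while UTF-8 accepts them:

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Hypothetical sample: the micro sign is in cp1252, beta and "≤" are not.
    for sample in ("5 µg", "β-carotene ≤ 5 µg"):
        print(repr(sample), can_encode(sample, "cp1252"), can_encode(sample, "utf-8"))
```

If the PDFs contain Greek letters, mathematical symbols or similar, checking a sample of the extracted text this way is one quick way to decide whether the 65001 setting is needed.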
