0

I have the following Code in VBA following an answer to my last question, which iterates over a list of URLs and generates a text file using the word to extract the text.

For the following URL however; https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16, the code fails with Run-time error '5' Invalid procedure call or argument.

Whats strange is the text of the PDF prints to the console but wont write to the text file.

I can't quite understand why this happens as the PDF appears to be no different to the others, which are successful.

VBA Code: requires Microsoft Scripting Runtime reference

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject
    Set fso = New FileSystemObject
    Dim fileStream As TextStream
    
    Dim oWd As Object, oDoc As Object, c As Range
    
    Set oWd = CreateObject("word.application")
    
    n = 1
    
    For Each c In Range("B2:B200").Cells
    
    filePath = Range("D2").Value & "\" & Range("A" & n).Value & ".txt"
    
    Debug.Print filePath
    
    Set fileStream = fso.CreateTextFile(filePath)
   
    Debug.Print c.Value
        
        With oWd.Documents.Open(c.Value)

            Debug.Print .Range.Text
            'write to a file...
            fileStream.WriteLine .Range.Text
            fileStream.Close
            
        End With
        
        n = n + 1
        
    Next c
    
    oWd.Quit
    
End Sub

SetUp: enter image description here

URLs:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

So firstly, I think it's likely that the issue with the last URL can be solved. However, I think I also need to introduce some error handling that generates a blank text file and moves on to the next but im not sure how to achieve this.

I am not very competent with VBA, I have specified to repeat for rows B2:B200, but ideally, it would be good if, regardless of how many URLs, it just works to that number.

Also not sure if the logic in my code is particularly robust/if there is a better way to extract text from a URL.

The expected output is:

enter image description here

And here is an example of the generated Text file.

enter image description here

Nick
  • 789
  • 5
  • 22
  • Hi, I appreciate your response and think it will be useful for others however, I am trying to achieve this through my company's computer. When I run your suggestion I get the error `pdftotext' is not recognized as an internal or external command`. this is an issue because i.e cannot really download from Github, especially since I am trying to create an excel based tools for other colleagues and they would have to download it too. I suspect I will use your suggestion for my own personal use – Nick Aug 21 '22 at 13:02
  • im sorry, im not really proficient enough with this, I just wanted a way to get the Text files rapidly from a list of PDF URLs so I can get the data back into power query and go from there. If you want to provide a series of steps I can follow as an answer I will be more than happy to follow them – Nick Aug 21 '22 at 13:23

2 Answers2

1

Seems like the document returned from that problem URL contained some characters which couldn't be written to a non-unicode text file.

See comments inline:

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(url)
            On Error GoTo 0      'stop ignoring errors
            If Not oDoc Is Nothing Then
                filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                Debug.Print filePath
                'open text stream as unicode
                Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                fileStream.Write oDoc.Range.Text
                fileStream.Close
                oDoc.Close
                c.Interior.Color = vbGreen 'flag OK
            Else
                c.Interior.Color = vbRed   'flag problem
            End If
        End If 'have url
    Next c
    
    oWd.Quit
End Sub
Tim Williams
  • 154,628
  • 8
  • 97
  • 125
  • Hey again, this works really well however I have come across an example: `https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b` where the word conversion detected the text at the start as an image. I suspect this is more of an issue with Word processing however wonder if you think it would be possible to explicit how it generates the word file from the PDF. i.e. no images – Nick Aug 22 '22 at 13:50
  • I see that too - Seems like Word might have limitations on how well it converts from PDF. You might need to use a different tool for the conversion. If you have the full Acrobat (not just the "Reader" version) you can likely script that to do the conversion. – Tim Williams Aug 22 '22 at 16:03
  • I have adobe acrobat reader DC. I can Save a downloaded PDF as a text file manually using this. Unsure if this will work. – Nick Aug 22 '22 at 17:39
  • If this is work-related then presumably your employer should be willing to give you the "pro" upgrade... If not (and you have the required permissions) then you could look at one of the tools mentioned in the other answers/comments – Tim Williams Aug 22 '22 at 17:46
  • im curious why you specify pro, is this necessary? Although Im sure I can ask the company to have this installed, if successful im keen to share this tool with others. Just how things work but when people have to start downloading other things (which can take a few days to be approved) it tends to kill the momentum. – Nick Aug 22 '22 at 17:51
  • I'm not an expert but I *think* the reader version has fewer options for automation vs the "acrobat" version - see here for an intro: http://khkonsulting.com/2009/03/adobe-acrobat-and-vba-an-introduction/#:~:text=What%20I%E2%80%99m%20presenting%20here%20is%20valid%20for%20the%20Adobe%20Acrobat%2C%20Reader%20only%20supports%20a%20small%20subset%20of%20features – Tim Williams Aug 22 '22 at 17:58
  • Or maybe https://www.exceldemy.com/extract-specific-data-from-pdf-to-excel-using-vba/ – Tim Williams Aug 22 '22 at 18:02
1

using excel notepad or any text means your comfortable with such as export columns as csv build a cmd file that can have extra constructs like not show during runtime, but personally like to see confirmed progress NOTE using call the second parameter must be "quoted".

download2txt.cmd

call URLpdf2txt Name1  "https://hpvchemicals.oecd.org/UI/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7"
call URLpdf2txt Name2  "https://hpvchemicals.oecd.org/UI/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725"
call URLpdf2txt Name3  "https://hpvchemicals.oecd.org/UI/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b"
call URLpdf2txt Name4  "https://hpvchemicals.oecd.org/UI/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4"
call URLpdf2txt Name5  "https://hpvchemicals.oecd.org/UI/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd"
call URLpdf2txt Name6  "https://hpvchemicals.oecd.org/UI/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683"
call URLpdf2txt Name7  "https://hpvchemicals.oecd.org/UI/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea"
call URLpdf2txt Name8  "https://hpvchemicals.oecd.org/UI/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b"
call URLpdf2txt Name9  "https://hpvchemicals.oecd.org/UI/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16"
call URLpdf2txt Name10  "https://hpvchemicals.oecd.org/UI/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6"

in comment you mention pdftotext is not installed as a native command so first step would be to ensure a local copy so a helper script as specified URLpdf2txt would be

URLpdf2txt.cmd

@echo off
if not exist xpdf-tools-win-4.04/bin32/pdftotext.exe curl -o %temp%\xpdftools.zip https://dl.xpdfreader.com/xpdf-tools-win-4.04.zip && tar -m -xf %temp%\xpdftools.zip xpdf-tools-win-4.04/bin32/pdftotext.exe
curl -o "%~dpn1.pdf" "%~2"
"xpdf-tools-win-4.04/bin32/pdftotext.exe" -nopgbrk -layout -enc UTF-8 "%~dpn1.pdf" "%~dpn1.txt"

enter image description here

K J
  • 8,045
  • 3
  • 14
  • 36
  • `This program is blocked by group policy`. Will use it for personal though use as quicker. – Nick Aug 22 '22 at 14:03
  • Do you know what the Mac equivalent of this would be? Want to test out on my own laptop before suggesting to work – Nick Sep 04 '22 at 10:54