10

I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.

I can write a script to convert html to plain text in PHP so if you can't think of a solution in VBA then maybe you can sugest how I might pass the cells data to a website and retrieve the data back.

Mark
  • 943
  • 3
  • 10
  • 11

6 Answers6

18

Set a reference to "Microsoft HTML object library".

Function HtmlToText(sHTML) As String
  Dim oDoc As HTMLDocument
  Set oDoc = New HTMLDocument
  oDoc.body.innerHTML = sHTML
  HtmlToText = oDoc.body.innerText
End Function

Tim

Tim Williams
  • 154,628
  • 8
  • 97
  • 125
  • 1
    This works well, but be aware that white space is collapsed. For example `
    this[space][space][space]is
    a[space]test
    ` comes out as `this[space]isa[space]test`. (sorry about the formatting; the extra spaces aren't coming out when I just type them.)
    – Cheran Shunmugavel Mar 17 '11 at 06:04
  • Collapsing whitespace would be "expected behaviour" here I think (unless the element text has been styled using css to preserve whitespace) – Tim Williams Oct 02 '16 at 17:00
5

A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.

Function StripTags(ByVal html As String) As String
    Dim text As String
    Dim accumulating As Boolean
    Dim n As Integer
    Dim c As String

    text = ""
    accumulating = True

    n = 1
    Do While n <= Len(html)

        c = Mid(html, n, 1)
        If c = "<" Then
            accumulating = False
        ElseIf c = ">" Then
            accumulating = True
        Else
            If accumulating Then
                text = text & c
            End If
        End If

        n = n + 1
    Loop

    StripTags = text
End Function

This can leave lots of extraneous whitespace, but it will help in removing the tags.

Todd
  • 5,999
  • 2
  • 21
  • 21
  • 1
    Be careful with this approach. If the input text includes < or > characters for a reason other than tags it will get confused. – Ben Jan 31 '20 at 03:47
5

Tim's solution was great, worked liked a charm.

I´d like to contribute: Use this code to add the "Microsoft HTML Object Library" in runtime:

Set ID = ThisWorkbook.VBProject.References
ID.AddFromGuid "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}", 2, 5

It worked on Windows XP and Windows 7.

cbaldan
  • 522
  • 8
  • 16
2

Tim's answer is excellent. However, a minor adjustment can be added to avoid one foreseeable error response.

 Function HtmlToText(sHTML) As String
      Dim oDoc As HTMLDocument

      If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
        End-If

      Set oDoc = New HTMLDocument
      oDoc.body.innerHTML = sHTML
      HtmlToText = oDoc.body.innerText
    End Function
Gardoglee
  • 61
  • 2
2

Here's a variation of Tim's and Gardoglee's solution that does not require setting a reference to "Microsoft HTML object library". This method is known as Late Binding and will also work in vbscript.

Function HtmlToText(sHTML) As String

    Dim oDoc As Object ' As HTMLDocument

    If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
    End If

    Set oDoc = CreateObject("HTMLFILE")
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText

End Function

Note that if you are using VBA in Access 2007 or greater, there is an Application.PlainText() method built-in that does the same thing as the code above.

Ben
  • 1,168
  • 13
  • 45
1

Yes! I managed to solve my problem as well. Thanks everybody/

In my case, I had this sort of input:

<p>Lorem ipsum dolor sit amet.</p>

<p>Ut enim ad minim veniam.</p>

<p>Duis aute irure dolor in reprehenderit.</p>

And I did not want the result to be all jammed together without breaklines.

So I first splitted my input for every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of html (very sweet answer btw).

In addition I concatenated each cleaned 'paragraph' with this breakline character Crh(10) for VBA/Excel.

The final code is:

Public Function HtmlToText(ByVal sHTML As String) As String
    Dim oDoc As HTMLDocument
    Dim result As String
    Dim paragraphs() As String

    If IsNull(sHTML) Then
      HtmlToText = ""
      Exit Function
    End If

    result = ""
    paragraphs = Split(sHTML, "<p>")

    For Each paragraph In paragraphs
        Set oDoc = New HTMLDocument
        oDoc.body.innerHTML = paragraph
        result = result & Chr(10) & Chr(10) & oDoc.body.innerText
    Next paragraph

    HtmlToText = result
End Function

ofundefined
  • 2,692
  • 2
  • 18
  • 35