0

Hi all, like many I am transforming my covid time to coding. Coviding? :)

I need to extract substrings from a string with VBA in Excel, and would appreciate suggestions on the solutions available. I thought regex would be the way to go, but am actually quite uncertain as I am fairly unfamiliar with regex and this is perhaps to complicated for it. Perhaps there is some much more simple solution that I am unaware of, any suggestions very much appreciated.

The strings originate in the english language files of the open source mybb forum. I am creating an excel workbook to assist translators, which I will donate to their community.

What is needed is the html removed, and each substring placed in adjacent cells. The pattern is:

  • before <
  • between > and <
  • text after >

Examples of the strings are (without first and last quotation mark):

Example 1:

"You are currently viewing a stripped down version of our content. <a href=\"{1}\">View the full version</a> with proper formatting."
  • string 1 = "You are currently viewing a stripped down version of our content. "
  • string 2 = "View the full version"
  • string 3 = " with proper formatting."

Example 2:

"<b>Private</b> Only you will be able to view this event. (Registered Users Only)."
  • string 1 = "Private"
  • string 2 = " Only you will be able to view this event. (Registered Users Only)."

Example 3:

 " This day does not have any events associated with it.<p><a href=\'calendar.php?action=addevent&amp;calendar={1}&amp;day={2}&amp;month={3}&amp;year={4}\'>Post an Event</a>.</p>"
  • string 1 = "This day does not have any events associated with it."
  • string 2 = "Post an Event"
  • string 3 = "."

Example 4: (This example is the largest one I have seen)

"<p><br />[list]<br />[*]List Item #1<br />[*]List Item #2<br />[*]List Item #3<br />[/list]<br /><ul><li>List item #1</li><li>List item #2</li><li>List Item #3</li>"
  • string 1 = "[list]"
  • string 2 = "[*]List Item #1"
  • string 3 = "[*]List Item #2"
  • string 4 = "[*]List Item #3"
  • string 5 = "[/list]"
  • string 6 = "List item #1"
  • string 7 = "List item #2"
  • string 8 = "List item #3"

Any suggestions would be greatly appreciated.

Edit: Adding a few more samples

<span title=\"{1}\">Today</span>

<span title=\"{1}\">Yesterday</span>

<span title=\"{5}{6}\">{1}{2} {3} {4}</span>

You are currently using <strong>{1}</strong>.

<br /><br />You are encouraged to register; once you register you will be able to post messages, set your own preferences, and maintain a profile.

<br /><br />Some of the features that generally require registration are subscriptions, changing of styles, accessing of your Personal Notepad and emailing forum members.

<br /><br />Cookies are small text documents stored on your computer; the cookies set by this forum can only be used on this website and pose no security risk.

<br /><br />Cookies on this forum also track the specific topics you have read and when you last read them.

<p><br />[url]http://www.example.com/[/url]<br />&nbsp;&nbsp;&nbsp;<a href=\"http://www.example.com/\">http://www.example.com/</a>

<p>[url=http://www.example.com/]Example.com[/url]<br />&nbsp;&nbsp;&nbsp;<a href=\"http://www.example.com/\">Example.com</a>

<p>[email]example@example.com[/email]<br />&nbsp;&nbsp;&nbsp;<a href=\"mailto:example@example.com\">example@example.com</a>
  • 1
    you don't need regex if you are just after the .text content. Use html parser and extract the document text. Is there a source url? Does rely a little on quality of the html as vba html parser implementation is not terribly forgiving. But regex would also suffer and is usually a poor choice for handling html. – QHarr Jun 20 '20 at 23:54
  • I need each substring placed in adjacent cells so that the translation system I am designing will properly work. Final operation will replace that string with the editors translation from the original string. The files are available in the inc\languages\english of mybb: https://mybb.com/download/. I have already stripped away everything prior to the equal sign in those files and categorized, so strings in example above are more what I am going for than the text in the original files. – Karl Kristjansson Jun 21 '20 at 00:08
  • You have got me on track, Thank you! I found Todds reply here which does the trick. https://stackoverflow.com/questions/5327512/convert-html-to-plain-text-in-vba – Karl Kristjansson Jun 21 '20 at 01:34
  • Don't forget you probably should capture the start position for each extracted section in the original text - you can't just replace (eg) strings in the original with their translation if there's a chance that same text might be part of a non-extracted section. Eg "span", "table" etc – Tim Williams Jun 21 '20 at 04:42

2 Answers2

2

Assuming your source strings are in column A:

Sub Demo()
Dim i As Long, r As Long, c As Long, StrIn As String, StrOut As String
With ActiveSheet
  For r = 1 To .UsedRange.SpecialCells(xlCellTypeLastCell).Row
    StrIn = ActiveSheet.Range("A" & r).Text: c = 1
    For i = 0 To UBound(Split(StrIn, ">"))
      If Split(StrIn, ">")(i) <> "" Then
        If Split(Split(StrIn, ">")(i), "<")(0) <> "" Then
          c = c + 1
          .Cells(r, c).Value = Split(Split(StrIn, ">")(i), "<")(0)
        End If
      End If
    Next
  Next
End With
End Sub
macropod
  • 12,757
  • 2
  • 9
  • 21
  • FYI added an alternative approach to your valid solution in order to demonstrate some other array handling methods :+) – T.M. Jun 21 '20 at 14:36
0

This approach demonstrates several steps to handle an array, including the newer FilterXML() function (available since 2013+):

Function RemoveHTML()

Executes the following steps:

  • a) mark html tags by unused character, e.g. "$" via Split() function
  • b) remove html tags via Filter() function
  • c) remove empty array items via FilterXML() function - (available in versions 2013+)
  • d) return a "flat" 1-dim array as function result via Application.Transpose()
Function RemoveHTML(ByVal mystring)
    Dim items
'a) mark html tags by unused character, e.g. "$"
    items = Split(Replace(mystring, ">", "$<"), "<") ' mark html tags by $
'b) remove html tags via Filter()
    items = Filter(items, "$", False)                           ' remove items marked by $
'c) remove empty array items via FilterXML()
    items = WorksheetFunction.FilterXML("<t><s>" & Join(items, "</s><s>") & "</s></t>", "//s[not(.='')]")
'd) return "flat" 1-dim array as function result
    RemoveHTML = Application.Transpose(items)                   ' return "flat" 1-dim array
End Function

Example call

Assumes start of data in cell A2 and results in adjacent columns:

Sub ExampleCall()
With Sheet1                                ' the project's sheet Code(Name), e.g. Sheet1
'[0]define data range
    Dim rng As Range
    Set rng = .Range("A2:A" & .Range("A" & .Rows.Count).End(xlUp).Row)
'[1]assign data in column A to variant 2-dim array
    Dim data: data = rng
'[2]loop through strings
    Dim i As Long
    For i = 1 To UBound(data)
    'a) remove html tags
        Dim items: items = RemoveHTML(data(i, 1))    ' << help function RemoveHTML()
    'b) write results to adjacent columns
        .Range("B1").Offset(i).Resize(Columnsize:=UBound(items)) = items
    Next
End With
End Sub

T.M.
  • 9,436
  • 3
  • 33
  • 57
  • I tested your code, it breaks quite a bit. I added a few more samples including some of the ones that broke with your code in case you'd like to check it. – Karl Kristjansson Jun 22 '20 at 02:26
  • For me it seems to work - which examples do show differring results? - @KarlKristjansson – T.M. Jun 22 '20 at 16:07
  • I tested again with the new examples in the bottom of my question, it broke on the first 4 and then I stopped testing. First 3 broke on B in ExampleCall and example 4 broke on C in the function. I am using Excel v. 16.0.12827.20328. – Karl Kristjansson Jun 22 '20 at 16:48
  • Do you get error messages (which line?) or just differring results? (MS 365/ Excel v. 16.0.12827.20336). -@KarlKristjansson – T.M. Jun 22 '20 at 17:43
  • I tested all the lines in the "more samples", breaks every time in the function on C with error ==> Run-time error '1004: Unable to get the FilterXML property of the WorksheetFunction Class – Karl Kristjansson Jun 23 '20 at 02:51
  • As I use MS 365 with its dynamic array features, it could be (?) that you can't assign the function result to an array by means of VBA. Are you interested in a workaround, though? - @KarlKristjansson – T.M. Jun 23 '20 at 18:44