As you said, you already have a working solution for a different split character in "Split string on single forward slashes with RegExp". That code does not actually split the string; it matches everything except the "/"s and returns each individual match in a collection (so yes, it ends up splitting).
What you need to do here is match each character in str, unless the next characters are either // or and. We can use a negative lookahead for this.
Just replace the pattern in your code with the following:
.Pattern = "(?!$)((?:(?!//|\band\b).)*)(?://|and|$)"
Alternatively, if you want to trim spaces for each token, use the following regex:
.Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"
Although this will also match the // or and, it uses a ( group ) to capture the actual token. Therefore, you have to add the tokens to the collection using .SubMatches(0) (the text captured by the first group).
In your code, instead of coll.Add r_item.Value, use:
coll.Add r_item.SubMatches(0)
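To see the difference between .Value and .SubMatches(0), here is a minimal sketch. The Sub name ShowMatchParts and the sample string are made up for illustration; the pattern is the untrimmed one given above.

```vba
' Hypothetical demo (not part of the original code): prints the whole match
' next to the text captured by the first group.
Sub ShowMatchParts()
    Dim rExp As Object, found As Object, m As Object

    Set rExp = CreateObject("VBScript.RegExp")
    rExp.Global = True
    rExp.Pattern = "(?!$)((?:(?!//|\band\b).)*)(?://|and|$)"

    Set found = rExp.Execute("one//two and three")
    For Each m In found
        ' .Value includes the split string, .SubMatches(0) does not
        Debug.Print "Value = '" & m.Value & "'  SubMatches(0) = '" & m.SubMatches(0) & "'"
    Next m
End Sub
```

For the first match, .Value should be 'one//' while .SubMatches(0) is just 'one', which is why the token has to be read from the submatch.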
Note: if your string has line breaks, don't forget to set .MultiLine = True on the rExp object.
VBA Code:
Sub GetMatches(ByRef str As String, ByRef coll As Collection)
    Dim rExp As Object, rMatch As Object, r_item As Object

    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = True      ' find every token, not just the first one
        .MultiLine = True   ' let $ match at the end of each line
        .Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"
    End With

    Set rMatch = rExp.Execute(str)
    If rMatch.Count > 0 Then
        For Each r_item In rMatch
            ' SubMatches(0) is the token captured by the first group
            coll.Add r_item.SubMatches(0)
        Next r_item
    End If
End Sub
And this is how you can call it with your example:
Sub Example()   ' any Sub name works; the statements just need to live in a procedure
    Dim text As String
    text = "t/xt1.//text2,and landslide/ andy // text3- and text4"

    'vars to get result of RegExp
    Dim matches As Collection, token As Variant
    Set matches = New Collection

    'Exec the RegExp --> Populate matches
    GetMatches text, matches

    'Print each token in debug window
    For Each token In matches
        Debug.Print "'" & token & "'"
    Next token
    Debug.Print "======="
End Sub
Each token is printed in the Immediate Window.
- This code is a modified version of the code originally posted by @stribizhev
Output in Immediate Window:
't/xt1.'
'text2,'
'landslide/ andy'
'text3-'
'text4'
=======
More in-depth explanation
You may wonder how this pattern works. I'll try to explain with a detailed description. And to do that, let's take only the significant parts of the pattern, using the following regex (the rest isn't really important):
((?:(?!//|\band\b).)*)(?://|and|$)
It can easily be divided into two constructs:
- First, the subpattern ((?:(?!//|\band\b).)*) is a group that matches each token, capturing the text we want to return for each match. In VBA, groups are returned with .SubMatches(). Let's break it down:
  - The inner expression (?!//|\band\b). first checks that the current position is not the start of a split string ("//" or "and"). If it's not, the regex engine matches one character (notice the dot at the end). And that's it: it matches one character allowed as part of the token we're capturing.
  - Now, it's enclosed in (?:(?!//|\band\b).)* to repeat it for every character it can match, so we get all the characters in the token. This construct is the closest a regex gets to a while loop: while it's not followed by a split string, get the next char.
  - If you think about it, it's the .* construct we all know, with an extra condition for each character.
- The second subpattern (?://|and|$) is easier: it simply matches a split string ("//", "and" or the end of the line). It's inside a non-capturing group, meaning it will be matched, but its value won't be stored.
For example, matching the pattern against a sample string:
text1 a/s and text2 a/b//last
|--------||-|                    [1]: 1st subpattern, captured in Matches(0).SubMatches(0)
    1      2                     [2]: Split string, not captured but included in the match
|-----------|
      3                          [3]: The whole match, returned by Matches(0)
For the second match, Matches(1).Value = " text2 a/b//" and Matches(1).SubMatches(0) = " text2 a/b".
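The "while loop" reading of the first subpattern can also be spelled out without a regex. The sketch below is only an illustration: ReadToken, StartsSplit and IsWordChar are hypothetical helpers, \b is approximated with ASCII word characters, and line breaks are ignored.

```vba
' Hypothetical helper: True for the characters \b treats as "word" characters (ASCII only)
Function IsWordChar(ByVal c As String) As Boolean
    IsWordChar = (c Like "[A-Za-z0-9_]")
End Function

' Hypothetical helper: True when position i (1-based) starts "//" or the whole word "and"
Function StartsSplit(ByVal s As String, ByVal i As Long) As Boolean
    Dim prevCh As String, nextCh As String
    If Mid$(s, i, 2) = "//" Then
        StartsSplit = True
    ElseIf Mid$(s, i, 3) = "and" Then
        If i > 1 Then prevCh = Mid$(s, i - 1, 1)
        nextCh = Mid$(s, i + 3, 1)   ' empty string past the end of s
        StartsSplit = Not IsWordChar(prevCh) And Not IsWordChar(nextCh)
    End If
End Function

' Plain-loop equivalent of ((?:(?!//|\band\b).)*): reads one token starting at startPos
Function ReadToken(ByVal s As String, ByVal startPos As Long) As String
    Dim i As Long
    i = startPos
    ' While the current position does not start a split string, take one more char
    Do While i <= Len(s)
        If StartsSplit(s, i) Then Exit Do
        i = i + 1
    Loop
    ReadToken = Mid$(s, startPos, i - startPos)
End Function
```

Calling ReadToken("text1 a/s and text2 a/b//last", 1) should return "text1 a/s " (with the trailing space), the same text marked [1] in the diagram.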
The rest of the pattern is simply details:
- (?!$) is there to avoid matching an empty string at the end of the line.
- All the \s* are there to trim the token (to avoid capturing whitespace at the beginning or end of a token).
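One way to check these two details on your own setup is to run the three variants of the pattern side by side. This is a rough sketch; PrintTokens, CompareDetails and the sample string are hypothetical names chosen for the illustration.

```vba
' Hypothetical helper: prints the match count and every captured token for a pattern
Sub PrintTokens(ByVal pat As String, ByVal src As String)
    Dim rExp As Object, found As Object, m As Object

    Set rExp = CreateObject("VBScript.RegExp")
    rExp.Global = True
    rExp.MultiLine = True
    rExp.Pattern = pat

    Set found = rExp.Execute(src)
    Debug.Print found.Count & " match(es) for " & pat
    For Each m In found
        Debug.Print "  '" & m.SubMatches(0) & "'"
    Next m
End Sub

Sub CompareDetails()
    Const SAMPLE As String = "a // b and c"
    ' Without (?!$): per the explanation above, an extra empty match may show up at the end
    PrintTokens "((?:(?!//|\band\b).)*)(?://|and|$)", SAMPLE
    ' With (?!$) but no \s*: tokens keep their surrounding spaces
    PrintTokens "(?!$)((?:(?!//|\band\b).)*)(?://|and|$)", SAMPLE
    ' With (?!$) and \s*: the spaces around each split string are left out of the token
    PrintTokens "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*", SAMPLE
End Sub
```

With the last pattern the tokens should come out as 'a', 'b' and 'c', while the second one keeps the spaces ('a ', ' b ', ' c').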