Split string on double forward slashes and/or on specific word with RegExp

Question

I'm trying to split text by double forward slashes and/or by a specific string such as "and".

Example A:
text1 a/s // text2 a/b
text1 a/s and text2 a/b


Example B:
text1. // text2,// text3-
text1. and text2, and text3-

Example A returns two matches: text1 a/s text2 a/b
Example B returns three matches: text1. text2, text3-

I got very useful tip how to split by single forward slashes: Split string on single forward slashes with RegExp, but trying to find a solution to exclude two forward slashes or a string, proved to be too challenging.

Bonus points if it's possible to combine one solution for both examples combined:

Example C:
text1 a/s // text2, and text3-

I would appreciate only RegExp solutions compatible with VBA.

Please specify if the spaces around `//` and `and` are included in the tokens, or if it's supposed to trim it — Mariano, Sep 17 '15 at 20:19
I was hoping to get solution that it would split with or without spaces. Split and replace is a possible solution, but it is very fixed. Must be with spaces surrounded, otherwise it might split middle of the word or two slashes out of three if someone mistypes. — Trm, Sep 17 '15 at 21:02

score 2 · Accepted Answer · edited May 23 '17 at 12:06

As you said, you already have a working solution for a different split char in Split string on single forward slashes with RegExp. That code does not actually split the string, but it matches everything except "/"s. It then returns the result of each individual match in a collection (yes, it ends up splitting).

What you need to do here is match each character in str, unless the next characters are either // or and. We can use a lookahead for this.

Just change the pattern in your code with the following:

.Pattern = "(?!$)((?:(?!//|\band\b).)*)(?://|and|$)"

Alternatively, if you want to trim spaces for each token, use the following regex:

.Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"

Although this will also match the // or and, it's using a ( group ) to capture the actual token. Therefore, you have to add the tokens to the collection using .SubMatches(0) (what was backreferenced by the first group).

In your code, instead of adding coll.Add r_item.Value, use:

coll.Add r_item.SubMatches(0)

Note: if your string has line breaks, don't forget to set the rExp object with .Multiline = True.

VBA Code:

Sub GetMatches(ByRef str As String, ByRef coll As Collection)

    Dim rExp As Object, rMatch As Object

    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = True
        .MultiLine = True
        .Pattern = "(?!$)((?:(?!\s*//|\s*\band\b).)*)\s*(?://|and|$)\s*"
    End With

    Set rMatch = rExp.Execute(str)
    If rMatch.Count > 0 Then
        For Each r_item In rMatch
            coll.Add r_item.subMatches(0)
        Next r_item
    End If
End Sub

And this is how you can call it with your example:

Dim text As String
text = "t/xt1.//text2,and landslide/ andy  // text3-  and  text4"

'vars to get result of RegExp
Dim matches As New Collection, token
Set matches = New Collection

'Exec the RegExp --> Populate matches
GetMatches text, matches

'Print each token in debug window
For Each token In matches
    Debug.Print "'" & token & "'"
Next token
Debug.Print "======="

Each token is printed in the Immediate Window.

This code is a modified version of the code originally posted by @stribizhev

Output in Immediate Window:

't/xt1.'
'text2,'
'landslide/ andy'
'text3-'
'text4'
=======

More in-depth explanation

You may wonder how this pattern works. I'll try to explain with a detailed description. And to do that, let's take only the significant parts of the pattern, using the following regex (the rest isn't really important):

((?:(?!//|\band\b).)*)(?://|and|$)

It can easily be divided in two constructs:

First, the subpattern ((?:(?!//|\band\b).)*) is a group that matches each token, backreferencing the text we want to return for each match. In vba, groups are returned with .SubMatches(). Let's brake it down:
- The inner expression (?!//|\band\b). first checks to guarantee it's not followed by a split string ("//" or "and"). If it's not, the regex engine matches one character (notice the dot at the end). And that's it, it matches one character allowed as part of the token we're capturing.
- Now, it's enclosed in (?:(?!//|\band\b).)* to repeat it for every char it can match, we get all the characters in the token. This construct is the closest it can get to a while loop.
  
  While it's not followed by a split string, get next char.
- If you think about it, it's the construct .* we all know, with an extra condition for each character.
The second subpattern (?://|and|$) is easier, simply match a split string ("//", "and" or the end of line). It's inside a non-capturing group, meaning it will be matched, but it won't store a copy of its value.

For example:

text1 a/s and text2 a/b//last
^        ^| |               [1]: 1st subpattern, captured in Matches(0).SubMatches(0)
|--------|^-^
|   1      2|               [2]: Split string, not captured but included in match
|-----------|
      3                     [3]: The whole match, returned by Matches(0)


For the second match, Matches(1).Value = " text2 a/b//"
                      Matches(1).Submatches(0) = " text2 a/b"

The rest of the pattern are simply details:

(?!$) is to avoid matching an empty string at the end of the line.
All the \s* are there to trim the token (to avoid capturing whitespaces at the beggining or end of a token).

You are amazing! That's exactly what i needed and logic I was trying to achieve. If I understand it correctly, we only add first group that is matched? Did not know it was possible and it is very good learning experience! Even though I can't say I understand how it works even after looking at regexp references for the past week. — Trm, Sep 18 '15 at 07:37
I tried to explore the pattern. Here are my comments, perhaps you can help me clarify how it works. http://collabedit.com/ws2wj Thanks again for all the support — Trm, Sep 18 '15 at 08:22
New edit with "More in-depth explanation". I think it covers your comments. Let me know if it helps — Mariano, Sep 18 '15 at 08:39
Great explanation. I would up vote hundred times if i could. To clarify particular part: '(?:(?!//|\band\b).)*)' Shouldn't each character be matched but not captured in memory because of "?:". If I remove it from equation, what affect does it have? I can see changes in https://regex101.com/r/mS5pC6/1 but don't know why — Trm, Sep 18 '15 at 10:33
`(?:group)` does not forbid a capture, think of it as a normal group. It's the other way around, a `(capturing group)` stores what has been matched. In that particular case, it allows the `*` quantifier to be applied to what's inside: `(?!//|\band\b).` is repeated 0 or more times. If you convert it to a capturing group, you will have the last character matched returned by `.SubMatches(1)` (or **2.** in [regex101](https://regex101.com/r/mS5pC6/1)) — Mariano, Sep 18 '15 at 10:44

ManishChristian · Answer 2 · 2015-09-17T15:23:51.540

1

Or simplest way is:

Text = "text1 a/s // text2, and text3-"
text = Replace(text, " // ", vbNewLine)
text = Replace(text, " and ", vbNewLine)

arr = Split(text, vbNewLine)

For Each field In arr
  WScript.Echo Trim(field) 'Using Trim you can remove the spaces around
Next

You will get:

text1 a/s
text2,
text3-

edited Sep 17 '15 at 15:23

answered Sep 17 '15 at 15:19

ManishChristian

3,759
3
22
50

2

you would want to put a space before and after `"and"` so that it does not split a word like `landslide` – Scott Craner Sep 17 '15 at 15:22
The solution is not RegExp based and will be of no practical value for OP. – Wiktor Stribiżew Sep 17 '15 at 15:54
1

I know that, I just gave this answer as an alternative. – ManishChristian Sep 17 '15 at 17:12

Split string on double forward slashes and/or on specific word with RegExp

2 Answers2

VBA Code:

Output in Immediate Window:

More in-depth explanation