0

I am using excel 2019 and I am trying to extract from a bunch of messed up text cells any (up to 5) word ending with dot that comes after a ].

This is a sample of the text I am trying to parse/clean `

some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. 

`

I expect to get this: ost. ult. lot. sino. collan.

I am using this Function found somewhere on the internet which appears to do the job: `

Public Function RegExtract(Txt As String, Pattern As String) As String

With CreateObject("vbscript.regexp")
    '.Global = True
    .Pattern = Pattern
    If .test(Txt) Then
        RegExtract = .Execute(Txt)(0)
    Else
        RegExtract = "No match found"
    End If
End With

End Function

`

and I call it from an empty cell: =RegExtract(D2; "([\]])(\s\w+[.]){0,5}")

It's the first time I am using regexp, so I might have done terrible things in the eyes of an expert.

So this is my expression: ([]])(\s\w+[.]){0,5}

Right now it returns only ] ost.

Which is much more than I was expecting to be able to do on my first approach to regex, but:

  1. I am not able to get rid of the first ] which is needed to find the place where my useful bits start inside the text block, since \K does not work in excel. I might "find and replace" it later as a smart barbarian, but I'd like to know the way to do it clean, if any clean way exists :)

2)I don't understand how iterators work to get all my "up to 5 occurrencies": I was expecting that {0,5} after the second group meant exactly: "repeat the previous group again until the end of the text block (or until you manage to do it 5 times)".

Thank you for your time :)

--Added after JdvD accepted answer for the records--

I am using this pattern to get all the words ending with dot, after the FIRST occurrence of the closing bracket.

^.*?\]|(\w+\.\s?)|.

This one (without the question mark) instead gets all the words ending with dot, after the LAST occurrence of the closing bracket.

^.*\]|(\w+\.\s?)|.

I was even missing something in my regExtract function: I needed to store the matches into an array through a for loop and then output this array as a string. I was wrongly assuming that the regex engine was already storing matches as a unique string.

The correct RegExtract function to extract EVERY match is the following:

Public Function RegExtract(Txt As String, Pattern As String) As String

Dim rMatch As Object, arrayMatches(), i As Long

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = Pattern
    If .Test(Txt) Then
        For Each rMatch In .Execute(Txt)
            If Not IsEmpty(rMatch.SubMatches(0)) Then
                ReDim Preserve arrayMatches(i)
                arrayMatches(i) = rMatch.SubMatches(0)
                i = i + 1
            End If
        Next
        RegExtract = Join(arrayMatches, " ")
    Else
        RegExtract = "No match found"
    End If
End With

End Function
Serbiss
  • 3
  • 6

3 Answers3

5

RegexMatch:

In addition to the answer given by @RonRosenfeld one could apply what some refer to as 'The Best Regex Trick Ever' which would imply to first match what you don't want and then match what you do want in a capture group. For example:

^.*\]|(\w+\.)

See an online demo where in short this means:

  • ^.*\] - Match 0+ (Greedy) characters from the start of the string upto the last occurence of closing square brackets;
  • | - Or;
  • (\w+\.) - Capture group holding 1+ (Greedy) word-characters ending with a dot.

Here is how it could work in an UDF:

Sub Test()

Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "

Debug.Print RegExtract(s, "^.*\]|(\w+\.)")

End Sub

'------

'The above Sub would invoke the below function as an example.
'But you could also invoke this through: `=RegExtract(A1,"^.*\]|(\w+\.)")`
'on your sheet.

'------

Public Function RegExtract(Txt As String, Pattern As String) As String

Dim rMatch As Object, arrayMatches(), i As Long

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = Pattern
    If .Test(Txt) Then
        For Each rMatch In .Execute(Txt)
            If Not IsEmpty(rMatch.SubMatches(0)) Then
                ReDim Preserve arrayMatches(i)
                arrayMatches(i) = rMatch.SubMatches(0)
                i = i + 1
            End If
        Next
        RegExtract = Join(arrayMatches, " ")
    Else
        RegExtract = "No match found"
    End If
End With

End Function

RegexReplace:

Depending on your desired output one could also use a replace function. You'd have to match any remaining character with another alternative for that. For example:

^.*\]|(\w+\.\s?)|.

See an online demo where in short this means that we added another alternative which is simply any single character. A 2nd small addition is that we added the option of an optional space character \s? in the 2nd alternative.

Sub Test()

Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "

Debug.Print RegReplace(s, "^.*\]|(\w+\.\s?)|.", "$1")

End Sub

'------

'There are now 3 parameters to parse to the UDF; String, Pattern and Replacement.

'------

Public Function RegReplace(Txt As String, Pattern As String, Replacement) As String

Dim rMatch As Object, arrayMatches(), i As Long

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = Pattern
    RegReplace = Trim(.Replace(Txt, Replacement))
End With

End Function

Note that I used Trim() to remove possible trailing spaces.


Both RegexMatch and RegexReplace would currently return a single string to clean the input but the former does give you the option to deal with the array in the arrayMatches() variable.

JvdV
  • 70,606
  • 8
  • 39
  • 70
  • Thank you a lot for your exahustive answer, and neat trick! I can tell that the function works as expected, but I noticed that in the blobs of text to clean up I sometimes have other square brackets. Is it possible to fix the expression to negate everything before the first occurrence of ] (not the last as it works now) ? Thank you! – Serbiss Nov 10 '22 at 14:31
  • Silly me! I just needed to remove .* from your negation pattern! :) – Serbiss Nov 10 '22 at 14:41
  • @Serbiss I think you should change `^.*\]` into a lazy pattern using `^.*?\]` – JvdV Nov 10 '22 at 14:53
  • do you think there is some advange using one solution (^\]) over the other ( ^.*?\] ) ? Both appear to work – Serbiss Nov 10 '22 at 15:06
  • If you show me the exact pattern you used I could. Because simply `^\]` is only matching a closing bracket directly at the beginning of a string. – JvdV Nov 10 '22 at 15:09
  • @JcdV of course, it's your pattern :) `"^\]|(\w+\.\s?)|."` VS `"^.*?\]|(\w+\.\s?)|."` – Serbiss Nov 10 '22 at 15:41
  • Yes, use the 2nd one. The 1st one won't work for what you need @Serbiss. It may work on one string but gives you false result on another – JvdV Nov 10 '22 at 16:05
  • 1
    This is not the method I couldn't recall, but it is much simpler than that method. Thanks for posting. – Ron Rosenfeld Nov 10 '22 at 16:36
  • @RonRosenfeld, your welcome. Makes me curious as to what you've seen before though. – JvdV Nov 10 '22 at 18:00
3

There is a method to return all the matches in a string starting after a certain pattern. But I can't recall it at this time.

In the meantime, it seems the simplest would be to remove everything prior to the first ], and then apply Regex to the remainder.

For example:

Option Explicit
Sub findit()
  Const str As String = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan."
  Dim RE As RegExp, MC As MatchCollection, M As Match
  Dim S As String
  Dim sOutput As String
  
S = Mid(str, InStr(str, "]"))

Set RE = New RegExp
With RE
    .Pattern = "\w+(?=\.)"
    .Global = True
    If .Test(S) = True Then
        Set MC = .Execute(S)
        For Each M In MC
            sOutput = sOutput & vbLf & M
        Next M
    End If
End With


MsgBox Mid(sOutput, 2)

End Sub

You could certainly limit the number of matches to 5 by using a counter instead of the For each loop

enter image description here

Ron Rosenfeld
  • 53,870
  • 7
  • 28
  • 60
  • Thank you for your answer, I didn't thought about a for loop to store all the matches! I marked JvdV as best answer because it covers everything. – Serbiss Nov 10 '22 at 14:44
1

You can use the following regex

([a-zA-Z]+)\.

Let me explain a little bit.

[a-zA-Z] - this looks for anything that contain any letter from a to z and A to Z, but it only matches the first letter.

\+ - with this you are telling that matches all the letters until it finds something that is not a letter from a to z and A to Z

\. - with this you are just looking for the . at the end of the match

Here the example.

Wai Ha Lee
  • 8,598
  • 83
  • 57
  • 92
RemDosal
  • 23
  • 4
  • Thanks for the answer, this is indeed another way to do it (but I still need to find the first closing square bracket) – Serbiss Nov 10 '22 at 14:37