How do you implement non-greedy matching in Stata using regex? Or does Stata even have this capability?

I want to extract all text that occurs between a hashtag "#" and a period ".".

Example code:

clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#(.*)\.")
list

But in Stata (v. 13.1), the non-greedy quantifier does not seem to work, as in #(.*?)\. Thus, the code above gives this:

+--------------------------------------------------+
|                          var1               var2 |
|--------------------------------------------------|
| anything#aaabbbccc.dddeee.fff   aaabbbccc.dddeee |
|     anything#aaabbbccc.dddeee          aaabbbccc |
|           anything#aaabbbccc.          aaabbbccc |
+--------------------------------------------------+

But what I want is this:

+--------------------------------------------------+
|                          var1               var2 |
|--------------------------------------------------|
| anything#aaabbbccc.dddeee.fff          aaabbbccc |
|     anything#aaabbbccc.dddeee          aaabbbccc |
|           anything#aaabbbccc.          aaabbbccc |
+--------------------------------------------------+
Nick Cox
user812783765

2 Answers

One alternative to #(.*?)\. is to match only non-dot characters after the hash sign, so that the match stops at the first period. That is, use this pattern:

#([^.]*)

Try this code:

clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#([^.]*)")
list
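
Two variations, offered only as a sketch. First, note that #([^.]*) does not require a period to follow, so a value such as "anything#aaa" would still return "aaa"; appending \. makes the period mandatory, as in the original pattern. Second, assuming Stata 14 or later is available (the question mentions 13.1, where these functions do not exist), the Unicode regex functions ustrregexm() and ustrregexs() accept the non-greedy quantifier directly. The variable names var2_strict and var2_lazy are made up for illustration.

```stata
clear
set obs 3
generate var1 = "anything#aaabbbccc.dddeee.fff" in 1
replace var1 = "anything#aaabbbccc.dddeee" in 2
replace var1 = "anything#aaabbbccc." in 3

* Same negated character class, but a literal period must follow the match
generate var2_strict = regexs(1) if regexm(var1, "#([^.]*)\.")

* Assumption: Stata 14 or later, where ustrregexm() supports lazy quantifiers
generate var2_lazy = ustrregexs(1) if ustrregexm(var1, "#(.*?)\.")

list
```

For the three example values, both new variables should contain aaabbbccc.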

Tim Biegeleisen

Once many programmers have learned about regular expressions, they are reluctant to look elsewhere in string management, and with good reason.

This is just to point out that for the problem given, and many others too, there is a pedestrian alternative:

clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#([^.]*)")

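* position of the first character after "#"
gen where1 = strpos(var1, "#") + 1
* position of the first "."
gen where2 = strpos(var1, ".")
* extract the where2 - where1 characters that lie between them
gen var3 = substr(var1, where1, where2 - where1)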

list


     +-------------------------------------------------------------------------+
     |                          var1        var2   where1   where2        var3 |
     |-------------------------------------------------------------------------|
  1. | anything#aaabbbccc.dddeee.fff   aaabbbccc       10       19   aaabbbccc |
  2. |     anything#aaabbbccc.dddeee   aaabbbccc       10       19   aaabbbccc |
  3. |           anything#aaabbbccc.   aaabbbccc       10       19   aaabbbccc |
     +-------------------------------------------------------------------------+

Find the positions of the start and end of the substring you want, and extract what lies between. This is resolutely lacking in style, but sometimes gets you there faster. Always remember to account for programmer time in working out the regular expression you need.
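
One caveat with the positions approach: strpos(var1, ".") finds the first period anywhere in the string, so it assumes no period occurs before the "#". A hedged variant, in case that assumption does not hold in your data, is to cut off everything up to the "#" first and then look for the period in what remains. The variable names after and var4 are made up for illustration.

```stata
clear
set obs 3
generate var1 = "anything#aaabbbccc.dddeee.fff" in 1
replace var1 = "anything#aaabbbccc.dddeee" in 2
replace var1 = "anything#aaabbbccc." in 3

* everything after the first "#" (left empty when there is no "#")
gen after = substr(var1, strpos(var1, "#") + 1, .) if strpos(var1, "#")

* text up to, but not including, the first "." that follows the "#"
gen var4 = substr(after, 1, strpos(after, ".") - 1) if strpos(after, ".")

list var1 var4
```

For the three example values this gives the same result as var3; it differs only for values such as "any.thing#aaa.bbb", where a period precedes the hash sign.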

Nick Cox