
My text is:

999 blaw blaw blaw1 999 blaw blaw blaw

And I want to extract:

blaw blaw blaw1

Now, I could do this using:

([0-9][0-9][0-9] )(.*?)( [0-9][0-9][0-9])

The problem is that I can't use the lazy quantifier ".*?" in the tool I'm using. Replacing (.*?) with ([^0-90-90-9]*) would have worked if the text I want did not itself contain a digit (the 1 in blaw1)!

Any suggestions? I'm using Stata, in case that is relevant.

Alan Moore
Arash

2 Answers


Based on the comment by hwnd:

clear
set more off

*----- example data -----

input str60 text
"999 blaw blaw blaw1 999 blaw blaw blaw"
end

list

*----- what you want -----

gen extract = regexs(2) if regexm(text, "(^[0-9][0-9][0-9] )(.+)( [0-9][0-9][0-9])")

list

Alternatively, without fixing the number of digits in the delimiters:

... regexm(text, "(^[0-9]+ )(.+)( [0-9]+)")
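As a quick check of the two patterns above, here is a minimal sketch that runs both on a few strings. Only the first row comes from the question; the other rows and the variable names extract3 and extractn are made up for illustration. Because (.+) is greedy, a string with more than one space-delimited number after the first should have the capture run up to the last such number, not the nearest one.

```stata
clear
set more off

* Hypothetical test rows; only the first one comes from the question
input str60 text
"999 blaw blaw blaw1 999 blaw blaw blaw"
"12 foo bar2 3456 baz"
"999 a1 999 b2 999 c3"
end

* Fixed-width delimiters: exactly three digits on each side
gen extract3 = regexs(2) if regexm(text, "(^[0-9][0-9][0-9] )(.+)( [0-9][0-9][0-9])")

* Variable-width delimiters: one or more digits on each side
gen extractn = regexs(2) if regexm(text, "(^[0-9]+ )(.+)( [0-9]+)")

* Compare the two captures side by side
list text extract3 extractn
```

If the text between the delimiters can itself contain a space followed by digits, neither pattern emulates the lazy ".*?" from the question, so it is worth inspecting the listing on real data before relying on it.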

From help regex:

Regular expression syntax is based on Henry Spencer's NFA algorithm, and this is nearly identical to the POSIX.2 standard. [arguments] may not contain binary 0 (\0).
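The limits quoted above apply to regexm() and regexs(). As an assumption that goes beyond this answer: Stata 14 and later also provide the Unicode functions ustrregexm() and ustrregexs(), which are built on a different regex engine (ICU) that accepts lazy quantifiers and {n} repetition. A minimal sketch, assuming such a version is available (the variable name extract_u is made up for illustration):

```stata
* Assumes Stata 14 or later; ustrregexm()/ustrregexs() do not exist in older versions
clear
set more off

input str60 text
"999 blaw blaw blaw1 999 blaw blaw blaw"
end

* Lazy (.*?) stops at the first following " ddd" delimiter
gen extract_u = ustrregexs(1) if ustrregexm(text, "^[0-9]{3} (.*?) [0-9]{3}")

list
```

On older versions, the regexm() approach shown earlier in this answer remains the one to use.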

Other references are:

http://www.stata.com/support/faqs/data-management/regular-expressions/

http://www.ats.ucla.edu/stat/stata/faq/regex.htm

Roberto Ferrer
  • @Unihedron What is that supposed to mean? Please be explicit when commenting on answers. – Roberto Ferrer Sep 30 '14 at 15:22
  • If you are suggesting something like `[0-9]{3}` it won't work. Stata's regular expression feature is very basic. My answer already contains a complete example that actually works. – Roberto Ferrer Sep 30 '14 at 15:26

Try the following: (?<([a-z]*[0-9]? )*). I am not familiar with Stata, but this works in the JavaScript implementation of regex.

Updated to consider backtracking.

sgp667
  • Apparently nothing works in Stata. I tried your code in Notepad++ and it works nicely, but not in Stata. – Arash Sep 30 '14 at 01:33
  • What if you just use ` ([a-z]*[0-9]? )*`? I realized that the first part is redundant. – sgp667 Sep 30 '14 at 01:34
  • Actually, it should be + and not *; otherwise this will match a single digit between spaces. So the correct version is ([a-z]+[0-9]? )* – sgp667 Sep 30 '14 at 01:39
  • @ArashFarahani If I were you, I'd look in Stata's documentation to find out which regex standard Stata follows. I always get into trouble when testing regexes for grep on the web, because the online testers use the JavaScript standard. – sgp667 Sep 30 '14 at 14:26
  • @Unihedron I don't understand why you had to downvote my answer when Arash is not planning to hack himself, and he commented himself that the regex performed well on his data set. – sgp667 Sep 30 '14 at 21:19
  • To take Unihedron's comment into account, you can enclose my regex in (?>regex); one of his links says that this is a remedy for backtracking, if that becomes a problem. – sgp667 Sep 30 '14 at 21:59