How to match the nth occurrence in a string using a regular expression

set test {stackoverflowa is a best solution finding site stackoverflowb is a best solution finding site stackoverflowc is a best solution finding sitestackoverflowd is a best solution finding sitestackoverflowe is a best solution finding site}

regexp -all {stackoverflow} $test 

The command above gives 5 as its output.

regexp {stackoverflow} $test 

The command above matches only the first occurrence of stackoverflow (i.e. the one in stackoverflowa).

My requirement is to match the 5th occurrence of stackoverflow (i.e. the one in stackoverflowe) in the string given above.

Could someone please help with this? Thanks.

Then another one question

velpandian
  • I believe you want to match "the entire word that contains the 5th occurrence of stackoverflow", yes? – brandonscript Jan 23 '14 at 17:39
  • The 5th occurrence is the last occurrence. If they aren't the same thing, then you need something like `/(.*?(stackoverflow\S*)){5}/`, with the answer in capture group 2. I don't know if syntax like this is supported in Tcl. –  Jan 23 '14 at 18:09
  • The syntax is supported, but the result is a little unintuitive: the correct match gets stored in the _second_ match variable. – Peter Lewerin Jan 23 '14 at 18:34
  • @sln: for completeness and clarity: the invocation is `regexp {(.*?(stackoverflow\S*)){5}} $test m sm ssm`. This puts the fifth occurrence of "stackoverflow" (leaving out the "e") in the subsubmatch variable (`ssm` in the invocation). – Peter Lewerin Jan 24 '14 at 06:36

1 Answer

Try

set results [regexp -inline -all {stackoverflow.} $test]
# => stackoverflowa stackoverflowb stackoverflowc stackoverflowd stackoverflowe
puts [lindex $results 4]

I'll be back to explain this further shortly, making pancakes right now.

So.

The command returns a list (-inline) of all (-all) substrings of the string in test that match "stackoverflow" (without the quotes) followed by one more character, which can be any character. This list is stored in the variable results, and indexing it with 4 (indexing is zero-based) retrieves the fifth element, which in this case is printed.

The dot at the end of the expression wasn't in your expression: I added it to check that I really did get the right match. You can of course omit the dot to match "stackoverflow" exactly.
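
As a minimal sketch of the same idea, the lookup can be wrapped in a small helper. The name nthMatch is hypothetical (it is not part of Tcl or of the code above), and the sketch assumes the pattern contains no capture groups, since those would add extra elements to the list:

```tcl
set test {stackoverflowa is a best solution finding site stackoverflowb is a best solution finding site stackoverflowc is a best solution finding sitestackoverflowd is a best solution finding sitestackoverflowe is a best solution finding site}

# Hypothetical helper: return the n-th (1-based) match of pattern in str,
# or an empty string if there are fewer than n matches.
# Assumes the pattern has no capture groups.
proc nthMatch {pattern str n} {
    # -all -inline returns every match as a flat list
    set matches [regexp -all -inline -- $pattern $str]
    # lindex yields an empty string for an out-of-range index
    return [lindex $matches [expr {$n - 1}]]
}

puts [nthMatch {stackoverflow.} $test 5]   ;# stackoverflowe
puts [nthMatch {stackoverflow} $test 5]    ;# stackoverflow
```

Asking for a match that does not exist (for example the sixth) gives an empty string rather than an error, so the caller may want to check for that.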

ETA (from Donal's comment): in many cases it's convenient to extract not the string itself, but its position and extent within the searched string. The -indices option gives you that (I'm not using the dot in the expression now: the index list makes it obvious which one of the "stackoverflow"s I'm getting anyway):

set indices [regexp -inline -all -indices {stackoverflow} $test]
# => {0 12} {47 59} {94 106} {140 152} {186 198}

You can then use string range to get the string match:

puts [string range $test {*}[lindex $indices 4]]

The lindex $indices 4 gives me the list 186 198; the {*} prefix makes the two elements in that list appear as two separate arguments in the invocation of string range.
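
The index-based variant could be sketched like this. The helper name nthMatchRange is hypothetical, the pattern is again assumed to have no capture groups, and lassign requires Tcl 8.5 or later (as does {*}):

```tcl
set test {stackoverflowa is a best solution finding site stackoverflowb is a best solution finding site stackoverflowc is a best solution finding sitestackoverflowd is a best solution finding sitestackoverflowe is a best solution finding site}

# Hypothetical helper: return the {first last} index pair of the n-th
# (1-based) match, or an empty list if there are fewer than n matches.
proc nthMatchRange {pattern str n} {
    set ranges [regexp -all -inline -indices -- $pattern $str]
    return [lindex $ranges [expr {$n - 1}]]
}

set range [nthMatchRange {stackoverflow} $test 5]
if {[llength $range] == 2} {
    # Split the pair into two variables instead of expanding it with {*}
    lassign $range first last
    puts "match at $first..$last: [string range $test $first $last]"
    # The indices also make it easy to look at the text next to the match
    puts "next character: [string index $test [expr {$last + 1}]]"
}
```

Having the position lets you inspect or replace the surrounding text, which the plain list of matched strings does not allow.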

Peter Lewerin
  • I know, right? I made a deal with my kids that one of the days during the week they stay with me, they get pancakes. Which, of course, means I get pancakes too ;) – Peter Lewerin Jan 23 '14 at 18:18
  • What would the list look like with multiple nested capture groups, e.g. `{(s(t)(ac)k(overf)(low))(.)}`? Or are capture groups allowed? –  Jan 23 '14 at 18:31
  • You get a cyclic result of the whole match (one of `stackoverflowa` ... `stackoverflowe`), then the first submatch (`stackoverflow` in each case), then the four subsubmatches (`t`, `ac`, `overf`, and `low`), and finally in each cycle one of (`a` ... `e`) (the second submatch). – Peter Lewerin Jan 23 '14 at 18:42
  • +1 I was wrongly assuming that OP really meant last, because you know, sometimes what's put in words isn't what's actually meant ^^; – Jerry Jan 23 '14 at 18:51
  • That seems kind of do-able. Would it return sub-matches on something like this: `{(.*?(stackoverflow)(.))+}` or just the last match of inner sub-expression? –  Jan 23 '14 at 18:55
  • @Jerry: True. Going by the question, it was a reasonable assumption that the last match was the one sought after. – Peter Lewerin Jan 23 '14 at 18:55
  • @sln: it returns four elements: the big match from the beginning of the string up to ..."flowe", then the last submatch, then the two submatches from the last submatch. But you know, I'm not tclsh: why don't you try it yourself? – Peter Lewerin Jan 23 '14 at 19:15
  • Sorry, I don't either, just wondering if its engine is of any value in this regard. Apparently, so far, only Dot-Net will record iterative sub-expression matches in arrays from quantified outer groups. It's no big deal, but Dot-Net won't do recursive calls per se and PCRE style won't do intermediate captures. That puts the kibosh on meaningful language parsing. –  Jan 23 '14 at 19:26
  • You can also experiment with passing the `-indices` option to `regexp` so that instead of the string, you get _where_ in the search space the match occurred. Intermixes with `-all` and `-inline`, of course. – Donal Fellows Jan 24 '14 at 00:51
  • The `-indices` option is useful in reality, but seldom yields a well-readable example (string results are obvious in a way that index pairs aren't), so I tend not to mention it when I answer questions like this. Maybe I ought to. – Peter Lewerin Jan 24 '14 at 06:45
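
To illustrate the capture-group behaviour discussed in the comments above, here is a small sketch with a simpler two-group pattern (chosen for this example, not taken from the comments). With -all -inline, each match contributes the whole match followed by one element per group, so the n-th whole match is no longer at index n-1:

```tcl
set test {stackoverflowa is a best solution finding site stackoverflowb is a best solution finding site stackoverflowc is a best solution finding sitestackoverflowd is a best solution finding sitestackoverflowe is a best solution finding site}

# Two capture groups, so each match adds 3 elements to the flat list:
# whole match, group 1, group 2
set flat [regexp -all -inline {(stackoverflow)(.)} $test]

# Walk the flat list three elements at a time
set suffixes {}
foreach {whole word suffix} $flat {
    lappend suffixes $suffix
}
puts $suffixes            ;# a b c d e

# The fifth whole match sits at index (5 - 1) * 3
puts [lindex $flat 12]    ;# stackoverflowe
```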