Finding the last occurrence of a word

Question

I have the following string:

<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind

I want to find the last "SEM" start tag before the "PARTITION" tag (the start tag, not the SEM end tag). The result should be:

<SEM>is<I>love</I>, <PARTITION />

I have tried this regular expression:

<SEM>[^<]*<PARTITION[ ]/>

but it only works if the final "SEM" and "PARTITION" tags do not have any other tag between them. Any ideas?

Jon Skeet · Answer 1 · 2008-11-25T11:38:49.330

7

Use String.IndexOf to find PARTITION and String.LastIndexOf to find SEM?

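// Locate the PARTITION tag first.
int partitionIndex = text.IndexOf("<PARTITION");
// Then search backward from that position for the last <SEM> start tag.
int semIndex = text.LastIndexOf("<SEM>", partitionIndex);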
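
For what it's worth, here is a slightly fuller sketch of the same idea, assuming the goal is to pull out the text from the last <SEM> start tag through the end of the PARTITION tag. The names TagFinder and TryExtractLastSem are made up for illustration, and the ordinal comparison and the "not found" handling are my own assumptions, not part of the answer above.

```csharp
using System;

public static class TagFinder
{
    // Hypothetical helper: copies the text from the last "<SEM>" before
    // "<PARTITION" through the closing "/>" of the PARTITION tag.
    // Returns false (and an empty result) if any of the pieces is missing.
    public static bool TryExtractLastSem(string text, out string result)
    {
        result = string.Empty;

        int partitionIndex = text.IndexOf("<PARTITION", StringComparison.Ordinal);
        if (partitionIndex < 0) return false;

        // LastIndexOf searches backward, starting at partitionIndex.
        int semIndex = text.LastIndexOf("<SEM>", partitionIndex, StringComparison.Ordinal);
        if (semIndex < 0) return false;

        // Find the end of the self-closing PARTITION tag.
        int closeIndex = text.IndexOf("/>", partitionIndex, StringComparison.Ordinal);
        if (closeIndex < 0) return false;

        result = text.Substring(semIndex, closeIndex + 2 - semIndex);
        return true;
    }
}
```

With the string from the question, this should produce <SEM>is<I>love</I>, <PARTITION />.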

edited Nov 25 '08 at 11:38

answered Nov 25 '08 at 10:00

Jon Skeet

That's really great, Jon, but wouldn't it have been much better if you'd helped me with a regex? Thanks anyway. – shabby Nov 25 '08 at 11:22
Why would it have been better? If this method works for what you need, why do you want to muddy the waters with regex? – ZombieSheep Nov 25 '08 at 11:31
Silly question... what if he needs this for a Regex validator. :) – Timothy Khouri Nov 25 '08 at 11:34
I take the approach of only using a regex if I actually *need* a regex. Nothing in the question suggested that was a requirement - only that that was the approach taken so far. – Jon Skeet Nov 25 '08 at 11:38
yeah, TOTALLY don't use regex if you already have a way to do it with tokens etc. – Keng Nov 25 '08 at 15:03

score 3 · Accepted Answer · answered Nov 25 '08 at 11:36

And here's your goofy Regex!!!

(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>

What that says is "While ahead somewhere is a PARTITION tag... but while ahead is NOT another SEM tag... match a SEM tag."

Enjoy!

Here's that regex broken down:

(?=[\s\S]*?\<PARTITION) means "While ahead somewhere is a PARTITION tag"
(?![\s\S]+?\<SEM\>) means "While ahead somewhere is not a SEM tag"
\<SEM\> means "Match a SEM tag"
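
A minimal sketch of how this pattern could be used from C#. The class and method names (LookaheadDemo, FindLastSem) are hypothetical. I have dropped the backslashes before < and >, since they are not needed to match those characters literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Same pattern as in the answer, without the redundant escapes:
    // a PARTITION tag must lie ahead, no further <SEM> may lie ahead,
    // and the match itself is the <SEM> start tag.
    private static readonly Regex LastSem =
        new Regex(@"(?=[\s\S]*?<PARTITION)(?![\s\S]+?<SEM>)<SEM>");

    // Returns the index of the matched <SEM> start tag, or -1 if nothing matches.
    public static int FindLastSem(string text)
    {
        Match m = LastSem.Match(text);
        return m.Success ? m.Index : -1;
    }
}
```

One thing to be aware of: the negative lookahead rules out any later <SEM> anywhere in the rest of the string, not just before PARTITION. So for input such as <SEM>a<PARTITION /><SEM>b the pattern finds nothing, because the first <SEM> is followed by another one and the second has no PARTITION after it.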

score 2 · Answer 3 · answered Nov 26 '08 at 02:26

If you are going to use a regex to find the last occurrence of something then you might also want to use the right-to-left parsing regex option:

new Regex("...", RegexOptions.RightToLeft);
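
To make that concrete, here is a rough sketch under my own assumptions: the pattern itself is not from the answer above, and the names RightToLeftDemo and Find are invented for the example. With RegexOptions.RightToLeft the engine starts at the end of the input, so it meets the PARTITION tag first and the lazy quantifier then stops at the nearest <SEM> to its left.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // Assumed pattern: a <SEM> start tag, as little text as possible,
    // then a self-closing PARTITION tag. Scanned from right to left.
    private static readonly Regex LastSemToPartition =
        new Regex(@"<SEM>[\s\S]*?<PARTITION\s*/>", RegexOptions.RightToLeft);

    // Returns the matched text, or an empty string if nothing matches.
    public static string Find(string text)
    {
        Match m = LastSemToPartition.Match(text);
        return m.Success ? m.Value : string.Empty;
    }
}
```

For the string in the question this should return <SEM>is<I>love</I>, <PARTITION />. Without the RightToLeft option the same pattern would match from the first <SEM> instead.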

answered Nov 26 '08 at 02:26

Pent Ploompuu

score 1 · Answer 4 · edited Nov 25 '08 at 13:19

Here is a solution; I tested it at http://regexlib.com/RETester.aspx

<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>

Since you want the last one, the way to identify it is to accept a <SEM> only when no </SEM> follows it.

I have included "\s*" in case there are spaces inside <SEM> or <PARTITION/>.

Basically, the part that rejects any <SEM> followed by </SEM> is this negative lookahead:

(?!.*</SEM>.*)
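
A small sketch of this pattern in C#, mainly to show where it does and does not match. NegativeLookaheadDemo and Find are hypothetical names; the pattern is copied from the answer unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegativeLookaheadDemo
{
    // A <SEM> is accepted only if no </SEM> appears later on the same line;
    // the match then runs on to the PARTITION tag.
    private static readonly Regex Pattern =
        new Regex(@"<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>");

    // Returns the matched text, or an empty string if nothing matches.
    public static string Find(string text)
    {
        Match m = Pattern.Match(text);
        return m.Success ? m.Value : string.Empty;
    }
}
```

This works for the string in the question because the last <SEM> there is never closed. If the last SEM element were closed before the PARTITION tag, as in <SEM>a</SEM> <SEM>b</SEM> <PARTITION />, every <SEM> would be followed by a </SEM> and the pattern would match nothing.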

edited Nov 25 '08 at 13:19

VonC

answered Nov 25 '08 at 12:32

netadictos

score 0 · Answer 5 · edited Sep 28 '11 at 00:08

Have you tried this:

<SEM>.*<PARTITION\s*/>

Your regular expression was matching anything but "<" after the "SEM" tag. Therefore it would stop matching when it hit the closing "SEM" tag.

edited Sep 28 '11 at 00:08

answered Nov 25 '08 at 09:59

Kent Boogaart

Yes, I have tried this one; it matches from the first SEM to the PARTITION tag. Thanks anyway. – shabby Nov 25 '08 at 11:19

Brent.Longborough · Answer 6 · 2008-11-25T10:43:45.497

0

Bit quick-and-dirty, but try this:

(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)

and take a look at what's in the C#/.net equivalent of $2

The secret lies in the lazy-matching construct (.*?) --- I assume/hope C# supports this.
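
In .NET the counterpart of $2 is Groups[2] on the Match object. A minimal sketch, with GroupDemo and its two methods as made-up names; the pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Group 1 swallows the complete <SEM>...</SEM> elements (plus text between them);
    // group 2 captures the final <SEM> up to the PARTITION tag.
    private static readonly Regex Pattern =
        new Regex(@"(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)");

    // The whole match, which starts at the first <SEM>.
    public static string WholeMatch(string text)
    {
        return Pattern.Match(text).Value;
    }

    // Only the second capture group (the $2 equivalent).
    public static string SecondGroup(string text)
    {
        return Pattern.Match(text).Groups[2].Value;
    }
}
```

On the string from the question, the overall match does begin at the first <SEM>, but the second group should contain only <SEM>is<I>love</I>, <PARTITION. So looking at the whole match rather than the group would give the impression that the pattern matches too much.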

Clearly, Jon Skeet's solution will perform better, but you may want to use a regex (to simplify breaking up the bits that interest you, for example).

(Disclaimer: I'm a Perl/Python/Ruby person myself...)

edited Nov 25 '08 at 10:43

answered Nov 25 '08 at 10:26

Brent.Longborough

C# does support this, but it matches from the first SEM tag to PARTITION. I want the last SEM. Thanks. – shabby Nov 25 '08 at 11:21
Sorry, if it does that, then I suspect it ain't supporting it "properly". Could any C# regexp expert lend a hand, please? – Brent.Longborough Nov 30 '08 at 12:00
