1

I have the following string:

<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind

I want to find the last "SEM" start tag before the "PARTITION" tag (the start tag, not the SEM end tag). The result should be:

<SEM>is<I>love</I>, <PARTITION />

I have tried this regular expression:

<SEM>[^<]*<PARTITION[ ]/>

but it only works if the final "SEM" and "PARTITION" tags do not have any other tag between them. Any ideas?

shabby

6 Answers

7

Use String.IndexOf to find PARTITION and String.LastIndexOf to find SEM?

int partitionIndex = text.IndexOf("<PARTITION");
int emIndex = text.LastIndexOf("<SEM>", partitionIndex);
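
A fuller sketch of the same idea, in case it helps: it wraps the two calls in a method that also cuts out the substring through the end of the PARTITION tag. The class and method names (IndexOfDemo, LastSemBeforePartition) are made up for illustration, and it assumes the PARTITION tag is self-closing with "/>" as in the question.

```csharp
using System;

static class IndexOfDemo
{
    // Hypothetical helper: returns the text from the last <SEM> before
    // <PARTITION through the closing "/>", or null if a piece is missing.
    public static string LastSemBeforePartition(string text)
    {
        int partitionIndex = text.IndexOf("<PARTITION", StringComparison.Ordinal);
        if (partitionIndex < 0) return null;

        // LastIndexOf searches backward, starting at partitionIndex.
        int semIndex = text.LastIndexOf("<SEM>", partitionIndex, StringComparison.Ordinal);
        if (semIndex < 0) return null;

        // Assumes the PARTITION tag ends with "/>".
        int partitionEnd = text.IndexOf("/>", partitionIndex, StringComparison.Ordinal);
        if (partitionEnd < 0) return null;

        return text.Substring(semIndex, partitionEnd + 2 - semIndex);
    }

    static void Main()
    {
        string text = "<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind";
        Console.WriteLine(LastSemBeforePartition(text));
        // prints: <SEM>is<I>love</I>, <PARTITION />
    }
}
```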
Jon Skeet
  • That's really great, Jon, but wouldn't it have been much better if you'd helped me with a regex? Thanks anyway. – shabby Nov 25 '08 at 11:22
  • Why would it have been better? If this method works for what you need, why do you want to muddy the waters with regex? – ZombieSheep Nov 25 '08 at 11:31
  • Silly question... what if he needs this for a Regex validator. :) – Timothy Khouri Nov 25 '08 at 11:34
  • 1
    I take the approach of only using a regex if I actually *need* a regex. Nothing in the question suggested that was a requirement - only that that was the approach taken so far. – Jon Skeet Nov 25 '08 at 11:38
  • yeah, TOTALLY don't use regex if you already have a way to do it with tokens etc. – Keng Nov 25 '08 at 15:03
3

And here's your goofy Regex!!!

(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>

What that says is "While ahead somewhere is a PARTITION tag... but while ahead is NOT another SEM tag... match a SEM tag."

Enjoy!

Here's that regex broken down:

(?=[\s\S]*?\<PARTITION) means "While ahead somewhere is a PARTITION tag"
(?![\s\S]+?\<SEM\>) means "While there is no other SEM tag anywhere ahead"
\<SEM\> means "Match a SEM tag"
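
For anyone wanting to try it from C#, here is a small sketch (the class name LookaheadDemo is just for illustration). Two things worth noticing: both lookaheads are zero-width, so the match is only the <SEM> tag itself and you would still slice the string from m.Index; and the negative lookahead scans to the end of the input, so a <SEM> that appears after the PARTITION tag stops the pattern from matching at all.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // The pattern from the answer, unchanged.
    public static readonly Regex LastSem = new Regex(
        @"(?=[\s\S]*?\<PARTITION)(?![\s\S]+?\<SEM\>)\<SEM\>");

    static void Main()
    {
        string text = "<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind";

        Match m = LastSem.Match(text);
        Console.WriteLine(m.Success ? $"{m.Value} at index {m.Index}" : "no match");
        // prints: <SEM> at index 45

        // A later <SEM> after the PARTITION tag defeats the negative lookahead.
        Console.WriteLine(LastSem.Match(text + "<SEM>more</SEM>").Success);
        // prints: False
    }
}
```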
Timothy Khouri
2

If you are going to use a regex to find the last occurrence of something then you might also want to use the right-to-left parsing regex option:

new Regex("...", RegexOptions.RightToLeft);
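
One way this could look for the question's string, as a sketch rather than a tested production pattern: with RegexOptions.RightToLeft the engine works backward from the end of the input, so the lazy [\s\S]*? grows leftward from the PARTITION tag and stops at the nearest preceding <SEM>. The pattern and the class name RightToLeftDemo are my own assumptions, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class RightToLeftDemo
{
    // Illustrative pattern: nearest <SEM> to the left of a self-closing
    // PARTITION tag, matched while scanning right to left.
    public static readonly Regex LastSemToPartition = new Regex(
        @"<SEM>[\s\S]*?<PARTITION\s*/>", RegexOptions.RightToLeft);

    static void Main()
    {
        string text = "<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind";

        Match m = LastSemToPartition.Match(text);
        Console.WriteLine(m.Success ? m.Value : "no match");
        // prints: <SEM>is<I>love</I>, <PARTITION />
    }
}
```

If the input can contain several PARTITION tags, note that a right-to-left scan finds the last one first.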
Pent Ploompuu
1

The solution is this; I have tested it at http://regexlib.com/RETester.aspx

<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>

As you want the last one, the only way to identify it is to require that the text after the <SEM> tag does not contain </SEM>.

I have included "\s*" in case there are some spaces in <SEM> or <PARTITION/>.

Basically, what we do is exclude the word </SEM> with:

(?!.*</SEM>.*)
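
A quick C# check of this pattern, as a sketch (the class name ExcludeClosingTagDemo is illustrative). It matches the question's string because the last <SEM> there is never closed. If a </SEM> appears anywhere later on the same line, even after the PARTITION tag, the lookahead rejects that <SEM> too. Also, "." does not cross newlines unless RegexOptions.Singleline is set.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExcludeClosingTagDemo
{
    // The pattern from the answer.
    public static readonly Regex Pattern = new Regex(
        @"<\s*SEM\s*>(?!.*</SEM>.*).*<\s*PARTITION\s*/>");

    static void Main()
    {
        string text = "<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind";
        Console.WriteLine(Pattern.Match(text).Value);
        // prints: <SEM>is<I>love</I>, <PARTITION />

        // Here the last SEM is closed after the PARTITION tag, so every
        // <SEM> has a </SEM> somewhere ahead and nothing matches.
        string closed = "<SEM>a</SEM> <SEM>b<PARTITION /></SEM>";
        Console.WriteLine(Pattern.Match(closed).Success);
        // prints: False
    }
}
```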
VonC
netadictos
0

Have you tried this:

<SEM>.*<PARTITION\s*/>

Your regular expression was matching anything but "<" after the "SEM" tag. Therefore it would stop matching when it hit the closing "SEM" tag.

Kent Boogaart
0

Bit quick-and-dirty, but try this:

(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)

and take a look at what's in the C#/.NET equivalent of $2 (the second capture group).

The secret lies in the lazy-matching construct (.*?) --- I assume/hope C# supports this.
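
For the record, .NET does support lazy quantifiers, and the counterpart of $2 is Groups[2] on the Match object. A minimal sketch (the class name CaptureGroupDemo is illustrative); note that group 2 ends at "<PARTITION", because the pattern stops there and does not consume the rest of the tag.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureGroupDemo
{
    // The pattern from the answer; group 1 skips over complete
    // <SEM>...</SEM> pairs, group 2 captures the last <SEM> run.
    public static readonly Regex Pattern = new Regex(
        @"(<SEM>.*?</SEM>.*?)*(<SEM>.*?<PARTITION)");

    static void Main()
    {
        string text = "<SEM>electric</SEM> cu <SEM>hello</SEM> rent <SEM>is<I>love</I>, <PARTITION />mind";

        Match m = Pattern.Match(text);
        Console.WriteLine(m.Success ? m.Groups[2].Value : "no match");
        // prints: <SEM>is<I>love</I>, <PARTITION
    }
}
```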

Clearly, Jon Skeet's solution will perform better, but you may want to use a regex (to simplify breaking up the bits that interest you, for example).

(Disclaimer: I'm a Perl/Python/Ruby person myself...)

Brent.Longborough