4

I'm working with a really large spreadsheet in OpenOffice and I've had to learn regular expressions to clean it up.

Right now I'm trying to remove all <span> tags and I've come up with an expression to do so:

(<span.*?>|</span>)

The problem is that OpenOffice doesn't seem to like the question mark (which should make it ungreedy), so when I try to remove the <span> tags, it removes most of my string.

Here is a sample of the data: http://pastebin.com/AKWZJJCv

What is an alternative way of removing the <span> tags that would work in OpenOffice's find and replace?

Sebastien C.
  • If you observe that `.*?` remains greedy, it would point to the fact that the regular expression is not read as a perl-compatible regex (PCRE), but as, for example, Basic/Extended/POSIX regex (none of which know the `?` modifier to non-greedify `.*`) – jørgensen Jan 20 '12 at 19:10
  • However, OpenOffice is Java based. I would be surprised if it did not use the Java regex engine. I wonder what is going on there. – 700 Software Jan 20 '12 at 19:22
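To illustrate the greedy versus non-greedy behaviour discussed in these comments, here is a minimal sketch in Python. It runs on Python's `re` module, not OpenOffice's engine, and the `cell` string is a made-up sample rather than the pastebin data, so treat it only as a picture of what happens when `.*?` is treated like `.*`.

```python
import re

# Hypothetical cell contents; the real data is in the linked pastebin.
cell = '<span style="color: red">Hello</span> <span>world</span>'

# Greedy: ".*" runs to the LAST ">" in the string, so the single match
# starts at the first "<span" and swallows everything after it.
greedy = re.sub(r'(<span.*>|</span>)', '', cell)

# Non-greedy: ".*?" stops at the FIRST ">" after "<span",
# so only the tags themselves are removed.
lazy = re.sub(r'(<span.*?>|</span>)', '', cell)

print(repr(greedy))
print(repr(lazy))
```

If an engine ignores the `?` and behaves like the greedy version, most of the cell disappears, which matches the symptom described in the question.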

2 Answers

2

You could also try (<span[^>]*>|</span>)

antiduh
  • That did the trick. Thank you! If you don't mind me asking, what does `[^>]*` mean? I know that the `[^>]` will match the first `>`, but if the `*` means 0 or more, then why is it needed? –  Jan 20 '12 at 19:36
  • 1
    `[]` is the character class. `[abcd]` says "match exactly one character from the input that is either a, b, c, or d". `[^]` is the negative character class, which says "match any one character that's not in the class". I told it "match any number of characters that are not a '>', then match a '>'." – antiduh Jan 20 '12 at 19:44
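As a rough illustration of why the `*` matters, here is a small Python sketch comparing the pattern with and without it. Python's `re` module stands in for OpenOffice's find and replace, and the `cell` string is a hypothetical example, so this is only a sketch of the idea.

```python
import re

# Hypothetical cell contents, not taken from the linked pastebin.
cell = '<span class="a">x</span> and <span>y</span>'

# "[^>]*" matches zero or more characters that are not ">", so it covers
# both "<span>" (zero characters) and '<span class="a">' (many characters).
with_star = re.sub(r'(<span[^>]*>|</span>)', '', cell)

# "[^>]" alone demands exactly one non-">" character between "<span"
# and ">", so neither opening tag above matches; only "</span>" is removed.
without_star = re.sub(r'(<span[^>]>|</span>)', '', cell)

print(with_star)
print(without_star)
```

Because `[^>]` can never consume a `>`, the match always ends at the tag's own closing bracket, which is why this works even when the engine has no non-greedy quantifier.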
1

Give this a try:

<(\/)?span([a-zA-Z\-\="0-9 ]*)?>

Tested here.

Rick Kuipers