
I have built the following regular expression to fix a big SQL dump that contains invalid tags. This is the search pattern:

\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]

This is the replacement:

[video=youtube;$2$4]$1$3[/video]

So this:

[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]

will become

[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]

It works like a charm in EditPad Pro (Windows), but the result gives me codepage conflicts when I try to import it into my Linux-based MySQL. Since the file comes from a Linux installation, I tried my luck with sed, but it gives me nothing but errors. Obviously it has a different way of building regular expressions.

It is quite urgent to do the substitutions, so I have no time to read the sed manual.

Can you give me a hand migrating my regular expressions to a sed-friendly format?

Thanks in advance!

UPDATE: I added the escape characters proposed:

\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]

but I still get errors: Unknown command: ')'

Pytzamarama
  • What `sed` command did you try? – sarnold Jan 31 '11 at 09:23
  • These are definitely _not_ sed-compatible regular expressions – thkala Jan 31 '11 at 09:27
  • I created a file containing only the search regular expression and executed `sed -f regexpscript.txt mytext.txt`. I get errors. I used the regular expressions I learned at university. I cannot understand why sed does not use the standard notation. Pity :( – Pytzamarama Jan 31 '11 at 09:41
  • `sed` is older than you are, @Pytzamarama, and has worked fine for all those years. It uses a particular set of regular expressions. Since you've not shown us exactly the file you are using, nor how you invoke it, there could be a variety of issues with what you've written vs what you need to write. In particular, if you have `s/a/b/`, it is crucial that neither the search regex nor the replacement regex (a and b) contains an unescaped slash. You can, however, use an arbitrary character as the delimiter; try ^G (control-G) for example; that won't appear in a URL. You could use '%' too. – Jonathan Leffler Jan 31 '11 at 15:05
  • Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. `sed` as standard does not support PCRE. AFAICS, even GNU `sed`, which supports [ERE](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html) as well as BRE (extended and basic regular expressions), does not support PCRE. Or only a more recent version of GNU sed than I find on RHEL 5 Linux supports PCRE. (GNU sed 4.2.1 does not support PCRE.) Welcome to the wonderful world of regexes. – Jonathan Leffler Jan 31 '11 at 15:15
  • I've added an update to my answer, there were a couple of unescaped `)`'s in your regex. – ocodo Jan 31 '11 at 20:13
  • Please detail the codepage errors, it's much quicker to fix. – ocodo Feb 01 '11 at 00:52

2 Answers


Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.

Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.

Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.

Jonathan Leffler

sed just has different escaping rules from the regex flavor you're using.

  • `()` must be escaped as `\(` `\)` for grouping
  • `[]` are not escaped, for character classes
  • `{}` must be escaped as `\{` `\}` for quantifiers (interval expressions)

\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]

I noticed a couple of unescaped )'s on enclosing groups.
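Escaping alone may not be enough here, because a few constructs in the pattern have no POSIX equivalent: BRE and ERE have no `(?:...)` non-capturing groups and no lazy quantifiers such as `.+?`, and a backslash inside a bracket expression is literal, so `[^\]]` does not mean "anything but `]`". What follows is only a sketch of one possible translation, assuming a sed that accepts `-E` for extended regular expressions (GNU and BSD sed do; older GNU versions document it as `-r`). The function name `fix_tags_sed` is hypothetical. The two alternatives are split into two `s` commands, `#` is the delimiter so the slashes in the URLs need no escaping, and the lazy `.+?video\]` is replaced by `[^[]*\[/video\]`, which assumes the link text contains no `[`.

```shell
#!/bin/sh
# fix_tags_sed is a hypothetical helper name, not something from the question.
# It reads the dump on stdin and writes the rewritten text to stdout.
fix_tags_sed() {
  # First command: the [ame=...]title[/video] form.
  #   Groups: \1 optional \" , \2 optional duplicate http:// ,
  #           \3 the URL, \4 optional host prefix, \5 the video id.
  #   [^]]* means "anything but ]" in POSIX syntax (the ] comes first).
  # Second command: the [video]URL[/video] form.
  #   Groups: \1 the URL, \2 optional host prefix, \3 the video id.
  sed -E \
    -e 's#\[ame=(\\")?(http://)?(http://(www\.|uk\.|fr\.|il\.|hk\.)?youtube\.com/watch\?v=([^&,"\\]+))[^]]*\][^[]*\[/video\]#[video=youtube;\5]\3[/video]#g' \
    -e 's#\[video\](http://(www\.|uk\.|fr\.|il\.|hk\.)?youtube\.com/watch\?v=([^[&,"\\]+))\[/video\]#[video=youtube;\3]\1[/video]#g'
}

# Sample line from the question; single quotes keep the backslashes intact.
printf '%s\n' '[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]' | fix_tags_sed
```

Since every capturing group counts in ERE, the back-references are renumbered compared with the `$1` to `$4` of the original replacement. It is worth trying the function on a small excerpt of the dump before running it on the whole file.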

ocodo
  • Thanks for your answer! I did it (I updated the first post) but I still get errors: 'Unknown Command ^', 'Unmatched ( or /(' – Pytzamarama Jan 31 '11 at 13:39
  • Sounds like you need to look closely at your regexp and make sure all pairs are closed. – ocodo Jan 31 '11 at 13:41
  • You have to escape slash characters: do something like http:\/\/, or replace the outer delimiters with a # – Foo Bah Jan 31 '11 at 14:53
  • Not all pairs were closed, and I overlooked your `.`'s not being escaped. However, I think you would be better off approaching this problem by fixing the output you get from EditPad; the codepage issue is way less fiddly than fixing your regex. – ocodo Feb 01 '11 at 00:49