Regex to strip BBCode

Question

I need a regular expression to strip out any BBCode in a string. I've got the following (and an array with tags):

new RegExp('\\[' + tags[index] + '](.*?)\\[/' + tags[index] + ']');

It picks up [tag]this[/tag] just fine, but fails when using [url=http://google.com]this[/url].

What do I need to change? Thanks a lot.

So you rather want to remove any tag you have given in the `tags` array. — Gumbo, May 11 '09 at 13:09

score 6 · Answer 1 · answered Sep 27 '09 at 14:47

I came across this thread and found it helpful in getting me on the right track. Here is the one I ended up with after two hours of work (it's my first regex!). It is for JavaScript, and in my tests it worked well for heavily nested and even incorrectly nested strings:

string = string.replace(/\[\/?(?:b|i|u|url|quote|code|img|color|size)*?.*?\]/img, '');

If string = "[b][color=blue][url=www.google.com]Google[/url][/color][/b]" then the new string will be "Google". Amazing.
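
One caveat is worth checking before relying on the pattern in this answer: the `*?` after the tag group makes the whole group optional, and the `.*?` that follows matches anything up to the next `]`, so the tag list never actually restricts the match. A minimal sketch of the difference follows; `stripAnyBracket`, `stripKnownTags` and `TAGS` are illustrative names, and the stricter pattern is only one possible variant, not the answer's own.

```javascript
// Pattern from the answer: the tag group is lazily optional,
// so any bracketed run on a single line is removed.
function stripAnyBracket(text) {
  return text.replace(/\[\/?(?:b|i|u|url|quote|code|img|color|size)*?.*?\]/img, '');
}

// Hypothetical stricter variant: the tag name is required, must end at a
// word boundary, and may be followed by attributes up to the closing bracket.
const TAGS = ['b', 'i', 'u', 'url', 'quote', 'code', 'img', 'color', 'size'];
const KNOWN_TAG = new RegExp('\\[/?(?:' + TAGS.join('|') + ')\\b[^\\]]*\\]', 'gi');

function stripKnownTags(text) {
  return text.replace(KNOWN_TAG, '');
}

const nested = '[b][color=blue][url=www.google.com]Google[/url][/color][/b]';
console.log(stripAnyBracket(nested)); // "Google"
console.log(stripKnownTags(nested));  // "Google"

const prose = 'See item [1] in the list, [b]now[/b].';
console.log(stripAnyBracket(prose));  // "See item  in the list, now."
console.log(stripKnownTags(prose));   // "See item [1] in the list, now."
```

If removing every bracketed run is what you want, the original pattern does that; the stricter form only matters when the text can contain square brackets that are not BBCode.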

Hope someone finds that useful, this was a top match for 'JavaScript RegEx strip BBCode' in Google ;)

thanks - this is the only solution on the page that worked for me. — Neuralrank, Jul 25 '14 at 02:10

score 1 · Answer 2 · answered Jul 25 '12 at 15:26

I had a similar problem, in PHP rather than JavaScript: I had to strip out BBCode [quote] tags and also the quoted text inside them. An added problem is that there is often arbitrary extra material inside the [quote] tag, e.g. [quote:7e3af94210="username"]

This worked for me:

$post = preg_replace('/[\r\n]+/', "\n", $post);
$post = preg_replace('/\[\s*quote.*\][^[]*\[\s*\/quote.*\]/im', '', $post);
$post = trim($post);

Lines 1 and 3 just tidy up extra newlines, including any left over after the regex has run.
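
For readers following the thread in JavaScript, the same idea can be sketched roughly as below. It is a loose port rather than the answer's PHP: `stripQuotes` is a hypothetical helper name, and `[^\]]*` is used in place of `.*` inside the tags so that a match cannot run past the closing bracket.

```javascript
function stripQuotes(post) {
  return post
    // Collapse runs of line breaks into a single "\n".
    .replace(/[\r\n]+/g, '\n')
    // Drop each [quote ...]...[/quote ...] block together with its content.
    // The content part ([^[]*) stops at any "[", so nested quotes are not handled.
    .replace(/\[\s*quote[^\]]*\][^[]*\[\s*\/quote[^\]]*\]/gi, '')
    // Remove whitespace left at either end.
    .trim();
}

const post = '[quote:7e3af94210="username"]old text[/quote:7e3af94210]\r\n\r\nMy reply';
console.log(stripQuotes(post)); // "My reply"
```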

Daniel Brückner · Answer 3 · 2009-05-11T13:09:23.317

1

You have to allow any character other than ']' after the tag name until you find the closing ']'.

new RegExp('\\[' + tags[index] + '[^\\]]*](.*?)\\[/' + tags[index] + ']');

You could simplify this to the following expression.

\[[^\]]*]([^[]*)\[/[^\]]*]

The problem with that is that it will also match [WrongTag]stuff[/WrongTag]. Matching nested tags requires applying the expression multiple times.
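
The "multiple times" remark can be sketched as a loop that keeps unwrapping innermost pairs until the string stops changing. This is an illustration only: `unwrapAll` and `INNERMOST_PAIR` are made-up names, and the pattern assumes the JavaScript spelling `[^\]]` for "any character except `]`" and a forward slash in the closing tag.

```javascript
// One innermost pair: [anything]text-without-brackets[/anything].
// Like the simplified expression, it does not check that the names agree.
const INNERMOST_PAIR = /\[[^\]]*\]([^\[]*)\[\/[^\]]*\]/g;

function unwrapAll(text) {
  let previous;
  do {
    previous = text;
    // Each pass removes one level of nesting and keeps the inner text.
    text = text.replace(INNERMOST_PAIR, '$1');
  } while (text !== previous);
  return text;
}

console.log(unwrapAll('[url=http://google.com]this[/url]')); // "this"
console.log(unwrapAll('[b][i]x[/i] y[/b]'));                 // "x y"
```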

edited May 11 '09 at 13:09

answered May 11 '09 at 12:59

Daniel Brückner

Why should you be at all interested in tag nesting when your goal is to take out any BBcode tags anyway? – Tomalak May 11 '09 at 14:12
[^]] needs escaping to [^\\\]] – Question Mark Sep 27 '09 at 15:11

Tomalak · Answer 4 · 2009-05-11T14:11:17.410

1

To strip out any BBCode, use something like:

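// Join the tag names into one alternation, e.g. "b|i|url".
var alltags = tags.join('|');
// Matches any opening or closing tag from the list, with optional attributes.
var stripbb = new RegExp('\\[/?(' + alltags + ')[^\\]]*\\]', 'g');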

Replace globally with the empty string. No extra loop necessary.
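
A minimal sketch of that approach in JavaScript, assuming `tags` is a plain array of alphanumeric tag names. The `escapeRegExp` helper and the `stripBBCode` name are illustrative additions, and the `\b` after the tag name is an extra guard (not in the answer) so that `b` does not also match the start of an unlisted tag such as `[box]`.

```javascript
// Escape regex metacharacters in a tag name before building the pattern.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function stripBBCode(text, tags) {
  const alltags = tags.map(escapeRegExp).join('|');
  // "g" replaces every occurrence in one call; "i" also accepts [B] or [URL=...].
  const stripbb = new RegExp('\\[/?(?:' + alltags + ')\\b[^\\]]*\\]', 'gi');
  return text.replace(stripbb, '');
}

const tags = ['b', 'i', 'url'];
console.log(stripBBCode('[url=http://google.com]this[/url] is [b]bold[/b]', tags)); // "this is bold"
console.log(stripBBCode('[box]kept[/box]', tags)); // "[box]kept[/box]"
```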

edited May 11 '09 at 14:11

answered May 11 '09 at 13:01

Tomalak

[^\\\]] does not match characters other than ']' but characters other than '\' followed by ']' because you must not escape ']' in the first position. Correct is [^]]. – Daniel Brückner May 11 '09 at 13:12
There is no "followed by" in a character class. If anything, the character class matches everything except "\" and "]". I'll take out the surplus backslash. – Tomalak May 11 '09 at 14:11
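
The disagreement in these comments comes down to regex flavour. In POSIX-style and PCRE-style engines a `]` placed first in a character class is literal, so `[^]]` means "anything but `]`". JavaScript (without the `u` flag) instead reads `[^]` as a complete class that matches any character, followed by a literal `]`. A quick check, written as a sketch for a JavaScript engine such as Node.js:

```javascript
const unescaped = /^[^]]*$/; // parsed as: any one character, then zero or more "]"
const escaped = /^[^\]]*$/;  // zero or more characters other than "]"

console.log(unescaped.test('url=x')); // false
console.log(escaped.test('url=x'));   // true
console.log(unescaped.test('x]]'));   // true
console.log(escaped.test('x]]'));     // false

// Inside a string passed to new RegExp, the backslash itself must be doubled.
const built = new RegExp('^[^\\]]*$');
console.log(built.test('url=x'));     // true
```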

score 1 · Answer 5 · answered May 11 '09 at 13:44

You can check for balanced tags using a backreference:

 new RegExp('\\[(' + tags.join('|') + ')[^\\]]*](.*?)\\[/\\1]');

The real problem is that you can't match arbitrarily nested tags with a regular expression (that's a limit of regular languages). Some languages do allow recursive regular expressions, but those are extensions (which technically make them non-regular, though that doesn't change the name most people use for them).

If you don't care about balanced tags, you can just strip out any tag you find:

 new RegExp('\\[/?(?:' + tags.join('|') + ')[^\\]]*]');
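
To make the difference between the two expressions concrete, here is a small JavaScript sketch. It assumes `tags` is an array of plain tag names and writes the bracket class as `[^\]]`; `unwrapBalanced` and `stripTags` are hypothetical names, and the `\b` and `[\s\S]` details are additions for the example.

```javascript
const tags = ['b', 'i', 'url'];
const names = tags.join('|');

// Balanced form: \1 forces the closing tag to repeat the opening tag's name.
const balanced = new RegExp('\\[(' + names + ')\\b[^\\]]*\\]([\\s\\S]*?)\\[/\\1\\]', 'gi');
// Unbalanced form: remove every opening or closing tag on its own.
const anyTag = new RegExp('\\[/?(?:' + names + ')\\b[^\\]]*\\]', 'gi');

function unwrapBalanced(text) { return text.replace(balanced, '$2'); }
function stripTags(text) { return text.replace(anyTag, ''); }

console.log(unwrapBalanced('[url=http://google.com]this[/url]')); // "this"
console.log(unwrapBalanced('[b]mismatch[/i]'));                   // "[b]mismatch[/i]"
console.log(stripTags('[b]mismatch[/i]'));                        // "mismatch"
// A single pass of the balanced form only removes the outer pair.
console.log(unwrapBalanced('[b][i]x[/i][/b]'));                   // "[i]x[/i]"
```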

Balancing tags is totally irrelevant here. The OP wants the tags removed, not matched. — Tomalak, May 11 '09 at 14:16

score 0 · Answer 6 · answered Sep 27 '09 at 14:51

Remember that many (most?) regex flavours by default do not let the DOT metacharacter match line terminators. This causes a tag like

"[foo]dsdfs
fdsfsd[/foo]"

to fail to match. Either enable DOTALL by adding "(?s)" to your regex (in flavours that support it), or replace the DOT metacharacter in your regex with the character class [\S\s].
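
In JavaScript the usual options are the `[\s\S]` class or, in engines that support ES2018, the `s` (dotAll) flag; the leading `(?s)` switch has historically not been available there. A short sketch using the made-up `foo` tag from the example:

```javascript
const text = '[foo]dsdfs\nfdsfsd[/foo]';

const dot = /\[foo\](.*?)\[\/foo\]/;          // "." stops at the line break
const anyChar = /\[foo\]([\s\S]*?)\[\/foo\]/; // [\s\S] matches every character
const dotAll = /\[foo\](.*?)\[\/foo\]/s;      // "s" flag, ES2018 and later

console.log(dot.test(text));              // false
console.log(anyChar.test(text));          // true
console.log(dotAll.test(text));           // true
console.log(text.replace(anyChar, '$1')); // "dsdfs\nfdsfsd"
```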

Manu · Answer 7 · 2013-03-06T09:22:20.363

0

This worked for me, for every tag name. It also supports strings like '[url="blablabla"][/url]'.

str = str.replace(/\[([a-z]+)(\=[\w\d\.\,\\\/\"\'\#\,\-]*)*( *[a-z0-9]+\=.+)*\](.*?)\[\/\1\]/gi, "$4")

edited Mar 06 '13 at 09:22

answered Mar 04 '13 at 13:41

Manu

score 0 · Answer 8 · answered May 11 '09 at 12:59

I think

new RegExp('\\[' + tags[index] + '(=[^\\]]+)?](.*?)\\[/' + tags[index] + ']');

should do it. You then have to pick group 2 instead of group 1.
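
As a small illustration of the group numbering, assuming a single tag name `url` in place of `tags[index]`:

```javascript
const tag = 'url';
const re = new RegExp('\\[' + tag + '(=[^\\]]+)?](.*?)\\[/' + tag + ']');

// Group 1 is the optional "=value" part, group 2 is the text between the tags.
const withAttr = re.exec('[url=http://google.com]this[/url]');
console.log(withAttr[1]); // "=http://google.com"
console.log(withAttr[2]); // "this"

// Without an attribute, group 1 did not take part in the match.
const plain = re.exec('[url]this[/url]');
console.log(plain[1]); // undefined
console.log(plain[2]); // "this"
```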

answered May 11 '09 at 12:59

rudolfson

[^\\\]] does not match characters other than ']' but characters other than '\' followed by ']' because you must not escape ']' in the first position. Correct is [^]]. – Daniel Brückner May 11 '09 at 13:15
