Regex for extracting the value in between a html tag

Question

I managed to write this regular expression for getting the inner HTML of a td tag:

<td[^>]*>(.*?)<\/td>

It works fine, except that each match also includes the td tags themselves. I just want the innerHTML, not the outerHTML. You can find a demo of my problem here.

Can anyone help me get the text between the td tags?

P.S. I am manipulating a string here, not an HTML element.

[See this answer, you need to access the group](http://stackoverflow.com/a/432503/5527985). If it's multiline content, use `([\s\S]*?)` instead of `(.*?)` because dot will stop at newline. — bobble bubble, Nov 06 '15 at 11:38
@PeeHaa OP clearly states that this has to be done over a string not over a DOM element. And there is nothing wrong in manipulating a html string. — Rajaprabhu Aravindasamy, Nov 06 '15 at 11:40
Yes there is everything wrong with trying to parse html with regex. Convert the string into a dom object and do it proper. — PeeHaa, Nov 06 '15 at 11:40
Hii once try this... `preg_match('/^(.*)/', $value,$match1)`... — phpfresher, Nov 06 '15 at 11:41
It is ok to parse a HTML string with DOM, too. Here is [an example](http://jsfiddle.net/8ncjze9x/1/) from one of my answers. — Wiktor Stribiżew, Nov 06 '15 at 11:41
try this: `preg_match('~<[^>]*>([^<]*)<[^>]*>~s', $value, $match)` — Mayur Koshti, Nov 06 '15 at 11:43
[Here is how you can get `td` innerHTMLs](http://jsfiddle.net/t35u81Le/1/). — Wiktor Stribiżew, Nov 06 '15 at 11:50
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Doro, Nov 06 '15 at 12:15

score 1 · Answer 1 · answered Nov 06 '15 at 12:00

Use the DOM even for parsing HTML strings. HTML can be too tricky for a regex to stay efficient.

var s = 'this is a nice day<table><tr><td>aaaa <b>bold</b></td></tr><tr><td>bbbb</td></tr></table> here.';
var doc = document.createDocumentFragment();
var wrapper = document.createElement('myelt');
wrapper.innerHTML = s;
doc.appendChild( wrapper );
var arr = []; // declared with var to avoid an implicit global
// Only element nodes can be <td>, so restrict the walker to elements
var n, walk = document.createTreeWalker(doc, NodeFilter.SHOW_ELEMENT, null);
while ((n = walk.nextNode())) {
    if (n.nodeName.toUpperCase() === "TD") {
        arr.push(n.innerHTML);
    }
}
// See it works:
console.log(arr); // or...
for (var r = 0; r < arr.length; r++) {
    document.getElementById("r").innerHTML += arr[r] + "<br/>";
}

<div id="r"></div>

SamWhan · Answer 2 · 2015-11-06T12:30:17.930

0

You actually already have the regex you need. You're just confusing matches with captures. Your regex matches the outer HTML, but it captures the inner. Just do a match and get the first capture group. Check it out in this fiddle.

Here's the code:

var s = '<table cellspacing="0px;" cellpadding="8px;"><tr><td align="right" style="padding-right:8px;line-height:18px;vertical-align:top;"><b>Import job summary</b></td><td align="left" style="max-width:300px;line-height:18px;vertical-align:top;"> 5 entries were imported successfully. 0 entries failed to import. </td></tr></table>',
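    re = /<td[^>]*>(.*?)<\/td>/g,
    inner = [],
    m;

// With the g flag, s.match(re) returns only whole matches (the outer HTML),
// so loop with exec() and read capture group 1 instead.
while ((m = re.exec(s)) !== null) {
    inner.push(m[1]);
}
if (inner.length === 0) {
    inner = ['No match'];
}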
document.write( 'Inner is:<br>' + inner.join('<br>') );
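
To make the match/capture distinction concrete, here is a minimal sketch. It assumes an environment that supports `String.prototype.matchAll` (ES2020); the shortened sample string and the helper name `innerOfCells` are made up for illustration and are not part of the original fiddle.

```javascript
// Hypothetical sample input, shorter than the string above.
const html = '<table><tr><td class="a"><b>Import job summary</b></td><td> 5 entries </td></tr></table>';
const cellRe = /<td[^>]*>(.*?)<\/td>/g;

// With the g flag, match() returns whole matches only (the outer HTML).
const outer = html.match(cellRe) || [];

// matchAll() yields one result per match; index 1 is the first capture group.
function innerOfCells(str) {
  return Array.from(str.matchAll(cellRe), (m) => m[1]);
}

const inner = innerOfCells(html);
console.log(outer); // each entry still includes <td ...> and </td>
console.log(inner); // each entry is only the content between the tags
```

The `|| []` guard matters because `match()` returns `null`, not an empty array, when nothing matches.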

Regards

edited Nov 06 '15 at 12:30

answered Nov 06 '15 at 12:07

I am not against using regex to handle HTML in *some* cases, but this one is definitely not the one. First, `.*?` does not match newlines. Second, even if you use `[^]*?` backtracking buffer can simply be overrun with long HTML strings. Surely with our small examples it will work, but in real-life code, this might cause issues. – Wiktor Stribiżew Nov 06 '15 at 12:17
@stribizhev I agree, in *most* cases this is true. However, if you are certain the input isn't going to be too complex (e.g. you generate it yourself) **and** performance is an issue, regex might be a solution (imo ;). – SamWhan Nov 06 '15 at 12:34
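
The newline point raised in the comments can be checked with a small sketch. The two-cell sample string and the helper name `captures` are invented for illustration; `[\s\S]*?` is the alternative suggested in the comments on the question. Both patterns are only reasonable for simple, non-nested markup that you control.

```javascript
// Hypothetical input: the second cell spans two lines.
const multiline = '<td>one</td>\n<td>two\nlines</td>';

// "." does not match line terminators (without the s flag),
// so a cell containing a newline cannot be matched.
const dotRe = /<td[^>]*>(.*?)<\/td>/g;
// [\s\S] matches any character, including newlines.
const anyRe = /<td[^>]*>([\s\S]*?)<\/td>/g;

// Collect capture group 1 from every match of a global regex.
function captures(str, re) {
  const out = [];
  let m;
  re.lastIndex = 0; // make sure a reused global regex starts at the beginning
  while ((m = re.exec(str)) !== null) {
    out.push(m[1]);
  }
  return out;
}

const withDot = captures(multiline, dotRe);
const withAny = captures(multiline, anyRe);
console.log(withDot); // the multi-line cell is missing
console.log(withAny); // both cells are captured
```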
