Filtering a `<form>` from HTML text using a regular expression

Question

I am getting a whole HTML page from an ajax request as text (xmlhttp.responseText).

I then filter that text to extract an HTML form, including everything inside that form.

I wrote this regex:

text.match(/(<form[\W\w]*<\/form>)/gim)

As I am not an expert in regex, I can't be sure whether it will work in every scenario and get everything inside the form tag.

Is there a better way to say "everything" in a regex, so that the regex looks like this?

 text.match(/(<form[__everything_syntax_here__]*<\/form>)/gim)

Are you looking for the internal `form` tag stuff, or from `<form>` to `</form>`, or both? – Jan 29 '15 at 08:31
Everything inside the `<form> ...... </form>` tag, and the beginning and end tags too. @sln – Saif, Jan 29 '15 at 08:33
See this: http://stackoverflow.com/questions/4288102/regular-expression-to-grab-form-tag-content-doesnt-work – Suchit kumar, Jan 29 '15 at 08:36
I would discourage you from using regexes for this at all. You can use `responseXML`, or make a `documentFragment` or a hidden element, and approach the response as what it is: an HTML page with a DOM tree. Then you simply get `parsedDom.getElementsByTagName('form')[0]` and do what you want with it. – asontu, Jan 29 '15 at 08:41
@funkwurm Thanks for your concern. I have tried that and it failed: the HTML comes with so many complex tags, meta tags and internal scripts that the default parser of the old browser (currently fighting with stupid IE5 :O ) could not parse them. That's why I am trying to help the old person here. – Saif, Jan 29 '15 at 08:47

score 1 · Answer 1 · answered Jan 29 '15 at 10:18

Try this:

// Return the markup with every <form> element (and its contents) removed.
function stripForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  // Walk backwards: the collection is live and shrinks as nodes are removed.
  var i = forms.length;
  while (i--) {
    forms[i].parentNode.removeChild(forms[i]);
  }
  return div.innerHTML;
}

// Return the inner markup of every <form> element, in document order.
// Note: innerHTML does not include the <form> tags themselves.
function getForm(s) {
  var div = document.createElement('div');
  div.innerHTML = s;
  var forms = div.getElementsByTagName('form');
  var ret = "";
  for (var i = 0; i < forms.length; i++) {
    ret += forms[i].innerHTML;
  }
  return ret;
}

var a = 'before Form <form action="" method="post"> <input type="text" /> <input type="text" /> <input type="text" /> </form><br/> after form';
alert(getForm(a));
alert(stripForm(a));
console.log(stripForm(a));

Yes, that makes good sense. But I think you have noticed that I said the **whole** HTML page is coming as a response, so it may include page-level tags, and even internal `scripts` and `style` too. So I don't think it is a good idea to set the whole text as the innerHTML of a `div` and then parse it. – Saif, Jan 29 '15 at 11:12

score 1 · Accepted Answer · edited May 23 '17 at 12:06

Having to deal with IE 5, you poor soul.

A quick answer to your question, "Is [\W\w] really the best way to match absolutely everything?":

Yes. JavaScript (at least in the engines you have to support here) does not support the `s` modifier that makes `.` match newlines. Writing [\W\w] basically tells the regex: "Match anything that is a word character, or anything that isn't a word character". Every character falls into one of those two categories, so the class matches everything, line breaks included.
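To make that difference concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and no flags beyond `i` are assumed.

```javascript
// A form whose markup spans several lines, as it would in a real page.
var oneForm = '<form id="a">\n<input name="x">\n</form>';

// "." stops at line breaks, so the closing tag is never reached.
var withDot = oneForm.match(/<form.*<\/form>/i);

// [\w\W] matches any character, line breaks included.
var withClass = oneForm.match(/<form[\w\W]*<\/form>/i);

console.log(withDot);      // null
console.log(withClass[0]); // the whole form, tags included
```

The only thing that changes between the two patterns is the "any character" part, which is why the second one succeeds where the first one finds nothing.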

But if you want a more reliable solution that deals with HTML comments and multiple forms on a page, the best approach is something like what is explained in this SO answer, but changed for HTML.
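The multiple-forms problem can be seen with the pattern from the question. This is only a sketch: the sample page is invented, and the lazy quantifier `*?` in the second pattern is my own illustration, not the approach this answer recommends.

```javascript
// Two separate forms with unrelated markup between them.
var twoForms = '<form id="a"></form>\n<p>not part of any form</p>\n<form id="b"></form>';

// The pattern from the question is greedy: it runs from the first
// "<form" to the LAST "</form>", swallowing everything in between.
var greedy = twoForms.match(/(<form[\W\w]*<\/form>)/gim);

// A lazy quantifier stops at the first "</form>" it can reach.
var lazy = twoForms.match(/<form[\W\w]*?<\/form>/gi);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```

The lazy version splits the forms apart, but it still knows nothing about comments, so a commented-out form would be returned like any other.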

This is what I would use:

<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)

Look at the Debuggex Demo to see what matches you actually get. In JavaScript you can then inspect the first capture group. If it is empty, that match only served to skip a commented-out form, as explained here.
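A rough sketch of how the capture-group check could look in JavaScript follows. `extractForms` and `FORM_OR_COMMENT` are hypothetical names, the sample page is invented, and every `/` in the pattern is escaped so that it fits in a regex literal. I have not checked this against IE 5 itself, so treat the lookahead support as an assumption.

```javascript
// Either a whole comment (no capture), or a whole form (capture group 1).
var FORM_OR_COMMENT = /<!--(?:(?!-->)[\w\W])*-->|(<form(?:(?:(?!<\/form>|<!--)[\w\W])|(?:<!--(?:(?!-->)[\w\W])*-->))*<\/form>)/gi;

// Hypothetical helper: returns every form outside a comment, tags included.
function extractForms(text) {
  var forms = [];
  var match;
  FORM_OR_COMMENT.lastIndex = 0; // reset state left over from an earlier call
  while ((match = FORM_OR_COMMENT.exec(text)) !== null) {
    // Group 1 is only filled when the <form> alternative matched;
    // a standalone comment leaves it empty, so it is skipped here.
    if (match[1]) {
      forms.push(match[1]);
    }
  }
  return forms;
}

var page = '<html><body>\n' +
  '<!-- <form id="old"></form> -->\n' +
  '<form id="a">\n<input name="x">\n</form>\n' +
  '<p>between</p>\n' +
  '<form id="b"><!-- note --><input name="y"></form>\n' +
  '</body></html>';

var found = extractForms(page);
console.log(found.length); // 2
```

The commented-out form is consumed by the first alternative and dropped, while the comment inside the second form stays part of that form's text.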

answered Jan 29 '15 at 10:23

asontu


It's even more than I need. Thanks. – Saif Jan 29 '15 at 11:12
Matches `
` and `
and
` – Jan 29 '15 at 18:55
@sln which is indeed why you shouldn't parse HTML with regexes. But if the use-case is 1 user that is stuck on a slow IE 5 and so you can't use DOM manipulations or a server-side solution, this probably does a good job. [This](https://www.debuggex.com/r/EehQ3iScgXRz1_fA) extended version addresses your specific matches but you'll always run into something. For instance nested forms, improper `
"; `?
– asontu Jan 30 '15 at 09:02
