I want to find the index of the </head> tag in the HTML output of a page from within an HTTP module. I am using

HTMLOutput.IndexOf("</head>");

where HTMLOutput is a string parameter that contains the whole HTML output of a particular page. With this method I can find the index of the closing head tag, but only when it is the only one in the output. The problem arises when the page has JavaScript functions that insert dynamic HTML content containing a closing head tag, for example:

newWindow.document.writeln('</head>')

and also when comment lines added to the page by third-party tools contain </head>.
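A minimal sketch of the failure, using a made-up page string (the markup and the FirstHeadClose helper are hypothetical, only there to illustrate the problem):

```csharp
using System;

static class IndexOfDemo
{
    // Hypothetical helper: index of the first "</head>" in the markup, or -1
    public static int FirstHeadClose(string html)
    {
        return html.IndexOf("</head>", StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // Made-up page: a script inside <head> writes its own "</head>"
        string HTMLOutput =
            "<html><head><script>" +
            "newWindow.document.writeln('</head>')" +
            "</script></head><body></body></html>";

        int first = FirstHeadClose(HTMLOutput);

        // Everything after the first match: the real closing tag is still ahead,
        // so the match landed inside the script's string literal
        Console.WriteLine(HTMLOutput.Substring(first + "</head>".Length));
        // prints: ')</script></head><body></body></html>
    }
}
```
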

So I am not able to find the index of the original </head> tag. Does anyone have an idea how to tackle this, maybe a regular expression or something else that can help in this scenario?

Thanks, Mac

  • You have to use an HTML parser for this, not a regex. – Qtax Feb 25 '12 at 10:56
  • @Qtax now I am using Html Agility Pack, can you suggest how to find the tag? – Mac Feb 25 '12 at 12:27
  • You need to write an XPath expression to find a particular element with Html Agility Pack. Visit this page to learn more: http://kossovsky.net/index.php/2009/07/csharp-html-parser-htmlagilitypack/ – Manas Feb 25 '12 at 12:31
  • @Mac, I don't know anything about C# HTML parsers so can't help you there. I'm guessing you could use the parser to find the complete `head` element, get the starting position of it in the input string, and the length of its content, and then use those numbers to compute the position of `</head>` (if the parser can't give it to you directly). – Qtax Feb 25 '12 at 15:10
  • @Mac: why do you want to find the index of the end `</head>` tag? Do you want to inject something inside/outside of it? – Oleks Feb 25 '12 at 21:21
  • Yes, exactly. I need to put a script element inside the head, towards the end. – Mac Feb 26 '12 at 05:42

2 Answers

3

You could use Html Agility Pack to find the <head> tag and then inject your <script> element inside it:

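// requires: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(HTMLOutput);
// SelectSingleNode returns null when the page has no <head> element
var head = doc.DocumentNode.SelectSingleNode("//head");
if (head != null)
    head.AppendChild(HtmlNode.CreateNode("<script>...</script>"));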

To get the result HTML you could just use:

using (StringWriter writer = new StringWriter())
{
    doc.Save(writer);
    HTMLOutput = writer.ToString();
}

Now the HTMLOutput variable holds the modified HTML.

Oleks
  • Do I have to save or something? The above changes are not reflected after the page loads. – Mac Feb 26 '12 at 16:17
  • What if I go ahead with HTMLOutput = doc.DocumentNode.OuterHtml? Is that the same, or does it differ from the code you have updated? Both are working fine for me, thanks to you. – Mac Feb 27 '12 at 10:42
  • @Mac: practically there is no difference. But in some cases `.OuterHtml` could produce incorrect results. See [this](http://stackoverflow.com/a/5912388/102112) answer. – Oleks Feb 27 '12 at 12:28
2

If you can make sure all your JavaScript code lies within the <head> tag, then you can use

HTMLOutput.LastIndexOf("</head>");
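As a rough sketch of how that index could be used to inject a script, assuming nothing after the real closing tag (body scripts, comments) contains "</head>". The helper name InsertBeforeLastHeadClose and the sample page are made up for illustration:

```csharp
using System;

static class HeadInjector
{
    // Hypothetical helper: inserts snippet right before the last "</head>" in html
    public static string InsertBeforeLastHeadClose(string html, string snippet)
    {
        int index = html.LastIndexOf("</head>", StringComparison.OrdinalIgnoreCase);
        // No closing head tag found: return the input unchanged
        return index < 0 ? html : html.Insert(index, snippet);
    }

    static void Main()
    {
        // Made-up page whose head script writes its own "</head>"
        string page =
            "<html><head><script>w.document.writeln('</head>')</script>" +
            "</head><body></body></html>";

        Console.WriteLine(InsertBeforeLastHeadClose(page, "<script src=\"a.js\"></script>"));
    }
}
```

If a script or comment in the body also contains "</head>", the last match is no longer the real tag and this breaks.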

But it is better to use Html Agility Pack and parse your content.

Manas